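A Focused Crawler Algorithm Based on Formal Concept Analysis
Published: 2018-04-27 07:20
Topics: formal concept analysis + concept lattice; Source: Minzu University of China, master's thesis, 2013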
[Abstract]: The rapid growth of the mobile Internet poses a major challenge for search engines. How search engines should adapt to this change and deliver better retrieval services has drawn wide attention, and web crawler algorithms, as a key component of search engines, have become an active research topic. Because general-purpose web crawlers crawl at large scale and collect pages with heterogeneous content, they cannot support crawling that concentrates on the specific information and topics a user is interested in. Topic-oriented web crawlers selectively crawl pages related to a given topic, which reduces the number of pages crawled, improves the accuracy of what is fetched, and meets users' needs for searching on a specific topic.
Formal concept analysis is a data analysis method based on concept lattices. Since the theory was proposed, its intuitive and concise knowledge representation has attracted wide attention from researchers, and it has been applied in software engineering, library and information science, data mining, and many other fields.
By studying the principles of existing topic crawlers, this thesis proposes applying formal concept analysis as a data analysis tool to the algorithms of a topic crawler: the concept lattice is used in topic relevance analysis and in the ranking algorithm, thereby improving the crawler's algorithms. The main research work is as follows.
First, through study of formal concept analysis theory, the thesis examines the relationships among concepts in the concept lattice, which is the core of the theory, and the structure of the lattice itself. This motivates integrating the concept lattice into the algorithms of a topic crawler.
Second, the thesis studies the principles of topic crawlers, including their structure, search strategies, the PageRank ranking algorithm, and topic relevance. It improves a concept-lattice-based topic relevance algorithm and uses it to compute the crawler's topic relevance. It also analyzes the shortcomings of the PageRank ranking algorithm and, on that basis, proposes an improved PageRank algorithm that incorporates the concept lattice.
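For readers unfamiliar with concept lattices, the following is a minimal sketch of how formal concepts are derived from a formal context. It is not the thesis's algorithm: the abstract does not give the relevance or ranking formulas, so none are reproduced here. The toy context (page names and terms) and all function names are assumptions chosen only for illustration. A formal concept is a pair (extent, intent) in which the extent is exactly the set of objects sharing every attribute in the intent, and the intent is exactly the set of attributes shared by every object in the extent.

```python
from itertools import combinations

# Hypothetical toy formal context: objects are pages, attributes are terms.
CONTEXT = {
    "page_a": {"crawler", "search"},
    "page_b": {"crawler", "lattice"},
    "page_c": {"crawler", "search", "lattice"},
    "page_d": {"search"},
}


def common_attributes(objects, context):
    """Derivation operator: attributes shared by every object in `objects`."""
    shared = set().union(*context.values())  # start from all attributes
    for obj in objects:
        shared &= context[obj]
    return frozenset(shared)


def common_objects(attributes, context):
    """Derivation operator: objects that have every attribute in `attributes`."""
    wanted = set(attributes)
    return frozenset(obj for obj, attrs in context.items() if wanted <= attrs)


def formal_concepts(context):
    """Enumerate all (extent, intent) pairs by closing every attribute subset.

    This brute-force enumeration is exponential in the number of attributes
    and is only suitable for small illustrative contexts.
    """
    all_attrs = sorted(set().union(*context.values()))
    concepts = set()
    for size in range(len(all_attrs) + 1):
        for subset in combinations(all_attrs, size):
            extent = common_objects(subset, context)
            intent = common_attributes(extent, context)
            concepts.add((extent, intent))
    return concepts


if __name__ == "__main__":
    # Print concepts from the largest extent (top of the lattice) downward.
    for extent, intent in sorted(formal_concepts(CONTEXT),
                                 key=lambda c: (-len(c[0]), sorted(c[1]))):
        print(sorted(extent), sorted(intent))
```

On this toy context the enumeration yields six concepts. Ordering them by extent inclusion gives the concept lattice, the structure the thesis builds on for topic relevance and ranking.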
[Degree-granting institution]: Minzu University of China
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.3
Article ID: 1809776
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1809776.html