Research on Topical Crawling Techniques in Vertical Search Engines

Published: 2018-04-15 02:15

  Topic: topical crawling + Wikipedia; Source: Chongqing University, master's thesis, 2012


[Abstract]: With the rapid development of Internet technology, traditional general-purpose search engines have gradually revealed shortcomings such as low coverage and inaccurate results. Vertical search engines emerged to meet users' need for precise search. They use topical crawling to collect web pages related to a particular domain (topic) and provide retrieval services for that domain. Topical crawling is therefore the core of a vertical search engine and directly affects its performance. This thesis studies the key techniques of topical crawling: topic description, priority prediction for candidate links, and adaptive crawling strategies. The main contributions are as follows.

(1) A Wikipedia-based topic description method is proposed. A clear and accurate description of the topic is the foundation of a topical crawler, and the way the topic is described also determines how topic relevance is computed. Most existing algorithms describe the topic with a feature set and compute relevance by mechanically matching feature words. This ignores the semantic relationships among feature words and leaves the feature words too sparsely distributed, which weakens the description of the topic. Other methods introduce an ontology or a semantic dictionary to analyze the semantic associations between words, but few ontologies exist, and semantic dictionaries tend to be closed, limited in vocabulary, and slow to update. To address these shortcomings, the thesis uses Wikipedia, which is easy to obtain, promptly updated, and objective in its descriptions, as background knowledge. It builds a topic vector space from the category tree, maps the topic description document to a vector that describes the topic, and introduces semantic analysis into the relevance computation. A disambiguation reference table handles cases where mapping a word to a concept gives an unrealistic result or the word is polysemous. Experiments show that the method significantly improves on traditional methods in both sum of information and precision.

(2) A method for predicting candidate link priority based on web page segmentation is proposed. Priority prediction for candidate links determines the direction and the outcome of topical crawling. Most existing algorithms predict priority from the page content, the anchor text, and the anchor text context. However, pages contain noise such as advertisements, the anchor text context is hard to delimit, and the anchor text itself carries very little information. The thesis therefore first segments each web page into blocks by depth-first traversal, filtering out some noise nodes, and then predicts the priority of candidate links by combining three sources: the page content text, the block text, and the anchor text. Experiments show that introducing page segmentation effectively improves topical crawling performance.

(3) Two adaptive methods are proposed, one based on information gain and one based on the sum-of-information ratio. The initial topic description derived from the concept hierarchy of the category tree is often not objective or accurate enough. After every fixed number of crawled pages, the thesis therefore recomputes over all pages crawled so far using the two adaptive methods and automatically feeds back updated weights for each concept in the topic vector space, refining the topic description. Experiments show that both methods achieve incremental topical crawling. Pages crawled with the information-gain method are more relevant to the topic than pages crawled with the sum-of-information-ratio method, while the sum-of-information-ratio method is more stable overall.

Finally, a prototype topical crawling system is designed and implemented, and a series of experiments on this prototype verify and analyze the methods proposed in the thesis.
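The link-priority idea in contribution (2) can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes plain term-frequency cosine similarity in place of the Wikipedia concept vectors and semantic analysis, takes the block text as already extracted, and uses hypothetical names (`term_vector`, `link_priority`, `rank_links`) and hypothetical weights.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector over lowercase whitespace tokens."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity of two term-frequency vectors; 0.0 if either is empty."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def link_priority(topic, page_text, block_text, anchor_text,
                  weights=(0.2, 0.3, 0.5)):
    """Weighted sum of topic similarity for page, block, and anchor text.

    The weights are illustrative assumptions, not values from the thesis.
    """
    topic_vec = term_vector(topic)
    sources = (page_text, block_text, anchor_text)
    sims = [cosine(topic_vec, term_vector(text)) for text in sources]
    return sum(w * s for w, s in zip(weights, sims))


def rank_links(topic, page_text, links):
    """Return URLs ordered from highest to lowest predicted priority.

    `links` is a list of (url, block_text, anchor_text) tuples.
    """
    scored = [(link_priority(topic, page_text, block, anchor), url)
              for url, block, anchor in links]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [url for _, url in scored]


if __name__ == "__main__":
    topic = "topical crawler vertical search engine"
    page = "notes on building a vertical search engine"
    links = [
        ("http://example.com/ads", "cheap flights and hotel deals", "book now"),
        ("http://example.com/crawler",
         "a topical crawler collects pages for a vertical search engine",
         "topical crawler tutorial"),
    ]
    print(rank_links(topic, page, links))
```

Because every link on a page shares the same page-level score, the block and anchor terms are what separate an on-topic link from an advertisement in a noisy block, which is the motivation the abstract gives for segmenting pages.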
[Degree-granting institution]: Chongqing University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.3




Article ID: 1752062


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1752062.html

