Research on De-duplication and Similarity Judgment of Chinese Agricultural Web Pages
[Abstract]: With the rapid development of network information technology and the construction of agricultural informatization, the level of agricultural information services has improved greatly. The massive volume of repeated agricultural information on the Internet brings convenience to people working in agriculture, but it also makes it harder to obtain useful information quickly and accurately. How to manage duplicate and near-duplicate pages among agricultural web pages has therefore become one of the most important research topics for agricultural vertical search engines. The main work of this thesis covers the following aspects:
1) The key technologies of text de-duplication and similarity judgment are studied in depth: page preprocessing, page text extraction, Chinese word segmentation, feature weighting algorithms, web page de-duplication methods, text similarity calculation methods, and similarity evaluation criteria. Working from an agricultural web page corpus, the thesis focuses on web page de-duplication, feature weighting, and similarity calculation.
2) The criteria for defining Chinese agricultural web pages are studied, and a corpus of Chinese agricultural web pages is constructed. A manually labeled test set is built from 225 groups of web pages, each containing 2 to 14 near-duplicate pages, for a total of 1110 pages.
3) The web pages are preprocessed. The MD5 method is used to remove identical pages from the collection, and the body text of the remaining pages is extracted, segmented with the Paoding (Pao Ding Jie Niu) word segmenter, and stripped of stop words. Three weighting methods, including Boolean weight and term frequency-inverse document frequency (TF-IDF) weight, are used to weight the feature words. Three similarity algorithms (the vector space model, semantic similarity based on HowNet, and latent semantic analysis) are then applied to the three differently weighted feature vector space models, giving 9 groups of similarity judgment results for Chinese agricultural web pages.
4) The precision, recall, and F1 measure of the 9 groups of experiments are analyzed and compared. No single feature weighting algorithm has an absolute advantage: each of the three has its own strengths and weaknesses under different similarity judgment methods. Among the similarity judgment methods, latent semantic analysis gives the best results.
In the experiments, the MD5 method removed 41 pages that were exact duplicates of other pages, and the remaining 1069 pages were studied further by combining the different similarity judgment methods with the different weighting methods. Latent semantic analysis combined with Boolean weights gave the best result for judging the similarity of agricultural web pages, with an F1 measure of 90.1 and a precision of 93.7%.
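The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes the page text is already extracted and segmented into space-separated tokens (a stand-in for the Paoding segmenter), it covers only Boolean weights with vector space model cosine similarity (not the HowNet or latent semantic analysis variants), and all function names and sample pages are hypothetical.

```python
import hashlib
import math


def md5_dedup(pages):
    """Keep the first page for each distinct MD5 digest of the page text."""
    seen = set()
    unique = []
    for text in pages:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def boolean_vector(tokens, vocabulary):
    """Boolean weight: 1.0 if the term occurs in the page, otherwise 0.0."""
    present = set(tokens)
    return [1.0 if term in present else 0.0 for term in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def precision_recall_f1(predicted, actual):
    """Compare predicted near-duplicate pairs with manually labeled pairs."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical, already-segmented sample pages (assumption, not thesis data).
pages = [
    "rice pest control",
    "rice pest control",      # exact duplicate, removed by MD5
    "rice pest prevention",   # near duplicate of the first page
    "wheat market price",
]
unique_pages = md5_dedup(pages)
token_lists = [page.split() for page in unique_pages]
vocabulary = sorted({term for tokens in token_lists for term in tokens})
vectors = [boolean_vector(tokens, vocabulary) for tokens in token_lists]

near_dup_score = cosine_similarity(vectors[0], vectors[1])
unrelated_score = cosine_similarity(vectors[0], vectors[2])
```

In this sketch a pair of pages would be judged near-duplicate when its cosine score exceeds a chosen threshold; the threshold is a free parameter here, since the abstract does not state one.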
[Degree-granting institution]: Xinjiang Agricultural University
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP391.1; TP393.092