Research on Deduplication and Similarity Judgment of Chinese Agricultural Web Pages

Published: 2018-11-13 09:04

[Abstract]: With the rapid development of network information technology, the construction and service level of agricultural informatization have advanced considerably. The massive volume of duplicated agricultural information on the Internet brings convenience to people working in agriculture, but it also makes useful information harder to obtain quickly and accurately. How to manage duplicate and near-duplicate pages among agricultural web pages has therefore become an important research topic for agricultural vertical search engines. The main work of this thesis is as follows:

1) The key techniques for text deduplication and similarity judgment are studied in depth: web page preprocessing, main-content extraction, Chinese word segmentation, feature weighting algorithms, web page deduplication methods, text similarity calculation methods, and similarity evaluation criteria. Working from an agricultural web page corpus, the thesis focuses on web page deduplication, feature weighting algorithms, and similarity calculation methods.

2) The criteria for defining duplicate and near-duplicate pages among Chinese agricultural web pages are studied, and a Chinese agricultural web page corpus is constructed. A manually identified collection serves as the test set: 225 groups of pages, each containing 2 to 14 near-duplicate pages, for a total of 1110 pages.

3) The pages are first preprocessed, and the MD5 method is used to remove pages that are exactly identical. The main content of the remaining pages is then extracted, segmented with the Paoding (庖丁解牛) word segmenter, and stripped of stop words. Feature words are weighted with three schemes: Boolean weight, term frequency (TF) weight, and term frequency-inverse document frequency (TF-IDF) weight. Finally, three similarity algorithms (the vector space model, HowNet-based semantic similarity, and latent semantic analysis) are applied to the feature vector space models built with the three weighting schemes, yielding 9 groups of similarity judgment results for Chinese agricultural web pages.

4) The precision, recall, and F1 measure of the 9 groups of experiments are analyzed and compared. The results show that no feature weighting algorithm has an absolute advantage for similarity judgment; each of the three has strengths and weaknesses depending on the similarity method it is paired with. Among the similarity methods, latent semantic analysis gives the best results. The MD5 method removed 41 pages that exactly duplicated other pages, and the remaining 1069 pages were evaluated with each combination of similarity method and weighting scheme. Latent semantic analysis combined with Boolean weights performs best for judging the similarity of agricultural web pages, with an F1 measure of 90.1% and a precision of 93.7%.
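The steps in item 3 can be illustrated with a minimal Python sketch. It is not the thesis's implementation: whitespace splitting stands in for the Paoding segmenter, stop-word removal is omitted, the TF-IDF variant (raw count times the natural log of N/df) is an assumption, and the function names `md5_dedup`, `weight_vectors`, and `cosine` are hypothetical. Only the vector space model (cosine) similarity is shown; the HowNet-based and latent semantic analysis methods are not covered.

```python
import hashlib
import math
from collections import Counter


def md5_dedup(pages):
    """Keep the first page seen for each distinct MD5 digest of its text."""
    seen, unique = set(), []
    for text in pages:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def weight_vectors(docs, scheme):
    """Turn token lists into {term: weight} dicts.

    scheme is "boolean", "tf", or "tfidf" (assumed form: tf * ln(N / df)).
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        if scheme == "boolean":
            vec = {t: 1.0 for t in tf}
        elif scheme == "tf":
            vec = {t: float(c) for t, c in tf.items()}
        elif scheme == "tfidf":
            vec = {t: c * math.log(n / df[t]) for t, c in tf.items()}
        else:
            raise ValueError(f"unknown weighting scheme: {scheme}")
        vectors.append(vec)
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy corpus (invented for illustration): the second page is an exact copy.
pages = [
    "wheat rust control wheat",
    "wheat rust control wheat",
    "wheat rust prevention",
    "cotton drip irrigation",
]
unique = md5_dedup(pages)
docs = [page.split() for page in unique]  # stand-in for Chinese segmentation

for scheme in ("boolean", "tf", "tfidf"):
    vecs = weight_vectors(docs, scheme)
    print(scheme, round(cosine(vecs[0], vecs[1]), 3), round(cosine(vecs[0], vecs[2]), 3))
```

A pair of pages would then be judged near-duplicate when its similarity exceeds a chosen threshold; the abstract does not state the threshold used, so none is assumed here. Note that under this TF-IDF variant a term occurring in every document receives weight zero.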
[Degree-granting institution]: Xinjiang Agricultural University
[Degree level]: Master's
[Year degree awarded]: 2014
[Classification number]: TP391.1;TP393.092

Article ID: 2328640

Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2328640.html
