Research on Deduplication Methods for Near-Mirror Web Pages

Published: 2018-05-30 21:27

Topic: near-mirror web pages + Simhash; Source: Donghua University, master's thesis, 2017

[Abstract]: With the rapid development of information technology, web page data on the Internet is growing explosively, and the large number of near-mirror (near-duplicate) web pages has become the biggest obstacle to obtaining useful information quickly. To deal with the many duplicate pages encountered in web search, researchers have proposed a variety of near-mirror web page deduplication algorithms. These achieve good results in ordinary information retrieval, but their resistance to web page noise is unsatisfactory. On highly time-sensitive news pages they often misjudge, and their stability is poor. To address these problems, this thesis explores two Simhash-based web page deduplication algorithms to improve deduplication in web search.

Algorithm one is a Simhash-based near-mirror web page deduplication algorithm that uses long-sentence extraction to address noise sensitivity. The commonly used web page deduplication algorithms all include a feature extraction step, and the noise words that enter this step lower the precision and recall of deduplication. Analysis of web page noise shows that noisy text is generally short. Restricting feature-word segmentation to the long sentences extracted from the page text therefore avoids most of the noise in the page and weakens its adverse effect on the algorithm.

Algorithm two is a Simhash-based near-mirror web page deduplication algorithm that uses special weight ratios to address the frequent misjudgment of time-sensitive news pages. Simhash assigns weights to feature words by simple term-frequency counting. News pages in the same category often have similar text and differ only in time and place, so the feature words that Simhash extracts and their weights are also similar, which ultimately causes misjudgment. The special weighting takes core vocabulary into account: it gives the core words of a news item an additional weight ratio, which strengthens their influence on the text fingerprint, so that two pages whose core words differ substantially can be told apart.

Finally, in response to practical needs, the two proposed algorithms were applied to the web page deduplication module of a free trade zone enterprise dynamic information system, and this practical use demonstrated that the algorithms are sound and effective.
[Degree-granting institution]: Donghua University
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP393.092
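
The abstract gives no implementation details, so the following is only a minimal Python sketch of how its two ideas could sit on top of a standard Simhash fingerprint: features are taken only from long sentences (algorithm one), and the term-frequency weight of designated core words is multiplied by an extra ratio (algorithm two). The sentence-length cutoff, the boost factor, the regex tokenizer and every function name here are assumptions made for illustration, not the thesis's actual method. Chinese text would also need a real word segmenter in place of the `\w+` tokenizer.

```python
import hashlib
import re
from collections import Counter

FINGERPRINT_BITS = 64
MIN_SENTENCE_CHARS = 30  # assumed cutoff for what counts as a "long" sentence
CORE_WORD_BOOST = 5      # assumed extra weight ratio for core words


def long_sentences(text, min_chars=MIN_SENTENCE_CHARS):
    """Split on sentence-ending punctuation or newlines; keep only long pieces."""
    parts = re.split(r"[.!?。！？\n]+", text)
    return [p.strip() for p in parts if len(p.strip()) >= min_chars]


def token_hash(token):
    """Stable 64-bit hash: the first 8 bytes of the token's MD5 digest."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(text, core_words=frozenset(), boost=CORE_WORD_BOOST):
    """Simhash over tokens from long sentences only.

    Each token's weight is its term frequency; tokens listed in core_words
    (lowercase) have that weight multiplied by boost.
    """
    tokens = []
    for sentence in long_sentences(text):
        tokens.extend(re.findall(r"\w+", sentence.lower()))

    totals = [0] * FINGERPRINT_BITS
    for token, tf in Counter(tokens).items():
        weight = tf * boost if token in core_words else tf
        h = token_hash(token)
        for bit in range(FINGERPRINT_BITS):
            # A set bit adds the weight, a clear bit subtracts it.
            totals[bit] += weight if (h >> bit) & 1 else -weight

    fingerprint = 0
    for bit, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

In use, two pages would be treated as near-mirrors when `hamming_distance(simhash(page_a), simhash(page_b))` falls at or below a chosen threshold; the abstract does not state a threshold, so it would have to be tuned on real data. Because short fragments such as navigation labels never reach the tokenizer, adding them to a page leaves its fingerprint unchanged in this sketch.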


Article ID: 1956877

Article link: http://sikaile.net/guanlilunwen/ydhl/1956877.html
