Research on Content Similarity in the Mobile Internet
Published: 2018-07-15 20:05
[Abstract]: With the development of the Internet, online information has grown explosively. Because of the many mirror sites, reposted pages, and copied pages, the Web is filled with large amounts of similar content, which lowers the quality of search engine results, wastes storage hardware, and degrades the experience of mobile users. The recent growth of the mobile Internet has made the problem more serious. To address the shortage of research on content similarity in the mobile Internet, this thesis focuses on web page main-text extraction and web page similarity computation. For main-text extraction, it first compares statistics-based techniques, visual-block-based techniques, and other extraction techniques, and then proposes a main-text extraction technique based on topic-similarity blocking. For similarity computation, it first compares vector-based, feature-based, page-text-structure-based, and semantics-based similarity techniques, and then proposes a web page similarity algorithm based on feature words. The topic-similarity blocking technique relies on the similarity between the title tag and the content of each block, and extracts the main text of a page by building a web page tree. Experiments show that it achieves high extraction accuracy on complex web pages. The feature-word-based similarity algorithm first extracts a page's feature words and then computes page similarity using locality-sensitive hashing, block lookup, and related techniques. Experiments show that the algorithm improves recall and precision on short-text pages, reduces complexity, and is suitable for large-scale data applications.
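The abstract does not give the details of the feature-word algorithm, so the following is only a minimal sketch of the general idea it names: fingerprint each page from its weighted feature words with a locality-sensitive hash, then use block lookup to find near-duplicate candidates. SimHash is assumed here as the locality-sensitive hash; the 64-bit width, the four 16-bit blocks, the distance threshold of 3, and all names (`simhash`, `BlockIndex`, and so on) are illustrative assumptions, not the thesis's own design.

```python
import hashlib
from collections import defaultdict

BITS = 64  # assumed fingerprint width


def _hash64(word: str) -> int:
    """Stable 64-bit hash of one feature word (first 8 bytes of MD5)."""
    return int.from_bytes(hashlib.md5(word.encode("utf-8")).digest()[:8], "big")


def simhash(weighted_words: dict) -> int:
    """SimHash fingerprint from a {feature_word: weight} mapping."""
    totals = [0.0] * BITS
    for word, weight in weighted_words.items():
        h = _hash64(word)
        for i in range(BITS):
            # Each word votes +weight or -weight on every bit position.
            totals[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def blocks(fingerprint: int, n_blocks: int = 4) -> list:
    """Split a fingerprint into (block_index, block_value) keys."""
    width = BITS // n_blocks
    mask = (1 << width) - 1
    return [(i, (fingerprint >> (i * width)) & mask) for i in range(n_blocks)]


class BlockIndex:
    """Block lookup: fingerprints within max_distance bits of each other
    must agree exactly on at least one block when n_blocks > max_distance
    (pigeonhole), so only pages sharing a block are compared in full."""

    def __init__(self, max_distance: int = 3, n_blocks: int = 4):
        assert n_blocks > max_distance
        self.max_distance = max_distance
        self.n_blocks = n_blocks
        self.table = defaultdict(set)   # block key -> page ids
        self.fingerprints = {}          # page id -> fingerprint

    def add(self, page_id: str, fingerprint: int) -> None:
        self.fingerprints[page_id] = fingerprint
        for key in blocks(fingerprint, self.n_blocks):
            self.table[key].add(page_id)

    def query(self, fingerprint: int) -> list:
        candidates = set()
        for key in blocks(fingerprint, self.n_blocks):
            candidates |= self.table.get(key, set())
        return sorted(
            pid for pid in candidates
            if hamming(self.fingerprints[pid], fingerprint) <= self.max_distance
        )
```

In use, each page's feature words (for example, its top terms with hypothetical TF-IDF weights) would be passed to `simhash`, the result stored with `BlockIndex.add`, and new pages checked with `BlockIndex.query`. The block table keeps lookups close to constant time per page instead of comparing against every stored fingerprint, which is one way such a scheme can scale to large collections.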
[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.1;TP393.092
Article ID: 2125230
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2125230.html