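Research on Web Page Cleaning Techniques in Web Data Fusion
Keywords: web page cleaning; word-leaf ratio; duplicate web pages; topic segmentation; hierarchical retrieval. Source: Central South University, master's thesis, 2014. Document type: degree thesis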
[Abstract]: The Internet contains a large number of duplicate web pages and a great deal of web page noise, so users may need more time than expected to find the information they want. This content must be cleaned before Web data fusion is used to present the required information to users.
Exploiting the hierarchical structure of page markup and the characteristics of a page's main text, this thesis cleans web page noise with a method based on the DOM tree and the word-leaf ratio (WLR). All operations are performed on the DOM tree, which preserves the complete structural information of the main text. The statistics kept for each node count only the leaf nodes it contains (all text content resides in leaf nodes), which makes the statistics more precise.
For duplicate page detection, a "segment first, then extract" feature extraction approach is adopted so that the feature items are more representative of the full text. The classical segmentation method TSF is improved: based on the sentence similarity matrix, the block size is set dynamically and topic boundaries are identified automatically, without user involvement, dividing the page text into locally coherent subtopic segments. Key sentences are extracted from each topic segment as that segment's feature items. These feature items follow the shifts between subtopics to some extent and therefore represent the content of a page more completely.
Borrowing the idea behind simHash fingerprint generation, the thesis generates a feature fingerprint for each topic segment and judges the similarity between segments by the Hamming distance between their fingerprints. Before detection, the page library is filtered by number of topic segments and by text length, which reduces the number of pages that must be searched. Drawing on the existing grouped retrieval method, the segment fingerprints are then retrieved hierarchically to improve retrieval efficiency.
Processing web pages with the proposed method can improve the accuracy and recall of both noise cleaning and duplicate page cleaning. By avoiding operations on irrelevant content and repeated processing of the same pages, it saves storage space, improves retrieval performance, reduces the time and space overhead of subsequent processing, and improves the efficiency and accuracy of the whole Web fusion system.
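The fingerprint comparison described in the abstract can be illustrated with a minimal sketch of a generic simHash-style fingerprint and a Hamming-distance check. This is not the thesis's implementation: the function names, the frequency-weighted word tokens, the use of MD5 as the per-token hash, the 64-bit width, and the distance threshold of 3 are all assumptions made for illustration. Chinese text would also need a word segmenter in place of the simple regular expression used here.

```python
import hashlib
import re
from collections import Counter


def simhash(text: str, bits: int = 64) -> int:
    """Return a simHash-style fingerprint of `text` as a `bits`-bit integer.

    Assumption: tokens are lowercase word runs weighted by their frequency.
    `bits` must be a multiple of 8 and at most 128 (the MD5 digest size).
    """
    weights = Counter(re.findall(r"\w+", text.lower()))
    totals = [0] * bits
    for token, weight in weights.items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +weight on bits set in its hash, -weight otherwise.
        for i in range(bits):
            totals[i] += weight if (token_hash >> i) & 1 else -weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(seg_a: str, seg_b: str, max_distance: int = 3) -> bool:
    """Treat two segments as near-duplicates if their fingerprints are close.

    The threshold of 3 is an illustrative assumption, not a value from the thesis.
    """
    return hamming_distance(simhash(seg_a), simhash(seg_b)) <= max_distance
```

In a pipeline like the one the abstract outlines, `simhash` would be applied to each topic segment rather than to the whole page, and pages whose segment count or text length differ too much would be filtered out before any fingerprints are compared.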
[Degree-granting institution]: Central South University
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.092