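Research on Noise Removal for Topic-Oriented Web Pages Based on an Improved DOM Tree
Thesis topics: topic-oriented web pages + DOM tree. Source: master's thesis, Southwest University (西南大學), 2017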
【Abstract】: With the rapid growth of the Internet, the volume of web page data keeps increasing. The data on an ordinary web page can generally be divided into two parts: content blocks and noise blocks. Noise blocks mainly include navigation bars at the top or side of the page, advertising banners around it, and copyright information at the bottom. Noise data accounts for almost half of a typical web page, and that share continues to grow. This growth makes it harder for users to obtain topic-related information and lowers the efficiency with which they can find useful information, so quickly removing noise that is unrelated to the page topic is particularly important.
Web page denoising methods generally fall into three categories: methods based on page templates, methods based on visual information, and methods based on the DOM tree. This thesis denoises topic-oriented web pages using the DOM tree structure. In previous DOM-tree-based work, most researchers first classify DOM tree nodes into types according to preset rules and then decide which nodes are noise from the node type. Classifying nodes this early on the basis of a single factor can misjudge the node type and degrade the subsequent denoising. In addition, an analysis of second-level detail pages from several major Chinese portal sites shows that topic-oriented pages have a prominent topic, a large amount of text, and relatively few images and links. To address the shortcomings of earlier DOM-tree-based research, and drawing on the structural, textual and tag-semantic characteristics of topic-oriented pages, this thesis builds an improved DOM tree model on top of the traditional DOM tree and gives a denoising method for topic-oriented pages based on it. The main contributions are as follows:
(1) HTML tags are divided into topic-block tags and non-topic-block tags according to their relevance to the topic and the granularity of node partitioning. Five factors are considered together: the semantic relevance between a tag and the topic, the link feature value of a node, the text length of a node, the number of pure-text child nodes of a node, and the number of images in a node. While the DOM tree is built, the corresponding custom attributes tagDeg, linkVal, textLen, textNum and picNum are added to each Node in that order.
(2) An improved DOM tree model is proposed. The HTML document is first parsed into a DOM tree, which is then traversed to add the custom attributes to its nodes. When non-topic-block nodes are merged, the values of the newly added attributes tagDeg and linkVal are accumulated as well. The result is an improved DOM tree model that contains only topic-block nodes.
(3) A web page denoising method based on the improved DOM tree model is given. It consists of page preprocessing, construction of the improved DOM tree model, and denoising on the improved DOM tree. In the denoising step, the custom attribute values of each node are compared with preset thresholds to identify and delete noise nodes. Experimental analysis shows that the method denoises topic-oriented web pages well.
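The three steps in the abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the abstract gives no formulas, so the tag weights in `TAG_DEG`, the choice of `BLOCK_TAGS`, the definition of `linkVal` as anchor-text length, the accumulation of all five attributes on merge (the abstract only names tagDeg and linkVal), and the two thresholds in `denoise` are all assumptions made for the example. Page preprocessing (for instance stripping scripts and styles) is omitted.

```python
from html.parser import HTMLParser

VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}
# Assumed split: these count as topic-block tags; all other tags are merged upward.
BLOCK_TAGS = {"html", "body", "div", "article", "section", "ul", "table", "td"}
# Hypothetical tag-to-topic relevance weights (the thesis's values are not given).
TAG_DEG = {"p": 1.0, "h1": 1.0, "h2": 0.8, "span": 0.3, "li": 0.2, "a": 0.0}


class Node:
    def __init__(self, tag, attrs=None):
        self.tag, self.attrs, self.children = tag, dict(attrs or []), []
        self.tagDeg = TAG_DEG.get(tag, 0.1)
        self.linkVal = 0  # assumed: characters of anchor text inside the node
        self.textLen = 0  # characters of text inside the node
        self.textNum = 0  # number of non-empty text nodes inside the node
        self.picNum = 1 if tag == "img" else 0


class TreeBuilder(HTMLParser):
    """Parse HTML into a Node tree and fill in the custom attributes."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            node = self.stack[-1]
            node.textLen += len(text)
            node.textNum += 1
            if any(n.tag == "a" for n in self.stack):
                node.linkVal += len(text)


def merge(node):
    """Fold non-block descendants into their nearest block ancestor."""
    kept = []
    for child in node.children:
        merge(child)
        if child.tag in BLOCK_TAGS:
            kept.append(child)
        else:
            node.tagDeg += child.tagDeg
            node.linkVal += child.linkVal
            node.textLen += child.textLen
            node.textNum += child.textNum
            node.picNum += child.picNum
            kept.extend(child.children)  # nested block nodes move up
    node.children = kept


def build_improved_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    merge(builder.root)
    return builder.root


def denoise(node, max_link_ratio=0.5, min_text_len=20):
    """Prune noise blocks in place; return True if this node should be kept."""
    node.children = [c for c in node.children
                     if denoise(c, max_link_ratio, min_text_len)]
    link_ratio = node.linkVal / node.textLen if node.textLen else 1.0
    own_ok = node.textLen >= min_text_len and link_ratio <= max_link_ratio
    return own_ok or bool(node.children)
```

Calling `root = build_improved_tree(page)` followed by `denoise(root)` leaves a tree of block nodes in which link-heavy or nearly text-free blocks, such as navigation lists and short footers, have been dropped. A block with little text of its own is still kept when one of its child blocks survives, so wrapper elements around the main content are not deleted by accident.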
【Degree-granting institution】: Southwest University (西南大學)
【Degree level】: Master's
【Year degree granted】: 2017
【Classification number】: TP393.092