Research on Noise Removal for Topic-Type Web Pages Based on an Improved DOM Tree
Topic: topic-type web pages + DOM tree; Source: Southwest University, master's thesis, 2017
[Abstract]: With the rapid development of the Internet, the volume of web page data on the Web grows daily. The data on an ordinary web page can generally be divided into two parts: content blocks and noise blocks. Noise blocks mainly include the navigation bars at the top or side of the page, the advertising banners around it, and the copyright information at the bottom. Noise data accounts for almost half of a web page, and this proportion continues to grow. This growth not only makes it harder for users to obtain topic-related information but also lowers the efficiency with which they can search for useful information, so quickly removing noise that is unrelated to a page's topic is particularly important.

Web page denoising methods generally fall into three categories: methods based on page templates, methods based on the visual information of the page, and methods based on the DOM tree. This thesis denoises topic-type web pages based on the DOM tree structure. In previous DOM-tree-based denoising research, most researchers first classified DOM tree nodes into types according to preset rules and then used the node type to decide which nodes were noise. However, assigning node types early on the basis of a single factor can misclassify nodes and so degrade the subsequent denoising. In addition, by analyzing the second-level detail pages of several major Chinese portal sites, this thesis finds that topic-type web pages have a prominent topic, a large amount of text, and relatively few images and links. To address the shortcomings of earlier DOM-tree-based work and to exploit the structural, textual, and tag-semantic characteristics of topic-type pages, this thesis builds an improved DOM tree model on top of the traditional DOM tree and, based on that model, gives a denoising method for topic-type web pages. The main contents of the research are as follows:

(1) HTML tags are divided into topic-block tags and non-topic-block tags according to their relevance to the topic and the granularity of node division. Five factors are considered together: the semantic relevance between a tag and the topic, the link feature value within a node, the text length within a node, the number of pure-text child nodes within a node, and the number of images within a node. When the DOM tree is built, the corresponding custom attributes tagDeg, linkVal, textLen, textNum, and picNum are added in turn to each Node.

(2) An improved DOM tree model is proposed. The HTML document is first parsed into a DOM tree; the tree is then traversed and the custom attributes are added to its nodes in turn. When non-topic-block nodes in the DOM are merged, the values of the newly added attributes tagDeg and linkVal are accumulated as well. The result is an improved DOM tree model that contains only topic-block nodes.

(3) A web page denoising method based on the improved DOM tree model is given. The method consists mainly of page preprocessing, construction of the improved DOM tree model, and denoising on the improved DOM tree. In the denoising step, the custom attribute values of each node are compared with preset thresholds to identify and delete noise nodes, which achieves the denoising of the page.

Finally, experimental analysis shows that the method has a good denoising effect on topic-type web pages.
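The attribute-and-threshold idea in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract gives no formulas or threshold values, so the tag scores in `TAG_DEG`, the set `BLOCK_TAGS`, the definition of linkVal as the share of text that sits inside links, the helper field `link_chars`, and every threshold below are assumptions made for illustration. The merging of non-topic-block nodes is only approximated by aggregating counts upward into the enclosing block, and the accumulation of tagDeg during merging is omitted.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List, Optional

# Hypothetical tag-to-topic relevance scores; the thesis's values are not given.
TAG_DEG = {"article": 1.0, "p": 1.0, "h1": 1.0, "body": 0.5, "div": 0.5,
           "nav": 0.0, "footer": 0.0}
# Hypothetical set of tags treated as blocks that may be deleted as noise.
BLOCK_TAGS = {"body", "div", "article", "section", "nav", "header", "footer"}
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


@dataclass
class Node:
    tag: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    text: str = ""         # only set on "#text" nodes
    link_chars: int = 0    # helper (assumption): characters inside <a> elements
    tagDeg: float = 0.0    # semantic relevance of the tag to the topic
    linkVal: float = 0.0   # link feature value (assumed: share of link text)
    textLen: int = 0       # text length in the subtree
    textNum: int = 0       # number of direct pure-text child nodes
    picNum: int = 0        # number of images in the subtree


class TreeBuilder(HTMLParser):
    """Builds a minimal DOM-like tree; whitespace-only text is skipped."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.cur = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, parent=self.cur)
        self.cur.children.append(node)
        if tag not in VOID_TAGS:          # void elements never get children
            self.cur = node

    def handle_endtag(self, tag):
        n = self.cur                      # climb to the nearest open match
        while n is not None and n.tag != tag:
            n = n.parent
        if n is not None and n.parent is not None:
            self.cur = n.parent

    def handle_data(self, data):
        if data.strip():
            self.cur.children.append(
                Node("#text", parent=self.cur, text=data.strip()))


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def annotate(node):
    """Fill the five custom attributes bottom-up (post-order traversal)."""
    if node.tag == "#text":
        node.textLen = len(node.text)
        return
    node.tagDeg = TAG_DEG.get(node.tag, 0.3)   # 0.3 is an arbitrary default
    node.picNum = 1 if node.tag == "img" else 0
    for child in node.children:
        annotate(child)
        node.textLen += child.textLen
        node.picNum += child.picNum
        node.link_chars += child.link_chars
        if child.tag == "#text":
            node.textNum += 1
    if node.tag == "a":                        # all text in a link is link text
        node.link_chars = node.textLen
    node.linkVal = node.link_chars / node.textLen if node.textLen else 0.0


def is_noise(node, min_tag_deg=0.1, max_link_val=0.5, min_text_len=10):
    """Compare attribute values with assumed thresholds."""
    if node.tagDeg < min_tag_deg or node.linkVal > max_link_val:
        return True
    return node.textLen < min_text_len and node.picNum == 0


def prune(node):
    """Delete block nodes judged to be noise, then recurse into the rest."""
    node.children = [c for c in node.children
                     if not (c.tag in BLOCK_TAGS and is_noise(c))]
    for child in node.children:
        prune(child)


def visible_text(node):
    if node.tag == "#text":
        return node.text
    return " ".join(t for t in map(visible_text, node.children) if t)


SAMPLE = """
<body>
  <nav><a href="/">Home</a><a href="/news">News</a></nav>
  <article><h1>Title of the story</h1>
    <p>Main text of the story with one <a href="/x">link</a> inside.</p>
    <img src="a.png">
  </article>
  <footer>Copyright 2017</footer>
</body>
"""

if __name__ == "__main__":
    root = build_tree(SAMPLE)
    annotate(root)
    prune(root)
    print(visible_text(root))
```

On the made-up sample page, the navigation block is rejected because all of its text is link text, the footer is rejected by its assumed tag score, and the article block is kept. Only block-level nodes are candidates for deletion, so an inline link inside a paragraph survives even though its own linkVal is high.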
[Degree-granting institution]: Southwest University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP393.092
Article ID: 2052092
Article link: http://sikaile.net/guanlilunwen/ydhl/2052092.html