Research on Hadoop-Based Web Page Main-Text Extraction Technology

Published: 2018-07-17 07:51

[Abstract]: With the rapid development of Internet technology and a growing number of users, the volume of web page information has grown explosively, and Web information extraction has become an active research topic. The Web is now an important source of information for users, but because Web content changes dynamically, users often cannot quickly locate the main text of a page within a very large body of online information. Filtering out page noise quickly and accurately, and extracting the information that is useful to the user from a huge pool of Internet resources, remains a difficult problem in the extraction field. The Hadoop-based method for Web page main-text extraction proposed in this thesis is one approach to this problem. The thesis studies how to keep main-text extraction both efficient and accurate when the set of Web pages is massive. The work has two parts. In the first part, the thesis analyzes existing page segmentation methods based on visual information and improves the separator iteration process of the original algorithm, producing page blocks that are semantically more complete and organizing them into a visual block tree. In the second part, it analyzes features of each page block, such as style, content, and word frequency, and identifies the main-text blocks according to their importance. Building on this work and on an analysis of typical system architectures, the thesis designs and implements a Hadoop-based Web page main-text extraction system. Tests on the data sources show that the proposed extraction algorithm achieves good accuracy and high performance, and that the system handles extraction over massive numbers of web pages well. The Hadoop-based extraction method offers a new approach to processing massive data, and the distributed computing model addresses the performance problem well.
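The abstract does not give the scoring formula or the job layout, so the following is only a minimal sketch of the general idea: score each page block by content, style, and word-frequency cues, keep the highest-scoring block as the main text, and wrap that in a per-page map-style function of the kind a MapReduce job could run in parallel. The `Block` fields, the weights in `importance`, and the names `extract_main_text` and `map_page` are all illustrative assumptions, not the thesis's actual algorithm or any Hadoop API.

```python
import re
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class Block:
    """Hypothetical page block, as produced by some earlier segmentation step."""
    text: str          # visible text of the block
    link_chars: int    # characters of that text that sit inside hyperlinks
    font_size: float   # dominant font size in px (a stand-in for "style")


def tokens(s: str) -> List[str]:
    """Lowercased word tokens; a real system would use a proper segmenter."""
    return re.findall(r"\w+", s.lower())


def importance(block: Block, title_terms: Set[str]) -> float:
    """Toy importance score; the weighting is an assumption for illustration."""
    n = len(block.text)
    if n == 0:
        return 0.0
    # Content cue: blocks made mostly of links (menus, footers) score low.
    link_density = min(block.link_chars / n, 1.0)
    # Word-frequency cue: share of block words that also occur in the title.
    words = tokens(block.text)
    overlap = sum(w in title_terms for w in words) / max(len(words), 1)
    # Style cue: larger fonts are weighted up relative to a 12px baseline.
    style = block.font_size / 12.0
    return n * (1.0 - link_density) * (1.0 + overlap) * style


def extract_main_text(blocks: List[Block], title: str) -> str:
    """Return the text of the highest-scoring block, or '' if there are none."""
    title_terms = set(tokens(title))
    best = max(blocks, key=lambda b: importance(b, title_terms), default=None)
    return best.text if best is not None else ""


def map_page(page_id: str, blocks: List[Block], title: str) -> Tuple[str, str]:
    """Map-style step: one (page_id, main_text) pair per input page."""
    return page_id, extract_main_text(blocks, title)
```

Because `map_page` depends only on its own page, pages can be processed independently, which is the property that makes this kind of extraction a natural fit for a distributed map phase.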
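[Degree-granting institution]: Nanjing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP391.1;TP393.09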
