Research on Methods for Extracting Entity Table Information from Web Pages

Published: 2018-03-18 07:29

Topic: ontology generation; Focus: information extraction; Source: Beijing University of Technology, master's thesis, 2016; Document type: degree thesis

[Abstract]: With the rapid growth of the Internet, the volume of information on web pages is increasing exponentially, and browsing page by page can no longer meet users' needs. Information extraction technology emerged in response: instead of requiring users to filter content manually, it helps them obtain valuable information directly from massive amounts of web data. Web information extraction has developed mainly in two directions, wrappers and structure recognition. Wrappers depend heavily on page structure, so they are hard to reuse and generalize poorly. The method in this thesis is a form of structure recognition. It locates and recognizes semi-structured information in web pages well, applies to most web pages, and produces results that can be used directly for ontology generation, which gives it high practical value.

The crawler implemented in the extraction system is an incremental, depth-first focused crawler. It generates crawl tasks from configuration files, with one configuration file per crawl task. A configuration file has a fixed format and set of fields and is written by hand; setting roughly a dozen fields is enough to configure a focused crawl of the content for a specific website, domain, and topic.

After the pages are cleaned, the thesis proposes two entity location algorithms for tables marked up with the TABLE tag: one based on heuristic rules and one based on classifying web page URLs. Drawing on tag features, table structure features, and table content features, it summarizes six rules. The six rules are applied in turn to generate a string, a finite automaton then reads the string, and the state in which the automaton stops determines whether the table is a genuine table. To improve location accuracy, the URL-classification method classifies URLs by category so that pages containing no tables can be discarded. Combining the two methods gives table location a relatively high accuracy. The thesis also formulates heuristic rules for tables that have no TABLE tag and are laid out with special symbols, and proposes a location method combining the DOM tree with heuristic rules for tables that have no TABLE tag and are organized with other tags.

For table structure recognition, the thesis builds a type tree from the differing types of attribute names and attribute values, and determines how the table expands by computing the type difference between cells. It also proposes digitizing the table and determining the expansion direction from the length difference between cells. The two judgments are assigned different weights, and the table is finally classified as expanding horizontally or vertically. The number of rows or columns spanned by the header is determined from the type difference and the structural difference.
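The rule-string and finite-automaton step described in the abstract can be illustrated with a small Python sketch. The abstract does not list the six rules or the automaton's states, so every name here is an assumption made for illustration only: the `Table` fields, the six checks in `RULES`, the 50-character threshold, and a three-state automaton (`F0`, `F1`, `F2`) that simply tracks how many rules have failed so far.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Table:
    """Hypothetical, already-parsed view of one <table> element."""
    rows: List[List[str]]            # cell text, row by row
    has_th: bool = False             # table uses <th> header cells
    has_nested_table: bool = False   # another <table> is nested inside


# Six illustrative rules (NOT the thesis's actual rules): tag, structure,
# and content checks that each return True when the table looks genuine.
RULES = [
    lambda t: len(t.rows) >= 2,                                   # at least two rows
    lambda t: max((len(r) for r in t.rows), default=0) >= 2,      # at least two columns
    lambda t: not t.has_nested_table,                             # no nested table
    lambda t: len({len(r) for r in t.rows}) == 1,                 # same cell count per row
    lambda t: all(len(c) <= 50 for r in t.rows for c in r),       # cells hold short text
    lambda t: bool(t.rows) and (t.has_th or all(c.strip() for c in t.rows[0])),
]


def rule_string(table: Table) -> str:
    """Apply the rules in order; emit '1' for a pass and '0' for a fail."""
    return "".join("1" if rule(table) else "0" for rule in RULES)


# Assumed automaton: the state records the number of failed rules (0, 1, 2+).
TRANSITIONS = {
    ("F0", "1"): "F0", ("F0", "0"): "F1",
    ("F1", "1"): "F1", ("F1", "0"): "F2",
    ("F2", "1"): "F2", ("F2", "0"): "F2",
}
GENUINE_STATES = {"F0", "F1"}   # stopping here means "genuine table"


def is_genuine_table(table: Table) -> bool:
    """Run the automaton over the rule string and inspect the final state."""
    state = "F0"
    for symbol in rule_string(table):
        state = TRANSITIONS[(state, symbol)]
    return state in GENUINE_STATES
```

Under these assumptions, a table that fails at most one rule stops in `F0` or `F1` and is kept, while a layout table that fails two or more rules stops in `F2` and is discarded. A different transition table, for example one that rejects immediately when a mandatory rule fails, would slot into the same structure.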
[Degree-granting institution]: Beijing University of Technology
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP391.1
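The structure-recognition step described in the abstract, a weighted combination of a type-difference judgment and a length-difference judgment, might look roughly like the Python sketch below. The abstract gives no formulas, so the type tree in `PARENT`, the tree-path distance, the normalized vote, the weights `w_type` and `w_len`, and the reading of "vertical" as "records run down the page under a header row" are all assumptions made for illustration. The grid is assumed to be rectangular.

```python
import re

# Hypothetical type tree, stored as child -> parent.
PARENT = {"integer": "number", "decimal": "number",
          "number": "any", "text": "any", "empty": "any"}


def cell_type(cell: str) -> str:
    """Map a cell's text to a leaf of the assumed type tree."""
    s = cell.strip()
    if not s:
        return "empty"
    if re.fullmatch(r"[+-]?\d+", s):
        return "integer"
    if re.fullmatch(r"[+-]?\d+\.\d+", s):
        return "decimal"
    return "text"


def path_to_root(node: str) -> list:
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def type_distance(a: str, b: str) -> int:
    """Number of tree edges between two types (via their lowest common ancestor)."""
    pa, pb = path_to_root(a), path_to_root(b)
    common = next(n for n in pa if n in pb)
    return pa.index(common) + pb.index(common)


def type_diff(a: str, b: str) -> int:
    return type_distance(cell_type(a), cell_type(b))


def length_diff(a: str, b: str) -> int:
    return abs(len(a.strip()) - len(b.strip()))


def mean_adjacent_diff(grid, diff, along_rows: bool) -> float:
    """Average difference between neighbouring cells along rows or down columns."""
    lines = grid if along_rows else [list(col) for col in zip(*grid)]
    diffs = [diff(a, b) for line in lines for a, b in zip(line, line[1:])]
    return sum(diffs) / len(diffs) if diffs else 0.0


def vote(grid, diff) -> float:
    """Value in [-1, 1]; positive when cells vary more across rows than down columns."""
    across = mean_adjacent_diff(grid, diff, True)
    down = mean_adjacent_diff(grid, diff, False)
    total = across + down
    return (across - down) / total if total else 0.0


def expansion_direction(grid, w_type: float = 0.6, w_len: float = 0.4) -> str:
    """Combine both votes with assumed weights and name the expansion direction."""
    score = w_type * vote(grid, type_diff) + w_len * vote(grid, length_diff)
    if score > 0:
        return "vertical"      # columns are homogeneous: records run downward
    if score < 0:
        return "horizontal"    # rows are homogeneous: records run rightward
    return "undetermined"
```

The intuition behind the sketch is that cells belonging to the same attribute tend to share a type and a similar length, so the direction along which neighbouring cells differ least is the direction in which records repeat. Transposing a grid swaps the two averages and therefore flips the result.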

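[Related Literature]

Top 10 related journal articles

1 Jin Yingyun; How to Convert Row Data in a Table into Column Data [J]; Computer Knowledge and Technology; 2002, No. 07

2 (no author listed); Making Good Use of Tables to Simplify Office Work [J]; Computer Fan; 2009, No. 18

3 Jin Yingyun; How to Convert Row Data in a Table into Column Data [J]; Software; 2003, No. 11

4 Chen Guixin; Table Data: Keeping Track on Every Page [J]; Computer Fan; 2004, No. 24

5 Maomaochong; Quick Tricks for Moving Row Data in Word Tables [J]; PC Fan; 2008, No. 12

6 Ruan Huining; Techniques for Editing and Processing Data in Tables [J]; Science-Technology & Publication; 2011, No. 07

7 Xu Qun; Implementation of a General-Purpose Table Generation System [J]; Computer CD Software and Applications; 2012, No. 18

8 Zhang Ping, Huang Shangkang, Pan Baochang; A Method for Recognizing and Processing Complex Tables [J]; Journal of Electronics; 1994, No. 03

9 Liang Hong, Li Tianmu; A General-Purpose Automatic Table Processing System [J]; Journal of Yunnan University (Natural Sciences Edition); 1995, No. 01

10 Chang'erduo; Making Tables with Ease [J]; Computer World: Application Digest; 2001, No. 02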

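Top 6 related conference papers

1 Jin Zhong, Li Heng, Li Meng; Implementing Dynamic Tables in ASP.NET [A]; Proceedings of the National ISNBM Academic Exchange Conference and the 20th Anniversary Celebration of Computer Development & Applications [C]; 2005

2 Zhang Hui, Li Xueqing; Model-Driven Table Recognition [A]; Proceedings of the 6th Joint Conference on Harmonious Human Machine Environment (HHME2010), the 19th National Conference on Multimedia Technology (NCMT2010), the 6th National Conference on Human-Computer Interaction (CHCI2010), and the 5th National Conference on Pervasive Computing (PCC2010) [C]; 2010

3 Wang Hui, Yang Kai, Lang Shining, Feng Shaohua, Wang Yuerong; Applied Research on Automatically Generating Tables by Controlling Excel from .NET [A]; New Advances in Computer Research (2010): Proceedings of the 2010 Annual Academic Conference of the Henan Computer Society [C]; 2010

4 Gao Jing; Instructional Design for "Calculating and Sorting Table Data in Word" [A]; Proceedings of the 2012 Hebei Teacher Education Society Forum on Instructional Design [C]; 2012

5 Bai Huimin; An Online Teaching Case for "Graphical Presentation of Table Data" on the Moodle Platform [A]; Proceedings of the 2nd Primary and Secondary School Teachers' Teaching Case Exhibition of the Hebei Teacher Education Society [C]; 2013

6 Yuan Hongyan; Research on Web Table Information Extraction Technology [A]; Proceedings of the 2008 China Academic Forum on Information Technology and Applications (Vol. 1) [C]; 2008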

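Top 4 related major newspaper articles

1 Yi Lijun; How to Get Massive Data into the Computer Automatically [N]; China Computer News; 2007

2 Luo Songlin (Jiangsu); Calculation Methods in Word 2000 Tables [N]; China Computer Education News; 2001

3 Staff reporter Zhang Zhijiang; Chinese and Foreign Management Software Face Off [N]; Communication Information News; 2003

4 Liu Yong (Hebei); Help Me [N]; Computer News; 2004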

Top 1 related doctoral dissertation

1 Shi Guangshun; Automatic Location and Analysis of Table Structure in Document Images [D]; Nankai University; 2003

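Top 10 related master's theses

1 Liu Huaxi; Crowdsourcing-Based Semantic Recovery of Web Tables [D]; Beijing Jiaotong University; 2016

2 Cao Zhenxing; Design and Implementation of a Web Table Data Extraction and Analysis System [D]; Harbin Institute of Technology; 2016

3 Liu Yan; Research on Methods for Extracting Entity Table Information from Web Pages [D]; Beijing University of Technology; 2016

4 Wang Xiaofeng; Collection and Processing of Table Data [D]; Soochow University; 2002

5 Luo Jing; Semantic Recovery of Internet Table Data [D]; Beijing Jiaotong University; 2014

6 Ren Xiangran; Discovery and Labeling of Entity Columns in Web Tables [D]; Beijing Jiaotong University; 2015

7 Ren Hongwei; Discovering Associations Between Web Tables [D]; Beijing Jiaotong University; 2015

8 Pan Xiaoyan; Research on Table Information Extraction from Semi-Structured Text [D]; Harbin Institute of Technology; 2007

9 Si Ming; Research on Table Recognition [D]; Xi'an University of Science and Technology; 2009

10 Tang Haojin; Research and Implementation of a Table Data Extraction Method for PDF Files [D]; Beijing University of Posts and Telecommunications; 2015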


Article ID: 1628573


Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1628573.html
