Research on a Rule Language and Key Technologies for Precise Web Information Extraction from Complex Structures

Published: 2018-03-20 12:01

Topic: precise Web information extraction. Entry point: deep web pages. Source: Nanjing University, master's thesis, 2014. Document type: degree thesis.

[Abstract]: In the Internet era the Web has become the main carrier of massive data and information of all kinds, and the main source from which people obtain useful information. The rapid growth of e-commerce, vertical search, and public-opinion and sentiment analysis on social networks, among many other applications, all depends on Web information extraction to obtain web page data at scale, so research on Web information extraction has both scientific significance and commercial value. An important research problem in this field is how to provide effective extraction rules that conveniently and quickly express the extraction logic for data records in web pages of various complex structures, so that data extraction does not have to be done with hard-coded programs. Existing research has achieved a good deal, but the evolution of Web page technology keeps raising new research questions.

Existing work on Web information extraction and extraction rules has the following main shortcomings:
1) Rule model and system design: the complete extraction process and its model have not been studied in depth, which makes it difficult to handle the full pipeline of deep web page browsing and navigation, data extraction, and data integration.
2) Models of data records with complex structure have not been studied, which limits the range of pages to which web data extraction can be applied.
3) Rule languages: current mainstream extraction rule languages lack the expressive power needed for data extraction from deep Web pages with complex structure.
4) Wrapper failure caused by template updates on dynamic data pages: although rule detection and maintenance have been studied, the rule system itself lacks the means to express rule detection, maintenance, and update.
5) Extraction features: current research relies on the structural and visual features of the page DOM tree. These handle most conventional extraction tasks, but for complex pages that the two kinds of features cannot cover, rule definition and design lack enough features to improve expressive and processing power.
6) The execution efficiency of rule languages has not been analyzed or improved; the existing rule execution process has not been designed or refined with large-scale application scenarios in mind to raise extraction efficiency.

Building on a survey of existing work on Web information extraction rules and addressing the shortcomings above, this thesis carries out research in five areas:
1) It designs a whole-process model of Web information extraction that captures the browsing and navigation logic, the data extraction logic, and the data integration logic of a complete extraction process, and that guides the design of a rule language combining navigation and data integration capabilities.
2) Rule system and model research: to describe the extraction process more clearly and improve the capability of the extraction technology, the thesis studies the models involved in Web information extraction, including a data record model for complex structures, a top-down structured data extraction process model based on the DOM tree, a page rule model, and a wrapper life-cycle model covering rule generation, rule detection, and rule maintenance and update.
3) Based on these basic models, the thesis proposes a hierarchical, comprehensive Web information extraction rule system and language. For each web page it establishes a hierarchical mapping of "data region - data record - data item". At each level it combines the structural, visual, and semantic features of DOM nodes and page elements, and through combinations of extraction predicates it provides functional rules for locating, reorganizing, extracting, and fine-grained filtering of data elements at each granularity, as well as for extraction anomaly detection and maintenance, giving the language strong power to express extraction logic.
4) Following the multi-functional rule model and system, the rule language includes detection rules and maintenance rules that detect whether a page template has changed and locally repair extraction rules that have failed.
5) To extend the expressive power of the rule language, the thesis adds semantics-based extraction rules and integrates semantic elements into the existing rule system, which handles extraction problems that structural and visual features alone struggle with.

On the basis of these key techniques, the thesis implements an extraction rule execution engine and designs and implements a complete prototype system for Web information extraction. Extraction experiments on commercial websites show that the extraction technology and rule language implemented in the thesis have strong expressive and processing power.
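The hierarchical "data region - data record - data item" mapping and the idea of a detection rule can be illustrated with a small sketch. This is not the thesis's rule language, whose syntax is not given in the abstract. The rule dictionary, the `extract` function, and the sample page are all hypothetical, and the sketch uses only structural paths from Python's standard `xml.etree.ElementTree`, not visual or semantic features.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment. Real pages would need an HTML parser.
PAGE = """
<html><body>
  <div id="nav">Home</div>
  <ul class="products">
    <li><span class="name">Kettle</span><span class="price">$19.90</span></li>
    <li><span class="name">Toaster</span><span class="price">$24.50</span></li>
    <li><span class="name">Gift card</span></li>
  </ul>
</body></html>
"""

# Hypothetical rule: one structural path per level (region -> record -> item).
RULE = {
    "region": ".//ul[@class='products']",
    "record": "./li",
    "items": {
        "name": "./span[@class='name']",
        "price": "./span[@class='price']",
    },
    # Items every record must contain; used as a simple detection check.
    "required": ["name"],
}


def extract(page_xml, rule):
    """Apply the rule top-down and return (records, problems)."""
    root = ET.fromstring(page_xml.strip())
    region = root.find(rule["region"])
    if region is None:
        # The region path no longer matches: the template may have changed.
        return [], ["data region not found"]
    records, problems = [], []
    for index, node in enumerate(region.findall(rule["record"])):
        record = {}
        for item_name, path in rule["items"].items():
            item = node.find(path)
            has_text = item is not None and item.text
            record[item_name] = item.text.strip() if has_text else None
        missing = [k for k in rule["required"] if record[k] is None]
        if missing:
            problems.append(f"record {index}: missing {missing}")
        records.append(record)
    return records, problems


records, problems = extract(PAGE, RULE)
for record in records:
    print(record)
print("problems:", problems)
```

Keeping the paths in a data structure instead of in code means that a failed rule can be repaired by editing one path, which is the kind of local maintenance the abstract describes. An optional item such as `price` simply comes back as `None`, while a missing required item or a missing region is reported as a problem.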

[Degree-granting institution]: Nanjing University
[Degree level]: Master's
[Year degree granted]: 2014
[Classification numbers]: TP391.1; TP393.092

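[Related literature]

Related journal articles (top 10)

1 Zhang Zhiqiang, Li Tianzhu, Zhang Bo, Chen Shaofei, Hao Yanan; A comparative study of description languages for information extraction rules based on document structure [J]; Journal of Hebei University (Natural Science Edition); 2004, No. 02

2 Peng Xiangli; Zhu Xiaojun; Zha Zhiyong; Design and implementation of a Web information extraction and presentation system [J]; Electric Power Information Technology; 2012, No. 02

3 Shi Qian; Chen Rong; Lu Mingyu; Implementation of an information extraction system based on rule induction [J]; Computer Engineering and Applications; 2008, No. 21

4 Li Yang; Research on Web-based information extraction [J]; Journal of Jilin Teachers Institute of Engineering and Technology; 2007, No. 12

5 Hua Bolin; Liu Yining; Zheng Yanning; Research on methods for constructing extraction rules for academic definitions [J]; Information Studies: Theory & Application; 2011, No. 12

6 Zhang Zhiyuan; Xu Tao; Feng Xia; Automatic generation of flight information extraction rules [J]; Computer Engineering; 2011, No. 06

7 Li Xiangyang; Dai Jiangshan; Zhang Yafei; An optimization method for Web information extraction rules [J]; Journal of Lanzhou University of Technology; 2006, No. 01

8 Qu Zhuwei; Li Minqiang; An information extraction rule generation method based on data region discovery [J]; Computer Engineering; 2009, No. 22

9 Wei Baozi; Wang Rujing; Research on a distributed information extraction system based on multi-agent technology [J]; Microelectronics & Computer; 2008, No. 06

10 Fang Shaoqing; Hu Xuegang; Research on an information extraction system based on Web mining [J]; Journal of Tongling University; 2010, No. 04

Related conference papers (top 2)

1 Ye Na; Luo Haitao; Zhu Jingbo; Zhang Bin; An automatic learning method for multi-slot information extraction rules based on inductive logic programming [A]; Proceedings of the 8th National Joint Conference on Computational Linguistics (JSCL-2005) [C]; 2005

2 Yang Wenzhu; Xu Linhao; Hao Yanan; Chen Shaofei; Li Tianzhu; Design and implementation of a personalized intelligent Web query assistant [A]; Proceedings of the 19th National Database Conference (Technical Reports) [C]; 2002

Related master's theses (top 10)

1 Wei Wu; Research on a Rule Language and Key Technologies for Precise Web Information Extraction from Complex Structures [D]; Nanjing University; 2014

2 Yu Miao; Research on information extraction and indexing for topic-specific search engines [D]; Chongqing University; 2007

3 Zhuang Zhong; Research on Web information extraction [D]; Hubei University of Technology; 2009

4 Yu Yuan; Design and implementation of the Web information extraction system SEU-WIE [D]; Southeast University; 2006

5 Zhang Xiaohuan; Research on an ontology-based product information extraction system [D]; Tianjin University of Technology; 2009

6 Di Hui; Research on agent-based Web information extraction [D]; Dalian University of Technology; 2004

7 Chen Jianhui; Online employment information extraction based on pattern discovery [D]; Inner Mongolia University of Technology; 2006

8 Guo Dexian; A pattern discovery algorithm and its application to Web information extraction [D]; Jingdezhen Ceramic Institute; 2008

9 Huo Na; Research on information extraction from follow-up reports on emergencies [D]; Shanxi University; 2012

10 Jiang Fangling; Research on entity and relation extraction for a place-name ontology [D]; Tianjin University; 2012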

Article ID: 1638986

Link to this article: http://sikaile.net/guanlilunwen/ydhl/1638986.html
