
Research on WEB Information Extraction Based on XML User-Defined Requirements

Published: 2019-07-01 18:38
[Abstract]: With the rapid growth of the Internet in recent years, it has become a vast platform for publishing and sharing information resources. How to obtain the information users need quickly and efficiently from massive, unstructured or semi-structured data remains a pressing research problem, and WEB information extraction technology arose in response. Researchers have done a great deal of work, but existing techniques still have shortcomings: extraction methods are too specialized, which adds to the user's burden of semantic understanding and makes them hard to use; user feedback is difficult to obtain in time during extraction, which hurts extraction quality; and the more complex the content to be extracted, the less robust the extraction rules become.

To address these problems, and building on an in-depth study of XML, its related standards, and existing XML-based extraction methods, this thesis proposes a WEB information extraction method based on XML user-defined requirements. The work covers the following:

(1) Processing the page to be extracted. The HTML page is preprocessed to filter out irrelevant information and code and converted into a well-formed XML document. So that the user can see the page structure clearly, the XML document is parsed into a visual DOM tree. During node conversion, the type of each node is marked and its path expression is computed, in preparation for sample mapping and extraction-rule generation.

(2) Capturing the user's extraction requirements. A user-defined target describes the hierarchical relationships among the data nodes to be extracted and also serves as the structure of the extraction output. Samples marked by the user are the basis for generating extraction rules: following the mapping rules, each sample is mapped onto the target structure by structure mapping or content mapping, which yields the node type and position information of the data to be extracted.

(3) Constructing the extraction rules. An extraction rule consists of one or more templates, each matching the nodes at one level of the target structure. Templates are built differently depending on whether the root node of the target structure has a structure mapping. If it does, the class attribute of the sample's structure mapping is used to match nodes of the same class across the whole document, relative paths cover the parent-child and ancestor-descendant relationships, and a node template is generated recursively for each level. If it does not, the common path of its child nodes is taken as the starting point for template matching; because that starting position is unique, only the sample data is extracted.

Finally, comparative experiments verify the effectiveness of the method and show that it extracts better than two existing methods, and that the extraction rules remain robust when the structure of the extracted content is complex. A prototype system implementing the method was also built; a demonstration shows that users can observe the whole extraction process directly, check promptly whether the extraction results are accurate, and modify them conveniently.
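
The path-expression and template-matching steps described in the abstract can be illustrated with a small Python sketch. This is not the thesis's implementation: the sample page, the function names (`path_expressions`, `extract_by_class`, `common_path`), and the field names are all hypothetical, and the page is assumed to have already been cleaned into well-formed XML. It only shows one plausible way to compute position-indexed paths, match nodes that share a sample's class attribute, and find the common path of several sample nodes.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, assumed to be already preprocessed into well-formed XML.
PAGE = """
<html><body>
  <div class="item"><h2>Book A</h2><p><span>10</span></p></div>
  <div class="item"><h2>Book B</h2><p><span>12</span></p></div>
  <div class="ad"><h2>Sponsored</h2></div>
</body></html>
""".strip()


def path_expressions(root):
    """Map every element to an absolute, position-indexed path expression."""
    paths = {root: f"/{root.tag}[1]"}

    def walk(node):
        counts = {}  # per-tag position counter among the children of node
        for child in node:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            paths[child] = f"{paths[node]}/{child.tag}[{counts[child.tag]}]"
            walk(child)

    walk(root)
    return paths


def extract_by_class(root, sample_class, field_paths):
    """Match every node with the sample's class, then read fields by relative path."""
    records = []
    for node in root.iter():
        if node.get("class") == sample_class:
            record = {}
            for name, rel in field_paths.items():
                hit = node.find(rel)  # "h2" = child, ".//span" = any descendant
                record[name] = hit.text if hit is not None else None
            records.append(record)
    return records


def common_path(path_list):
    """Longest shared prefix of several path expressions, step by step."""
    split = [p.strip("/").split("/") for p in path_list]
    prefix = []
    for steps in zip(*split):
        if len(set(steps)) != 1:
            break
        prefix.append(steps[0])
    return "/" + "/".join(prefix)


root = ET.fromstring(PAGE)
paths = path_expressions(root)

first_item = root.find("./body/div")
print(paths[first_item])

# Root has a structure mapping: match all nodes of the sample's class.
records = extract_by_class(root, "item", {"title": "h2", "price": ".//span"})
print(records)

# Root has no structure mapping: start from the common path of the samples.
start = common_path([paths[first_item.find("h2")], paths[first_item.find(".//span")]])
print(start)
```

In the class-based case both `item` blocks are matched and the `ad` block is skipped; in the common-path case the starting point resolves to a single node, which is consistent with the abstract's remark that only the sample data is extracted.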
[Degree-granting institution]: Southwest University
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: TP391.1;TP393.092


Article ID: 2508712


Article link: http://sikaile.net/guanlilunwen/ydhl/2508712.html

