
Research on Web Information Extraction Based on XML and User-Defined Requirements

Published: 2019-07-01 18:38

[Abstract]: With the rapid growth of the Internet in recent years, it has become a vast platform for publishing and sharing information resources. How to obtain the information users need quickly and efficiently from massive unstructured or semi-structured data nevertheless remains a pressing research problem, and Web information extraction technology has emerged in response. A great deal of research has been done, but existing techniques still have notable shortcomings: extraction methods are too specialized, which increases the semantic burden on users and makes the methods inconvenient to use; user feedback is hard to obtain promptly during extraction, which degrades the results; and the more complex the content to be extracted, the less robust the extraction rules become. To address these problems, and building on an in-depth study of XML, its related standards, and existing XML-based extraction methods, this thesis proposes a Web information extraction method based on XML and user-defined requirements. The work covers the following aspects.

(1) Processing the page to be extracted. The HTML page is preprocessed to filter out irrelevant information and code and is converted into a well-formed XML document. So that the user can clearly grasp the page structure, the XML document is parsed into a visual DOM tree. During node conversion, the type of each node is marked and its path expression is computed, in preparation for sample mapping and extraction rule generation.

(2) Acquiring the user's extraction requirements. The user defines a target that describes the hierarchical relationships among the data nodes to be extracted; this target also serves as the output structure of the extracted information. The samples marked by the user are the basis for generating extraction rules: following the mapping rules, samples are mapped onto the target structure by structure mapping or content mapping, which yields the node type and position information of the data to be extracted.

(3) Constructing extraction rules. An extraction rule consists of one or more templates, each matching one layer of nodes in the target structure. Templates are constructed differently depending on whether the root node of the target structure has a structure mapping. If it does, the class attribute of the sample's structure mapping is used to match nodes of the same class throughout the document, relative paths are used to cover parent-child and ancestor-descendant relationships, and a template is generated recursively for each layer of nodes. If it does not, the common path of its child nodes is taken as the starting point for template matching; because this starting position is unique, only the sample data is extracted.

Finally, comparative experiments verify the effectiveness of the proposed method and show that it extracts better than two existing methods, and that the extraction rules remain robust when the content to be extracted has a complex structure. A prototype system implementing the method was also built. A demonstration of the system shows that users can observe the whole extraction process intuitively, check promptly whether the extraction results are accurate, and modify them conveniently.
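The abstract gives no code, so the following is only a minimal Python sketch of two ideas it describes: computing a path expression for every node of a well-formed XML page, and matching all nodes that share a sample's class attribute before following relative paths to child or descendant data. The page content and the names `PAGE`, `path_expressions`, and `extract_by_class` are assumptions made for illustration. They are not taken from the thesis, whose actual templates and mapping rules may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, assumed to be already cleaned and converted to XML.
PAGE = """
<html>
  <body>
    <div class="item"><h2>Book A</h2><p><span class="price">10</span></p></div>
    <div class="item"><h2>Book B</h2><p><span class="price">12</span></p></div>
    <div class="ad"><h2>Sponsored</h2></div>
  </body>
</html>
""".strip()


def path_expressions(root):
    """Map every element to an absolute, position-indexed path expression."""
    paths = {}

    def walk(node, path):
        paths[node] = path
        seen = {}  # number of children seen so far, per tag name
        for child in node:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{seen[child.tag]}]")

    walk(root, f"/{root.tag}")
    return paths


def extract_by_class(root, sample_class, relative_paths):
    """Match nodes with the sample's class, then resolve relative paths."""
    records = []
    for node in root.iter():
        if node.get("class") != sample_class:
            continue
        record = {}
        for field, rel in relative_paths.items():
            hit = node.find(rel)  # path is relative to the matched node
            record[field] = hit.text if hit is not None else None
        records.append(record)
    return records


root = ET.fromstring(PAGE)
paths = path_expressions(root)

# "h2" is a parent-child step; ".//span[...]" is an ancestor-descendant step.
rows = extract_by_class(
    root,
    "item",
    {"title": "h2", "price": ".//span[@class='price']"},
)

for row in rows:
    print(row)
```

In this sketch the user-marked sample is reduced to a single class value (`"item"`), and the target structure to a flat mapping from field names to relative paths. A nested target structure would apply the same matching step recursively, one layer at a time.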
[Degree-granting institution]: Southwest University
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: TP391.1;TP393.092


Article ID: 2508712


Article link: http://sikaile.net/guanlilunwen/ydhl/2508712.html
