Research on WEB Information Extraction Based on XML User-Defined Requirements
[Abstract]: With the rapid development of the Internet in recent years, it has become a huge platform for publishing and sharing information resources. However, obtaining the information users need quickly and efficiently from massive unstructured or semi-structured data remains an open problem, and WEB information extraction technology has emerged in response. Scholars have done a great deal of research, but existing techniques still have several shortcomings: the extraction methods are too specialized, which increases the burden of semantic understanding on users and makes the methods hard to use; user feedback is difficult to obtain in time during extraction, which degrades the extraction results; and the more complex the content to be extracted, the less robust the extraction rules become. To address these problems, and building on an in-depth study of XML, its related standards, and existing XML extraction methods, this thesis proposes a WEB information extraction method based on XML user-defined requirements. The research work covers the following aspects:
(1) Processing the page to be extracted. The HTML page is preprocessed to filter out irrelevant information and code, and is then converted into a well-formed XML document. So that the user can clearly grasp the page structure, the XML document is parsed into a visual DOM tree. During node conversion, each node's type is marked and its path expression is computed, in preparation for sample mapping and the generation of extraction rules.
(2) Acquiring the user's extraction requirements. The hierarchical relationship among the data nodes to be extracted is described by defining a target structure, which also serves as the output format of the extracted information. The samples tagged by the user serve as the basis for generating extraction rules; each sample is mapped to the target structure by structure mapping or content mapping according to the mapping rules, which yields the node type information and location information of the data to be extracted.
(3) Constructing the extraction rules. An extraction rule consists of one or more templates, one for each layer of the target structure to be matched. How the templates are built depends on whether the root node of the target structure has a structure mapping. If it does, the class attribute of the sample's structure mapping is used to match nodes of the same class across the whole document, relative paths are used to cover parent-child and ancestor-descendant relationships, and the node template for each layer is generated recursively. If it does not, the common path of its child nodes is taken as the starting point for template matching; because this starting position is unique, only the sample data is extracted.
Finally, comparative experiments verify the effectiveness of the proposed method and show that it extracts better than two existing methods, with extraction rules that remain robust when the structure of the extracted content is complex. A prototype system implementing the method was also built; its demonstration shows that users can observe the whole extraction process intuitively, judge whether the extraction results are accurate, and modify them conveniently.
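The ideas in (1) and (3) can be illustrated with a minimal Python sketch. It is not the thesis's rule format or prototype system: the sample document, the function names `path_expressions` and `extract_by_class`, and the `item` class are all hypothetical, and the page is assumed to have already been converted to well-formed XML. The sketch computes a path expression for every node, then matches all nodes that share a class attribute and reads their fields through relative paths.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, assumed to be already cleaned and converted to XML.
XML_DOC = """
<html>
  <body>
    <div class="item"><h2>Book A</h2><p><span>10</span></p></div>
    <div class="item"><h2>Book B</h2><p><span>12</span></p></div>
    <div class="ad"><h2>Sale</h2></div>
  </body>
</html>
""".strip()


def path_expressions(root):
    """Map every element to an absolute path with per-tag sibling indices."""
    paths = {root: "/" + root.tag}

    def walk(node):
        counts = {}
        for child in node:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            paths[child] = f"{paths[node]}/{child.tag}[{counts[child.tag]}]"
            walk(child)

    walk(root)
    return paths


def extract_by_class(root, class_name, fields):
    """Match all nodes with the given class, then read each field
    through a path relative to the matched node."""
    records = []
    for node in root.iter():
        if node.get("class") == class_name:
            records.append(
                {name: node.findtext(rel) for name, rel in fields.items()}
            )
    return records


root = ET.fromstring(XML_DOC)
paths = path_expressions(root)

# "h2" is a parent-child path; ".//span" is an ancestor-descendant path.
records = extract_by_class(root, "item", {"title": "h2", "price": ".//span"})
print(records)
```

Because matching starts from the class attribute rather than from a fixed absolute path, the same template applies to every `item` node while the `ad` node is skipped. That is the intuition behind matching same-class nodes across the whole document, though the thesis builds its templates recursively per layer of the target structure.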
[Degree-granting institution]: Southwest University (西南大学)
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP391.1; TP393.092