Web Information Extraction Based on Semantic DOM
Published: 2018-11-18 14:30
[Abstract]: With the rapid growth of the Internet, the Web has become the world's largest distributed, shared information resource. Obtaining useful information from a resource of this scale is a pressing problem, which has driven the rapid development of search engine technology. However, the structural complexity, heterogeneity, dynamic nature, and openness of Web pages leave the retrieval performance of current search engines unsatisfactory. To improve retrieval performance, data mining techniques are introduced into search engines to convert Web pages into structured form, and a key research problem in this structuring is Web page information extraction.
To address the complexity and heterogeneity of Web page data, this thesis develops an automatic Web information extraction technique based on a semantic DOM. Within it, three components are studied in depth: template rule extraction, content extraction based on the DOM tree, and content extraction based on the semantic DOM.
The thesis first reviews the history of page information extraction and the state of research in China and abroad, compares representative Web information extraction techniques, and points out their strengths and weaknesses. It then introduces in detail the theory and programming practice of semantic tags, the DOM, and XHTML.
The extraction technique studied here builds on the DOM (Document Object Model) and semantic markup. The DOM is a W3C standard that represents a web document as a tree and provides standard interface methods for operating on page nodes. Semantic markup is a tag-usage convention also advocated by the W3C; it lets more software recognize and parse the data in HTML pages by using tags to state the meaning of the data they contain.
Next, the thesis describes the architecture, design method, and processing flow of information extraction based on the semantic DOM. It first discusses how to normalize HTML and how a DOM parser converts HTML or XHTML text into a DOM tree. It then uses template detection to improve extraction efficiency. Finally, it prunes and denoises the DOM tree according to semantic tags and text weighting, so that useful information can be extracted from the cleaned DOM tree and presented to the user in formatted form.
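The processing flow in the abstract (parse to a DOM tree, prune by semantic tags, denoise by text weighting, extract) can be illustrated with a minimal sketch. This is not the thesis's implementation: the tag sets `NOISE_TAGS` and `VOID_TAGS`, the link-density threshold of 0.5, and every class and function name below are assumptions chosen for illustration, and template detection is omitted. Only Python's standard-library `html.parser` is used.

```python
from html.parser import HTMLParser

# Assumed tag sets for this sketch; the thesis does not list its own.
NOISE_TAGS = {"script", "style", "nav", "footer", "aside", "form"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}

class Node:
    """A minimal DOM-like node; tag is "#text" for text nodes."""
    def __init__(self, tag, parent=None, text=""):
        self.tag, self.parent, self.text = tag, parent, text
        self.children = []

class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML with the standard-library parser."""
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never hold children
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, if there is one.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(Node("#text", self.current, data.strip()))

def prune(node):
    """Semantic-tag pruning: drop subtrees rooted at noise tags."""
    node.children = [c for c in node.children if c.tag not in NOISE_TAGS]
    for child in node.children:
        prune(child)

def text_length(node, in_link=False):
    """Return (all text chars, chars inside <a>) for a subtree."""
    if node.tag == "#text":
        n = len(node.text)
        return n, (n if in_link else 0)
    total = linked = 0
    for child in node.children:
        t, l = text_length(child, in_link or node.tag == "a")
        total, linked = total + t, linked + l
    return total, linked

def denoise(node, max_link_density=0.5):
    """Text weighting: drop element subtrees that are mostly link text."""
    kept = []
    for child in node.children:
        if child.tag not in ("#text", "a"):
            total, linked = text_length(child)
            if total and linked / total > max_link_density:
                continue
            denoise(child, max_link_density)
        kept.append(child)
    node.children = kept

def extract_text(node):
    """Collect the remaining text nodes in document order."""
    if node.tag == "#text":
        return [node.text]
    return [t for child in node.children for t in extract_text(child)]

def extract_content(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root)
    denoise(builder.root)
    return extract_text(builder.root)

SAMPLE = """
<html><head><title>Demo</title><style>p {color: red}</style></head>
<body>
<nav><a href="/">Home</a> <a href="/news">News</a></nav>
<div id="sidebar"><ul><li><a href="/a">Related story one</a></li>
<li><a href="/b">Related story two</a></li></ul></div>
<div id="main"><h1>Semantic DOM</h1>
<p>The DOM represents a page as a tree. See the <a href="/spec">spec</a> for details.</p></div>
<footer>Copyright notice</footer>
</body></html>
"""

if __name__ == "__main__":
    print(extract_content(SAMPLE))
```

In this sketch the navigation bar and footer are removed by tag name alone, while the sidebar is removed because nearly all of its text sits inside links. A real page would likely need a tuned threshold and a more tolerant parser for malformed HTML, which is the role the abstract assigns to the HTML normalization step.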
[Degree-granting institution]: Guangxi Normal University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.1
Article ID: 2340298
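[Citing literature]
Related master's theses (first 2)
1 Yang Xiaohu; Main-Text Information Extraction Algorithm for Web Pages [D]; Guangxi Normal University; 2013
2 Song Chao; Design and Implementation of a Web Publishing System for XML-Format Word-Processing Documents [D]; University of Electronic Science and Technology of China; 2013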
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2340298.html