

Web Information Extraction Based on Semantic DOM

Published: 2018-11-18 14:30
[Abstract]: With the rapid development of the Internet, the Web has become the world's largest distributed, shared information resource. How to obtain useful information from a resource of this size is now a pressing problem, and it has driven the rapid growth of search engine technology. However, because Web pages are structurally complex, heterogeneous, dynamic, and open, the retrieval performance of current search engines remains unsatisfactory. To improve retrieval performance, data mining techniques are introduced into search engines to convert Web pages into structured form, and a key research problem in that structuring is Web page information extraction.

To address the complexity and heterogeneity of Web page data, this thesis proposes an automatic Web information extraction technique based on a semantic DOM. Within it, we study in depth three components: template rule extraction, content extraction based on the DOM tree, and content extraction based on the semantic DOM.

The thesis first reviews the history of page information extraction and the state of research in China and abroad, and compares typical Web information extraction techniques, noting the strengths and weaknesses of each. It then introduces in detail the theory and programming practice of semantic tags, the DOM, and XHTML.

The extraction technique studied here builds on the DOM (Document Object Model) and on semantic markup. The DOM is a W3C standard that describes a Web document as a tree and provides standard interface methods for operating on the nodes of a page. Semantic markup is a tag-usage practice also advocated by the W3C; it lets more software recognize and parse the data in an HTML page by using tags to state the meaning of the data they contain.
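To make the two building blocks concrete, the following is a minimal sketch, not the thesis's implementation. It uses Python's standard `xml.dom.minidom` module, which exposes the W3C DOM interface, to parse a small well-formed XHTML fragment and read out the content of semantically tagged elements. The sample page, the `SEMANTIC_ROLES` mapping, and the helper names `node_text` and `extract_by_semantics` are all assumptions made for illustration.

```python
from xml.dom import minidom

# Illustrative XHTML fragment (must be well-formed for minidom).
XHTML = """<html><body>
<div id="nav"><a href="/">Home</a></div>
<h1>Semantic DOM</h1>
<p>DOM describes a page as a <em>tree</em>.</p>
<address>editor@example.org</address>
</body></html>"""

# Hypothetical mapping from semantic tag names to the role of their content.
SEMANTIC_ROLES = {"h1": "title", "p": "paragraph", "address": "contact"}


def node_text(node):
    """Concatenate the text of all descendant text nodes."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(node_text(child) for child in node.childNodes)


def extract_by_semantics(xhtml):
    """Return (role, text) pairs for semantically tagged elements, in document order."""
    doc = minidom.parseString(xhtml)
    results = []

    def walk(node):
        if node.nodeType == node.ELEMENT_NODE and node.tagName in SEMANTIC_ROLES:
            results.append((SEMANTIC_ROLES[node.tagName], node_text(node).strip()))
            return  # do not descend into an element that was already captured
        for child in node.childNodes:
            walk(child)

    walk(doc.documentElement)
    return results


pairs = extract_by_semantics(XHTML)
for role, text in pairs:
    print(f"{role}: {text}")
```

The navigation `div` carries no entry in the assumed mapping, so its link text is skipped, while the heading, paragraph, and address are returned with their roles. Real pages are rarely well-formed XHTML, which is why a normalization step is needed before a strict DOM parser can be used.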
Next, the thesis describes in detail the architecture, design, and processing flow of information extraction based on the semantic DOM. It first discusses how HTML is normalized and how a DOM parser converts HTML or XHTML text into a DOM tree. Template detection is then used to improve extraction efficiency. Finally, the DOM tree is pruned and denoised according to semantic tags and text weighting, so that useful information can be extracted from the cleaned DOM tree and presented to the user in formatted form.
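The processing flow can be sketched roughly as follows. This is an illustration under stated assumptions, not the method of the thesis: the abstract does not specify the weighting scheme, so a simple link-density ratio stands in for "text weighting", and template detection is omitted. The tag set `NOISE_TAGS`, the `max_link_ratio` threshold, the sample page, and all class and function names are hypothetical. The tolerant `html.parser` module from the Python standard library plays the role of the normalization and DOM-building step.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}
# Assumed set of tags whose subtrees are treated as noise.
NOISE_TAGS = {"script", "style", "nav", "footer", "aside", "form"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children = []  # child Nodes and plain-text strings


class TreeBuilder(HTMLParser):
    """Build a simple tree from possibly malformed HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data.strip())


def measure(node):
    """Return (text_chars, link_chars) for a subtree."""
    total = links = 0
    for child in node.children:
        if isinstance(child, str):
            total += len(child)
        else:
            sub_total, sub_links = measure(child)
            total += sub_total
            links += sub_total if child.tag == "a" else sub_links
    return total, links


def prune(node, max_link_ratio=0.5):
    """Remove noise-tag subtrees and link-dominated blocks in place."""
    kept = []
    for child in node.children:
        if isinstance(child, str):
            kept.append(child)
            continue
        if child.tag in NOISE_TAGS:
            continue
        total, links = measure(child)
        if total and links / total > max_link_ratio:
            continue
        prune(child, max_link_ratio)
        kept.append(child)
    node.children = kept


def collect_text(node):
    """Join the remaining text of a subtree in document order."""
    parts = [c if isinstance(c, str) else collect_text(c) for c in node.children]
    return " ".join(p for p in parts if p)


def extract(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root)
    return collect_text(builder.root)


PAGE = """<html><body>
<div class="menu"><a href="/">Home</a> <a href="/news">News</a></div>
<h1>Web information extraction</h1>
<p>The DOM tree is pruned before the main text is read.
<a href="/more">More</a></p>
<script>var x = 1;</script>
<footer>Copyright notice</footer>
</body></html>"""

main_text = extract(PAGE)
print(main_text)
```

In this sample the menu block is dropped because all of its text sits inside links, the `script` and `footer` subtrees are dropped by tag name, and the heading and paragraph survive. A fixed threshold is a crude choice; a production system would tune it per site or replace it with a learned weighting.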
[Degree-granting institution]: Guangxi Normal University
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP391.1





Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2340298.html

