Design and Implementation of a Web Page Topic Information Extraction System

Published: 2018-05-01 20:42

Topics: web page topic information extraction + web page preprocessing; Source: Harbin Institute of Technology, 2012 master's thesis

[Abstract]: With the explosive growth of information online, the Internet has become an important source of information in daily life. The volume is so large that finding information by hand has become increasingly difficult, and search engines have become indispensable tools. In essence, a search engine uses information to find information, and the first step in using information is understanding it. Most of the information a search engine works with, however, comes from web pages that contain a large amount of noise, so web page information extraction has become a key topic for search engine practitioners.
This thesis implements a general-purpose method for extracting topic information from web pages. Because many pages on the Internet are not strictly well-formed, the method first preprocesses each page: file type identification, encoding handling, script extraction, and fault-tolerant repair and cleaning. Because existing topic information extraction systems do not exploit the structural and visual features of the page itself, the thesis proposes an extraction algorithm that uses visual information and semantic features. The algorithm parses the semi-structured page into a structured DOM (Document Object Model) tree and parses the CSS (Cascading Style Sheets) information at the same time, then annotates ("colors") the DOM nodes with that information to form a DOM tree that carries visual information. Next, the VIPS (Vision-Based Page Segmentation) algorithm divides the page into a hierarchical content tree whose blocks each have their own semantic character. The content blocks are then clustered hierarchically, so that neighboring blocks are merged into the same cluster. Finally, each block is scored for topic relevance using its structural and semantic features, and topic information is extracted and output according to a preset threshold.
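The final step, scoring each content block and keeping those above a threshold, can be sketched as follows. This is only an illustration: the abstract does not name the features or weights the thesis uses, so the `Block` fields, the link-density and title-overlap cues, the equal weights, and the 0.6 threshold are all assumptions. Whitespace tokenization is also a simplification; Chinese text would need a word segmenter.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A content block from page segmentation (fields are hypothetical)."""
    text: str       # all visible text in the block
    link_text: str  # the portion of that text found inside <a> elements


def topic_score(block: Block, title: str) -> float:
    """Combine a structural cue (link density) with a semantic cue (title overlap)."""
    if not block.text:
        return 0.0
    # Structural cue: navigation bars and link lists are mostly anchor text.
    link_density = min(1.0, len(block.link_text) / len(block.text))
    # Semantic cue: fraction of title terms that also appear in the block.
    title_terms = set(title.lower().split())
    block_terms = set(block.text.lower().split())
    overlap = len(title_terms & block_terms) / len(title_terms) if title_terms else 0.0
    # Equal weights are an arbitrary illustrative choice, not taken from the thesis.
    return 0.5 * (1.0 - link_density) + 0.5 * overlap


def extract_topic_blocks(blocks, title, threshold=0.6):
    """Keep only the blocks whose score reaches the preset threshold."""
    return [b for b in blocks if topic_score(b, title) >= threshold]


nav = Block(text="Home News Sports Contact", link_text="Home News Sports Contact")
body = Block(text="A flood warning was issued for the river valley on Monday", link_text="")
kept = extract_topic_blocks([nav, body], title="Flood warning issued")
```

Here the navigation block scores 0.0 (all link text, no title overlap) and is dropped, while the body block scores 1.0 and is kept.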
Experiments on Chinese web pages show an F-measure of 0.93 for extraction from Chinese news pages and 0.84 for general Chinese web pages. These results indicate that the method basically meets the requirements of practical use.
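For readers unfamiliar with the metric, the sketch below computes the standard balanced F-measure, F = 2PR / (P + R), from a set of extracted units and a set of gold-standard units. The abstract does not say what unit the thesis evaluates on (blocks, sentences, or characters) or whether it weights precision and recall differently, so the set-based formulation and the `f_measure` name are assumptions.

```python
def f_measure(extracted: set, gold: set) -> float:
    """Balanced F-measure between extracted units and gold-standard units."""
    correct = len(extracted & gold)
    if correct == 0:
        # Covers empty inputs and the case of no overlap at all.
        return 0.0
    precision = correct / len(extracted)  # share of extracted units that are right
    recall = correct / len(gold)          # share of gold units that were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical block IDs: 3 of the 4 extracted blocks are among the 5 gold blocks.
score = f_measure({1, 2, 3, 4}, {2, 3, 4, 5, 6})
```

With precision 0.75 and recall 0.6, the example evaluates to 2/3.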

[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.3;TP393.092


Article ID: 1830973


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1830973.html
