Design and Implementation of a Web Page Topic Information Extraction System
Published: 2018-05-01 20:42
Topics: web page topic information extraction + web page preprocessing; Source: Harbin Institute of Technology, master's thesis, 2012
[Abstract]: With the explosive growth of online information, the Internet has become an important source of information in daily life. Because the volume of information is so large, finding it by hand has become increasingly difficult, and search engines have become indispensable tools. In essence, a search engine uses information to find information, and the first step in using information is understanding it. Most of the information a search engine works with, however, comes from web pages that contain a great deal of noise, so web page information extraction has become a key topic for search engine practitioners.
This thesis implements a general-purpose method for extracting topic information from web pages. Because many pages on the Internet are not strictly well-formed, the method first preprocesses each page: file type identification, character encoding handling, script extraction, and fault-tolerant repair and cleaning of the markup. Existing topic information extraction systems do not exploit the structural and visual features of the page itself, so the thesis proposes an extraction algorithm that uses visual information together with semantic features. The algorithm parses the semi-structured page into a structured DOM (Document Object Model) tree, parses the CSS (Cascading Style Sheets) information at the same time, and colors the DOM tree nodes with it, producing a DOM tree annotated with visual information. The VIPS (Vision-Based Page Segmentation) algorithm then segments the page into a hierarchical content tree whose blocks each carry their own semantics. The content blocks are clustered hierarchically so that neighboring blocks are merged into the same class, yielding a set of clusters. Finally, each block is given a topic relevance score based on its structural and semantic features, and the topic information is extracted and output according to a preset threshold.
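The final step, scoring each content block and keeping those above a threshold, can be illustrated with a minimal sketch. The abstract does not say which structural and semantic features the thesis uses, so everything here is an assumption made for illustration: the `Block` type, link density as the structural feature, title-term overlap as the semantic feature, the equal weights, and the default threshold of 0.6.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class Block:
    """Hypothetical content block produced by page segmentation."""
    text: str        # visible text of the block
    link_chars: int  # number of text characters that sit inside links


def topic_score(block: Block, title_terms: Set[str]) -> float:
    """Illustrative relevance score in [0, 1]; not the thesis's formula."""
    n = len(block.text)
    if n == 0:
        return 0.0
    # Structural feature (assumed): low link density suggests body text.
    structural = 1.0 - min(block.link_chars, n) / n
    # Semantic feature (assumed): share of title terms found in the block.
    hits = sum(1 for term in title_terms if term in block.text)
    semantic = hits / len(title_terms) if title_terms else 0.0
    # Equal weighting is an arbitrary choice for this sketch.
    return 0.5 * structural + 0.5 * semantic


def extract_topic_blocks(blocks: Iterable[Block],
                         title_terms: Set[str],
                         threshold: float = 0.6) -> List[Block]:
    """Keep only the blocks whose score reaches the preset threshold."""
    return [b for b in blocks if topic_score(b, title_terms) >= threshold]
```

With these assumed features, a navigation bar that consists entirely of link text and shares no terms with the page title scores 0 and is dropped, while a link-free paragraph that contains all the title terms scores 1 and is kept.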
Experiments on Chinese web pages show that the F-measure reaches 0.93 for Chinese news pages and 0.84 for general Chinese web pages. These results indicate that the method basically meets the requirements of practical use.
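For readers unfamiliar with the metric, the F value is conventionally the weighted harmonic mean of precision and recall. The abstract does not state which weighting the thesis uses; the sketch below assumes the common balanced form (F1, `beta = 1`) by default, and the function name is illustrative.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing was extracted correctly
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

Because it is a harmonic mean, the score is pulled toward the lower of the two inputs, so a high F value requires both precision and recall to be high.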
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP391.3; TP393.092
Article ID: 1830973
Link to this article: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1830973.html