Research on Key Technologies for Information Extraction Based on Web Page Structure

Published: 2018-03-18 04:03

Topic: search engines   Focus: topic pages   Source: South China University of Technology (華南理工大學), 2011 master's thesis   Type: degree thesis

[Abstract]: The Internet has become an important source of information in people's lives. As online information grows rapidly, finding what a user wants within this mass of data is a major challenge. Search engines have addressed the problem reasonably well, but most information on the web is published in HTML, which is a semi-structured language: it organizes information with predefined tags, and only a few of those tags carry information themselves. Although HTML pages on the Internet vary enormously, two classes of pages have very distinct characteristics: topic pages and non-topic pages. A non-topic page contains a very large number of links and has no single unifying topic; Internet portal sites and their secondary sites are typical of this class. A topic page has a central topic and, according to its layout, can be divided into parts such as navigation, topic content, copyright information, and advertising; news pages are a typical example.

This thesis designs a new page segmentation method for topic pages. The method uses the page's organizational tags as the basis for splitting and defines a number of segmentation rules. Compared with the original segmentation method of the Kapok (木棉) system, the new method introduces a temporary block pool so that small pieces lying between blocks can be merged into one larger block, which keeps the segmentation from becoming too fine-grained. The new method also introduces rules for judging the type of each block. Blocks fall into four types: link blocks, footer blocks, noise blocks, and topic blocks. Only topic blocks are retained; the other types are discarded because they carry little information.

Building on this segmentation, the thesis designs and implements new information extraction methods for pages on the South China University of Technology campus network. These methods extract the following from campus pages: page title, page publication time, page description image, and page body text. The original system already extracted the first three items but did not use the page's topic information, so the extracted information was incomplete or, in some cases, inaccurate. The new methods make full use of the page's topic information and effectively improve extraction accuracy. They also add extraction of the body text, which can be used for page text summarization.

Finally, the thesis evaluates the basic properties of web pages, the page segmentation, and the information extraction methods. The evaluation covers three aspects: tests of page properties, a performance comparison of segmentation methods, and the results of applying information extraction. Information extraction is applied in the Kapok retrieval system to compare the extraction quality of the original and the new methods. The test data set consists of campus pages from South China University of Technology together with topic and non-topic pages from 9 Internet portal sites.
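The segmentation idea described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the set of organizational tags, the length and link-ratio thresholds, the footer keywords, and all names (`Segmenter`, `segment`, `classify`, `topic_blocks`) are assumptions chosen for illustration. The sketch splits a page at organizational tags, collects small pieces in a temporary pool that is emitted as one merged block when a large block arrives, and then labels each block as link, footer, noise, or topic.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# Assumed set of "organizational" tags; the thesis's actual set is not given here.
SPLIT_TAGS = {"div", "table", "tr", "td", "ul", "ol", "p", "h1", "h2", "h3"}
MIN_BLOCK_CHARS = 40                       # hypothetical: shorter pieces go to the pool
LINK_RATIO_LIMIT = 0.5                     # hypothetical: above this, a block is mostly links
FOOTER_HINTS = ("copyright", "©", "版权")   # hypothetical footer keywords


@dataclass
class Block:
    text: str = ""
    link_chars: int = 0  # characters that appeared inside <a>...</a>


class Segmenter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.blocks = []        # finished blocks, in document order
        self.pool = Block()     # temporary pool of small pieces
        self.current = Block()  # text collected since the last split point
        self.in_link = 0
        self.skip = 0           # depth inside <script>/<style>

    def flush_pool(self):
        # Emit everything in the pool as one merged block.
        if self.pool.text:
            self.blocks.append(self.pool)
        self.pool = Block()

    def flush_current(self):
        piece, self.current = self.current, Block()
        if not piece.text:
            return
        if len(piece.text) >= MIN_BLOCK_CHARS:
            self.flush_pool()          # small pieces seen so far become one block
            self.blocks.append(piece)
        else:
            sep = " " if self.pool.text else ""
            self.pool.text += sep + piece.text
            self.pool.link_chars += piece.link_chars

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
        elif tag == "a":
            self.in_link += 1
        elif tag in SPLIT_TAGS:
            self.flush_current()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip = max(0, self.skip - 1)
        elif tag == "a":
            self.in_link = max(0, self.in_link - 1)
        elif tag in SPLIT_TAGS:
            self.flush_current()

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if not text or self.skip:
            return
        sep = " " if self.current.text else ""
        self.current.text += sep + text
        if self.in_link:
            self.current.link_chars += len(text)


def segment(html):
    """Split a page into blocks, merging small pieces through the pool."""
    parser = Segmenter()
    parser.feed(html)
    parser.close()
    parser.flush_current()
    parser.flush_pool()
    return parser.blocks


def classify(block):
    """Label a block with one of the four types named in the abstract."""
    lowered = block.text.lower()
    if any(hint in lowered for hint in FOOTER_HINTS):
        return "footer"
    if block.link_chars / len(block.text) > LINK_RATIO_LIMIT:
        return "link"
    if len(block.text) < MIN_BLOCK_CHARS:
        return "noise"
    return "topic"


def topic_blocks(html):
    """Keep only topic blocks; the other types are discarded."""
    return [b for b in segment(html) if classify(b) == "topic"]
```

Calling `topic_blocks(page_html)` returns the retained blocks, whose `text` fields could then feed title, publication-time, or body-text extraction. Real pages would likely need DOM-aware rules rather than these flat thresholds.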
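[Degree-granting institution]: South China University of Technology (華南理工大學)
[Degree level]: Master's
[Year degree granted]: 2011
[Classification number]: TP393.092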

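[Citing literature]

Related master's theses: 1

1 Xiong Zhi (熊芝); Design and Implementation of an Automatic Summarization System for Chinese Web Pages [D]; South China University of Technology; 2011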


Article ID: 1627885


Article link: http://sikaile.net/wenyilunwen/guanggaoshejilunwen/1627885.html
