
Research on Key Technologies of Information Extraction Based on Web Page Structure

Published: 2018-03-18 04:03

  Topic: search engines  Focus: topic-type web pages  Source: South China University of Technology, master's thesis, 2011  Document type: degree thesis


[Abstract]: The Internet has become an important source of information in daily life. As the volume of online information grows rapidly, finding what a user actually wants in that mass of data is a major challenge. Search engines address this problem fairly well, but most information on the web is published in HTML, which is a semi-structured language: it organizes information with predefined tags, and only a few of those tags carry any information themselves.

HTML pages on the Internet vary enormously, but two types have very distinct characteristics: topic-type pages and non-topic-type pages. A non-topic-type page contains a large number of links and has no single unifying topic; portal sites and their secondary sites are typical examples. A topic-type page has a central topic, and its layout can be divided into parts such as navigation, topic content, copyright information, and advertisements; news pages are typical examples.

This thesis designs a new page segmentation method for topic-type pages. The method uses the page's organizational (layout) tags as the basis for segmentation and defines a set of segmentation rules. Compared with the original segmentation method of the Kapok (木棉) system, the new method introduces a temporary block pool, so that small fragments lying between blocks can be merged into one larger block and the segmentation does not become too fine-grained. The new method also introduces rules for judging block type. Blocks fall into four types: link blocks, footer blocks, noise blocks, and topic blocks. Only topic blocks are kept; the other types are discarded because they carry little information.

On top of this segmentation, the thesis designs and implements new information extraction methods for pages on the South China University of Technology campus network. They extract the following from campus pages: page title, page publication time, page description image, and page body text. The original system already extracted the first three items, but it did not use the page's topic information, so the extracted information was incomplete or, for some items, inaccurate. The new methods make full use of the topic information and effectively improve extraction accuracy. They also add extraction of the page body text, which can be used for page text summarization.

Finally, the thesis evaluates basic page properties, page segmentation, and the information extraction methods in three respects: tests of page properties, a performance comparison of the segmentation methods, and the results of applying information extraction. For the last of these, information extraction is applied in the Kapok retrieval system and the output of the original and new extraction methods is compared. The test data set consists of South China University of Technology campus pages together with topic-type and non-topic-type pages from 9 Internet portal sites.
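The segmentation idea summarized in the abstract (split on layout tags, collect undersized fragments in a temporary pool and merge them, then type each block and keep only topic blocks) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the tag set `BLOCK_TAGS`, the thresholds `MIN_BLOCK_CHARS` and `LINK_RATIO`, the `FOOTER_HINTS` keywords, and every function name below are assumptions made for illustration, since the abstract does not give the actual rules.

```python
from dataclasses import dataclass
from html.parser import HTMLParser

# All constants below are illustrative assumptions, not values from the thesis.
BLOCK_TAGS = {"div", "table", "tr", "td", "p", "ul", "ol", "li"}
MIN_BLOCK_CHARS = 40   # blocks shorter than this go to the temporary pool
LINK_RATIO = 0.6       # share of link text above which a block is a link block
FOOTER_HINTS = ("copyright", "all rights reserved")


@dataclass
class Block:
    text: str = ""
    link_chars: int = 0  # number of characters that sit inside <a> elements


class Segmenter(HTMLParser):
    """Cuts the page text into raw blocks at every layout-tag boundary."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self.current = Block()
        self.in_link = 0

    def flush(self):
        if self.current.text.strip():
            self.blocks.append(self.current)
        self.current = Block()

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a" and self.in_link:
            self.in_link -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        self.current.text += text + " "
        if self.in_link:
            self.current.link_chars += len(text)


def merge_small(blocks, min_chars=MIN_BLOCK_CHARS):
    """Hold consecutive small blocks in a pool and emit them as one block."""
    merged, pool = [], []

    def drain():
        if pool:
            text = " ".join(b.text.strip() for b in pool)
            merged.append(Block(text, sum(b.link_chars for b in pool)))
            pool.clear()

    for block in blocks:
        if len(block.text.strip()) < min_chars:
            pool.append(block)
        else:
            drain()
            merged.append(block)
    drain()
    return merged


def segment(html):
    parser = Segmenter()
    parser.feed(html)
    parser.close()
    parser.flush()
    return merge_small(parser.blocks)


def classify(block):
    """Assign one of the four block types named in the abstract."""
    text = block.text.strip()
    if any(hint in text.lower() for hint in FOOTER_HINTS):
        return "footer"
    if block.link_chars / max(1, len(text)) > LINK_RATIO:
        return "link"
    if len(text) < MIN_BLOCK_CHARS:
        return "noise"
    return "topic"


def topic_blocks(html):
    """Keep only topic blocks; the other three types are discarded."""
    return [b for b in segment(html) if classify(b) == "topic"]


SAMPLE_HTML = """
<div><a href="/">Home</a> <a href="/news">News</a> <a href="/about">About</a></div>
<div><p>The university opened a new library building on Monday, doubling the available study space.</p></div>
<div><p>Photo: campus</p><p>Share this</p></div>
<div>Copyright 2011 Example University. All rights reserved.</div>
"""

if __name__ == "__main__":
    for block in segment(SAMPLE_HTML):
        print(classify(block), "|", block.text.strip())
```

In the sample page, the two short fragments "Photo: campus" and "Share this" are pooled and merged into one block before typing, which is the granularity-control role the abstract assigns to the temporary block pool. The sketch ignores `<script>` and `<style>` content and uses only text length and link density as features; a real rule set would need more signals.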
[Degree-granting institution]: South China University of Technology
[Degree level]: Master's
[Year degree granted]: 2011
[Classification number]: TP393.092

[Citing documents]

Related master's theses: 1 entry

1 Xiong Zhi (熊芝); Design and Implementation of an Automatic Summarization System for Chinese Web Pages [D]; South China University of Technology; 2011



Document ID: 1627885


Link to this article: http://sikaile.net/wenyilunwen/guanggaoshejilunwen/1627885.html

