Design and Implementation of an Improved Web-Based Information Extraction Algorithm
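Published: 2018-03-03 22:33
Topic: information extraction. Focus: pairwise sequence alignment. Source: University of Electronic Science and Technology of China (電子科技大學(xué)), master's thesis, 2014. Document type: degree thesis.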
[Abstract]: With the rapid development of the Internet and related technologies, the Internet has become the main platform on which people publish and obtain information. The sheer volume of information online makes it hard for users to find what is useful, and the ability to search Web pages for specific information still falls short of users' needs. How to devise an effective information extraction method for Web page information extraction systems has therefore become a pressing research problem. This thesis studies a new information extraction algorithm that automatically extracts information from data-intensive pages. The work addresses the following problems. First, initialization: all sample pages in the training set are converted into HTML documents. Second, automatic removal of page noise: many websites, for example commercial sites such as Taobao, group-buying sites, and travel sites, carry navigation bars, advertisements, logos, copyright notices, and other information unrelated to the main content. This thesis applies an improved pairwise sequence alignment algorithm to remove this noise from Web pages. Third, automatic template extraction: dynamic page technology is now used by many websites, and the "dynamic" pages considered here are those generated by combining a template with a back-end database; the Web information extraction method targets pages of this kind. The denoised pages are repaired into well-formed standard pages that serve as the training set, and the template extraction algorithm is run on them. Finally, experiments are carried out on a collection of data-intensive pages from real websites. The results indicate that the improved pairwise sequence alignment is effective for page denoising and that the information extraction system designed in this thesis is effective for information extraction.
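The abstract does not give the details of the improved alignment algorithm, so the following is only a minimal sketch of the underlying idea, using the classical Needleman-Wunsch global alignment rather than the thesis's improved variant. It assumes that each page has already been split into a list of tokens (tags and text nodes), and that tokens which align identically across two pages from the same site are shared page structure (template or boilerplate such as navigation bars), while tokens that differ are candidate data. The function names, the scoring values, and the sample pages are all hypothetical.

```python
from typing import List, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]


def align(a: List[str], b: List[str],
          match: int = 1, mismatch: int = -1, gap: int = -1) -> List[Pair]:
    """Globally align two token sequences (Needleman-Wunsch).

    Returns a list of (token_from_a, token_from_b) pairs; None marks a gap.
    The scoring values are illustrative assumptions, not taken from the thesis.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs: List[Pair] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                pairs.append((a[i - 1], b[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None))  # token only in a
            i -= 1
        else:
            pairs.append((None, b[j - 1]))  # token only in b
            j -= 1
    pairs.reverse()
    return pairs


def split_shared_and_varying(pairs: List[Pair]) -> Tuple[List[str], List[Pair]]:
    """Separate tokens shared by both pages from positions where they differ."""
    shared = [x for x, y in pairs if x is not None and x == y]
    varying = [(x, y) for x, y in pairs if x != y]
    return shared, varying


# Hypothetical token sequences for two pages generated from the same template.
page_a = ["<html>", "<div>", "Home", "</div>", "<li>", "Phone A", "</li>", "</html>"]
page_b = ["<html>", "<div>", "Home", "</div>", "<li>", "Camera B", "</li>", "</html>"]

shared, varying = split_shared_and_varying(align(page_a, page_b))
```

In this sketch `shared` holds the tokens common to both pages and `varying` holds the aligned positions where the pages differ. Whether a shared token is noise to discard or template structure to keep for wrapper generation is a separate decision that the abstract leaves to the body of the thesis.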
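[Degree-granting institution]: University of Electronic Science and Technology of China
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.092; TP391.1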