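Research on Web Page Information Extraction Technology in Vertical Search Engines
Published: 2018-02-17 02:41
Keywords: vertical search engine; Web object; information extraction; VIPS; block importance; 2D CRFs; HCRFs. Source: Jiangnan University, master's thesis, 2012. Document type: degree thesis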
[Abstract]: With the rapid development of the Internet, the information available online has grown explosively, and the limitations of general-purpose search engines have become increasingly apparent. Vertical search engines emerged in recent years to locate the information people want more quickly and accurately. A vertical search engine targets one specific domain and returns more fine-grained results than a general-purpose engine, so it has to extract domain-related information from web pages. This thesis studies web page information extraction techniques for vertical search engines and covers the following four aspects.
(1) Web page analysis based on visual features. Building on a study of the vision-based page segmentation method (VIPS), a prototype system of the VIPS algorithm is implemented and used to segment the web pages to be extracted, which prepares the data for the subsequent extraction work.
(2) Web object information extraction based on block importance and 2D CRFs. For the Web object extraction workflow, a method based on block importance and 2D CRFs is proposed. First, the block importance model (BIM) scores the page blocks produced by visual segmentation and locates the target block that contains the object information. Then a 2D CRFs model is built on the two-dimensional structural features of the target block to extract the object information. Finally, comparative experiments verify the feasibility of the method.
(3) Web object information extraction based on improved HCRFs. HCRFs are a statistical model that can be used for Web object extraction, but they do not fully describe the conditional dependencies among the elements of a Web object. The thesis proposes an improved hierarchical conditional random field model, LL-HCRFs, together with a method for adding long-distance dependencies between object elements, and adapts the original parameter estimation algorithm to the newly added dependencies. Comparative experiments against Linear-CRFs and HCRFs show that the improved model performs well on Web object extraction.
(4) The "搜食計" vertical search engine. The last part of the thesis designs and implements "搜食計", a prototype vertical search engine for the catering domain, and describes each functional module of the prototype in detail.
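The two-stage flow described in item (2) of the abstract, locating a target block and then labelling its elements, can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the `Block` fields, the hand-picked weights in `importance`, the label set, and the scores are hypothetical, and the labelling step uses plain linear-chain Viterbi decoding (the kind of baseline the thesis compares against) rather than the 2D CRFs or LL-HCRFs models themselves.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A page block as a visual segmenter might emit it (hypothetical fields)."""
    name: str
    x: float           # left edge, as a fraction of page width
    y: float           # top edge, as a fraction of page height
    width: float
    height: float
    link_ratio: float  # share of the block's text that is anchor text


def importance(b: Block) -> float:
    """Toy importance score: large, central, link-poor blocks rank higher.

    The weights are illustrative, not learned and not taken from the thesis.
    """
    area = b.width * b.height
    cx, cy = b.x + b.width / 2, b.y + b.height / 2
    centrality = 1.0 - (abs(cx - 0.5) + abs(cy - 0.5))
    return 0.5 * area + 0.3 * centrality + 0.2 * (1.0 - b.link_ratio)


def target_block(blocks):
    """Pick the block with the highest importance score."""
    return max(blocks, key=importance)


def viterbi(emission, transition, labels):
    """Best label sequence for a linear chain of elements.

    emission[t][label] scores label at position t;
    transition[(prev, cur)] scores moving from prev to cur.
    """
    best = [{l: emission[0][l] for l in labels}]
    back = []
    for t in range(1, len(emission)):
        scores, ptr = {}, {}
        for cur in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transition[(p, cur)])
            scores[cur] = best[-1][prev] + transition[(prev, cur)] + emission[t][cur]
            ptr[cur] = prev
        best.append(scores)
        back.append(ptr)
    # Follow back-pointers from the best final label.
    path = [max(labels, key=lambda l: best[-1][l])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]


# Hypothetical page layout: header, sidebar, and main content area.
blocks = [
    Block("header", 0.0, 0.0, 1.0, 0.1, link_ratio=0.8),
    Block("sidebar", 0.0, 0.1, 0.2, 0.8, link_ratio=0.9),
    Block("main", 0.2, 0.1, 0.8, 0.8, link_ratio=0.1),
]

# Hypothetical labels and scores for three elements inside the target block.
labels = ["NAME", "ADDRESS", "PHONE"]
emission = [
    {"NAME": 2.0, "ADDRESS": 0.0, "PHONE": 0.0},
    {"NAME": 0.5, "ADDRESS": 0.4, "PHONE": 0.0},  # ambiguous on its own
    {"NAME": 0.0, "ADDRESS": 0.0, "PHONE": 2.0},
]
transition = {(p, c): 0.0 for p in labels for c in labels}
transition[("NAME", "ADDRESS")] = 1.0
transition[("ADDRESS", "PHONE")] = 1.0

if __name__ == "__main__":
    print(target_block(blocks).name)
    print(viterbi(emission, transition, labels))
```

In this toy example the second element is ambiguous when scored alone, and the transition scores are what resolve it. That is the basic motivation for modelling dependencies between object elements, which the thesis extends to two-dimensional and long-distance dependencies.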
[Degree-granting institution]: Jiangnan University
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP391.3
Article ID: 1517027
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1517027.html