Research on Deep Web Data Acquisition Techniques for Specific Domains

Published: 2018-07-25 06:03

[Abstract]: With the rapid development of Internet technology, the high-quality information resources hidden in Web databases have attracted wide attention because they are well structured and very large. These resources, however, appear as HTML pages only after a user submits a query through a Web query interface, so traditional search engines cannot reach them; they are therefore called the Deep Web. To make better use of Deep Web resources, the data hidden behind a query interface must first be brought out onto query result pages and then extracted into structured records. This thesis studies the key techniques for Deep Web data acquisition in a specific domain. The work has two parts, data surfacing and data record extraction, and its main contributions are as follows:
1) For range attributes in Deep Web query interfaces, a sampling-based value-range partitioning method is proposed. It improves the efficiency of data surfacing through Top-k query interfaces.
2) For categorical attributes in the query interface, a data surfacing method based on a hierarchical tree model is improved. By adjusting the order in which categorical attributes are submitted, it reduces the number of queries submitted.
3) For text attributes in the query interface, a candidate-value filtering method is adopted. It filters candidate values according to their distribution in a sample database, which increases the average return per query.
4) Based on the distribution of feature nodes in query result pages, a data region location algorithm is proposed. It combines the structural information of the page with the attribute features of data records, which reduces the impact of page structure changes on extraction quality.
5) For the data record extraction stage, a method that combines feature sequence partitioning with tree similarity is discussed. It improves the accuracy of data record extraction and also aligns the attributes of the extracted records.
Experiments verify the effectiveness of these algorithms, and a prototype Deep Web information integration system for the e-commerce domain is designed.
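The sampling-based range partitioning in item 1) can be illustrated with a minimal Python sketch. The abstract does not describe the algorithm itself, so everything here is an assumption made for illustration: the function names `plan_ranges`, `surface`, and `make_query` are hypothetical, the sample is taken to be a uniform sample of attribute values, the database size is taken to be known or estimated, and the Top-k interface is simulated by a function that returns at most k rows plus an overflow flag. The idea shown is only the general one: use the sample to place cut points so that each sub-range is expected to match at most k records, and fall back to halving a range when a query is still truncated.

```python
from typing import Callable, List, Tuple

Range = Tuple[float, float]  # half-open interval [low, high)
Query = Callable[[float, float], Tuple[List[float], bool]]


def plan_ranges(sample: List[float], low: float, high: float,
                db_size: int, k: int) -> List[Range]:
    """Cut [low, high) so each piece is expected to match at most k records.

    Assumes `sample` is a uniform sample of the attribute values and that
    `db_size` is a usable estimate of the number of records.
    """
    values = sorted(v for v in sample if low <= v < high)
    if not values:
        return [(low, high)]
    # Number of sample points that corresponds to roughly k database records.
    per_piece = max(1, (k * len(sample)) // db_size)
    cuts: List[float] = []
    for i in range(per_piece, len(values), per_piece):
        last = cuts[-1] if cuts else low
        if values[i] > last:  # keep cut points strictly increasing
            cuts.append(values[i])
    bounds = [low] + cuts + [high]
    return list(zip(bounds[:-1], bounds[1:]))


def surface(query: Query, ranges: List[Range]) -> Tuple[List[float], int]:
    """Submit every range; halve any range whose result was truncated."""
    records: List[float] = []
    submitted = 0
    stack = list(reversed(ranges))
    while stack:
        lo, hi = stack.pop()
        rows, overflow = query(lo, hi)
        submitted += 1
        if overflow and hi - lo > 1e-9:
            mid = (lo + hi) / 2
            stack.extend([(mid, hi), (lo, mid)])
        else:
            records.extend(rows)
    return records, submitted


def make_query(db: List[float], k: int) -> Query:
    """Simulate a Top-k interface: at most k rows and an overflow flag."""
    def query(lo: float, hi: float) -> Tuple[List[float], bool]:
        hits = [v for v in db if lo <= v < hi]
        return hits[:k], len(hits) > k
    return query


if __name__ == "__main__":
    # Skewed toy data: values crowd toward the low end of the range.
    db = [(i / 1000) ** 2 * 1000 for i in range(1000)]
    sample = db[::10]
    planned = plan_ranges(sample, 0.0, 1000.0, db_size=len(db), k=50)
    _, planned_cost = surface(make_query(db, 50), planned)
    _, naive_cost = surface(make_query(db, 50), [(0.0, 1000.0)])
    print("queries with sampled cut points:", planned_cost)
    print("queries with plain halving:", naive_cost)
```

On skewed data such as the toy set in the demo, cut points taken from the sample follow the data density, so fewer submissions are wasted on truncated results than with blind halving of the range. A real interface would also need a way to detect truncation, which this sketch simply assumes is available.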
[Degree-granting institution]: Soochow University
[Degree]: Master's
[Year degree granted]: 2012
[Classification number]: TP393.09

Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2142827.html
