Research on Domain-Specific Deep Web Data Acquisition Technology
[Abstract]: With the rapid development of Internet technology, the high-quality information resources hidden in Web databases have attracted wide attention because they are well structured and large in volume. These resources, however, are shown only as HTML pages after a user submits a query through a Web query interface, so traditional search engines cannot reach them; they are therefore called the Deep Web. To make better use of Deep Web resources, the data hidden behind a query interface must first be surfaced onto query result pages and then extracted into structured data. This thesis studies the key technologies of domain-specific Deep Web data acquisition. The work falls into two parts: data surfacing and data record extraction. The main contributions are as follows:
1) For range attributes in Deep Web query interfaces, a sampling-based range partitioning method is proposed. It improves the efficiency of data surfacing through Top-k query interfaces.
2) For categorical attributes in the query interface, a data surfacing method based on the hierarchical tree model is improved. It reduces the number of query submissions by adjusting the order in which categorical attributes are submitted.
3) For text attributes in the query interface, a candidate value filtering method is adopted. It uses the distribution of candidate attribute values in a sample database to filter the candidates and raise the average gain per query.
4) A data region location algorithm is proposed, based on the distribution of feature nodes in the query result page. It combines the structural information of the page with the attribute features of data records, which reduces the impact of page structure changes on extraction quality.
5) For the data record extraction stage, a method based on feature sequence partitioning and tree similarity is presented. It improves the accuracy of data record extraction and also aligns the attributes of the extracted records.
Experiments verify the effectiveness of these algorithms, and a prototype Deep Web information integration system for the e-commerce domain is designed.
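The abstract does not give the details of the sampling-based range partitioning in item 1), so the following is only a minimal sketch of the general idea under stated assumptions, not the thesis's algorithm. It assumes an integer-valued range attribute, a known estimate of the database size, and a Top-k interface that silently truncates results to k records. The names `make_topk_interface`, `sample_boundaries`, and `surface` are hypothetical. Cut points are taken from a sample so that each initial range is expected to hold fewer than k records; any range that still returns k records is bisected.

```python
import random
from typing import Callable, List, Tuple

Range = Tuple[int, int]


def make_topk_interface(hidden: List[int], k: int) -> Callable[[int, int], List[int]]:
    """Hypothetical stand-in for a Web query interface: returns at most k
    records whose attribute value lies in the closed range [lo, hi]."""
    def query(lo: int, hi: int) -> List[int]:
        matches = [v for v in hidden if lo <= v <= hi]
        return matches[:k]
    return query


def sample_boundaries(sample: List[int], lo: int, hi: int,
                      db_size_estimate: int, k: int) -> List[Range]:
    """Split [lo, hi] into contiguous ranges using sample values as cut points.
    Each range is sized to hold roughly k/2 records in expectation (the factor
    of 2 is an assumed safety margin)."""
    s = sorted(v for v in sample if lo <= v <= hi)
    per_bucket = max(1, int(len(sample) * k / db_size_estimate / 2))
    cuts = sorted({s[i] for i in range(per_bucket, len(s), per_bucket)})
    ranges: List[Range] = []
    start = lo
    for c in cuts:
        if start <= c < hi:
            ranges.append((start, c))
            start = c + 1
    ranges.append((start, hi))
    return ranges


def surface(query: Callable[[int, int], List[int]],
            ranges: List[Range], k: int) -> Tuple[List[int], int]:
    """Issue one query per range; bisect any range whose result hits the k limit.
    Returns the collected records and the number of queries issued."""
    records: List[int] = []
    issued = 0
    stack = list(reversed(ranges))
    while stack:
        lo, hi = stack.pop()
        result = query(lo, hi)
        issued += 1
        if len(result) >= k and lo < hi:
            # Possible overflow: results may be truncated, so split and retry.
            mid = (lo + hi) // 2
            stack.append((mid + 1, hi))
            stack.append((lo, mid))
        else:
            # Note: a single-value range that still overflows cannot be split
            # further; whatever the interface returned is kept.
            records.extend(result)
    return records, issued


if __name__ == "__main__":
    hidden = [(i * 37) % 10007 for i in range(2000)]  # synthetic, distinct values
    k = 50
    query = make_topk_interface(hidden, k)
    sample = random.Random(0).sample(hidden, 200)
    ranges = sample_boundaries(sample, 0, 10006, db_size_estimate=2000, k=k)
    records, issued = surface(query, ranges, k)
    print(len(records), "records retrieved with", issued, "queries")
```

The bisection step is a fallback for ranges where the sample underestimates the true density; with a representative sample most initial ranges should fit under the k limit on the first query.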
[Degree-granting institution]: Soochow University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP393.09