Automatic Deep Web Data Extraction Based on Visual Information and the DOM Tree
[Abstract]: With the rapid development of the Internet, the Web now holds a vast amount of information covering nearly every field of the real world. Compared with the Surface Web, the Deep Web contains more data, has more traffic, and grows faster. However, Deep Web pages are generated dynamically and are difficult for traditional search engines to index, so how to obtain and use Deep Web data effectively has become an important research direction. Deep Web data is presented through query result pages, but the data in these pages varies in form and lacks structure: it is easy for users to browse but difficult for programs to use. Based on the visual information of web pages and the structure of the DOM tree, this thesis studies automatic data extraction from Deep Web query result pages. The main research contents are as follows:

(1) Locating data regions. First, the characteristics of data regions in Deep Web query result pages are analyzed to identify the visual features by which a data region can be located. Relevant pages are then collected as samples, the nodes in the samples are annotated manually, and a decision tree is trained with Weka. Finally, the rules derived from the decision tree are used to locate the data region.
(2) Extracting data records. This process has two steps: locating data records and denoising. In the first step, a data record location algorithm is proposed based on the DOM tree structure and visual characteristics of data records in the page; the nodes it returns include the data record nodes but also a small amount of noise. In the second step, the similarity of data records is defined in terms of XPath, and the data record nodes are obtained by comparing similarities.
(3) Aligning data items. First, each data record is divided into its data items; a data structure is then designed to facilitate alignment, and an XPath-based algorithm for aligning data items is given.
(4) Templates. A template is proposed according to the characteristics of data regions, data records, and data items. Using templates avoids a large amount of repeated computation during extraction, improves extraction speed, and makes it convenient to extract data items from consecutive result pages.

The innovations of this thesis are as follows: (1) XPath is introduced, the similarity of data records is defined in terms of XPath, and the alignment of data items is completed by comparing XPaths. (2) The concept of data item granularity is proposed, together with a corresponding method for dividing data records into data items. Based on this research, an automatic data extraction system for Deep Web query result pages is designed and developed, and other problems encountered during extraction, such as extracting data loaded asynchronously via AJAX, are solved. Experiments show that the method extracts data from Deep Web query result pages quickly and accurately.
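The denoising step in (2) can be illustrated with a minimal sketch. The abstract does not give the thesis's actual similarity definition, so the code below assumes one: each candidate node is described by the set of its index-free tag paths (a simplified XPath), and similarity is the Jaccard overlap of two such sets. The function names, the threshold of 0.5, and the sample markup are all hypothetical.

```python
import xml.etree.ElementTree as ET


def tag_paths(node, prefix=""):
    """Return the set of index-free tag paths (simplified XPaths) to the leaves under node."""
    path = f"{prefix}/{node.tag}"
    children = list(node)
    if not children:
        return {path}
    paths = set()
    for child in children:
        paths |= tag_paths(child, path)
    return paths


def xpath_similarity(a, b):
    """Jaccard overlap of two nodes' tag-path sets (an assumed measure, not the thesis's)."""
    pa, pb = tag_paths(a), tag_paths(b)
    return len(pa & pb) / len(pa | pb)


def filter_records(candidates, threshold=0.5):
    """Keep candidates whose average similarity to the other candidates meets the threshold."""
    kept = []
    for i, node in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        if not others:
            kept.append(node)
            continue
        avg = sum(xpath_similarity(node, o) for o in others) / len(others)
        if avg >= threshold:
            kept.append(node)
    return kept


# Hypothetical data region: three records and one noise node.
SAMPLE = """
<div>
  <div><a>Book A</a><span>10.00</span></div>
  <div><a>Book B</a><span>12.50</span></div>
  <div><p>Sponsored links</p></div>
  <div><a>Book C</a><span>8.75</span></div>
</div>
"""

region = ET.fromstring(SAMPLE.strip())
records = filter_records(list(region))
titles = [r.find("a").text for r in records]
print(titles)
```

In this sample the three book entries share identical tag paths, while the "Sponsored links" node shares none with them, so it falls below the threshold and is discarded. A real query result page would need an HTML parser that tolerates malformed markup, and the threshold would have to be tuned on sample pages.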
[Degree-granting institution]: Ocean University of China
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.092
Article ID: 2171026
Article link: http://sikaile.net/guanlilunwen/ydhl/2171026.html