Automatic Deep Web Data Extraction Based on Visual Information and the DOM Tree

Published: 2018-08-07 18:50

[Abstract]: With the rapid growth of the Internet, the Web now holds a vast amount of information covering every field of the real world. Compared with the Surface Web, the Deep Web contains richer data, receives more traffic, and grows faster. However, Deep Web pages are generated dynamically and are hard for traditional search engines to index, so obtaining and using Deep Web data effectively has become an important research direction. Deep Web data is presented through query result pages, but the data in these pages varies in form and lacks structure: it is easy for users to browse but hard to exploit. Based on the visual information of web pages and the structure of the DOM tree, this thesis studies automatic data extraction from Deep Web query result pages. The main research contents are as follows:

(1) Locating the data region. The characteristics of data regions in Deep Web query result pages are analyzed first to find visual features that can be used to locate them. Relevant pages are then collected as samples, and the nodes in the samples are labeled by hand. A decision tree is trained with Weka, and the rules corresponding to that decision tree are used to locate the data region.

(2) Extracting data records. This process has two steps: locating data records and denoising. In the first step, a data record location algorithm is proposed based on the DOM tree structure and visual features of data records in the page; the nodes it returns include the data record nodes but also a small amount of noise. In the second step, the similarity of data records is defined through xpath, and noise is removed by comparing similarities, which yields the data record nodes.

(3) Aligning data items. Each data record is first divided into data items. A data structure is then designed to make alignment easier, and an xpath-based algorithm for aligning data items is given.

(4) Templates. Templates are proposed for data regions, data records, and data items according to their respective characteristics. Using templates avoids a large amount of repeated computation during extraction, which speeds extraction up, and makes it convenient to extract data items from consecutive result pages.

The innovations of this thesis are as follows: (1) The concept of xpath is introduced; the similarity of data records is defined through xpath and used to denoise data records, and data items are aligned by comparing xpaths. (2) The concept of data item granularity is proposed, together with a method for dividing data records into data items.

Building on this research, an automatic data extraction system for Deep Web query result pages was designed and developed, and other problems encountered during extraction, such as extracting AJAX asynchronous data, were solved. Experiments show that the method extracts data from Deep Web query result pages quickly and accurately.
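The xpath-based denoising and alignment steps described in the abstract can be illustrated with a small sketch. The abstract does not give the thesis's actual similarity definition or data structures, so everything here is an assumption made for illustration: similarity is taken to be the Jaccard overlap of index-free tag paths under each candidate node, noise is dropped when its average similarity to the other candidates falls below a threshold, and data items are aligned by the tag path at which their text occurs. The names `tag_paths`, `similarity`, `denoise`, `align`, and the sample `PAGE` are hypothetical.

```python
import xml.etree.ElementTree as ET


def tag_paths(node, prefix=""):
    """Collect index-free tag paths (a simplified stand-in for xpaths) under node."""
    paths = set()
    for child in node:
        path = f"{prefix}/{child.tag}"
        paths.add(path)
        paths |= tag_paths(child, path)
    return paths


def similarity(a, b):
    """Assumed similarity: Jaccard overlap of the two nodes' tag-path sets."""
    pa, pb = tag_paths(a), tag_paths(b)
    if not pa and not pb:
        return 1.0
    return len(pa & pb) / len(pa | pb)


def denoise(candidates, threshold=0.5):
    """Keep candidates whose average similarity to the others reaches threshold."""
    kept = []
    for i, node in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        if not others:
            kept.append(node)
            continue
        average = sum(similarity(node, o) for o in others) / len(others)
        if average >= threshold:
            kept.append(node)
    return kept


def collect_text(node, prefix, items):
    """Map each tag path that carries text to the first text found there."""
    for child in node:
        path = f"{prefix}/{child.tag}"
        text = (child.text or "").strip()
        if text:
            items.setdefault(path, text)
        collect_text(child, path, items)


def align(records):
    """Align data items into columns keyed by tag path; missing items become None."""
    columns, rows = [], []
    for record in records:
        items = {}
        collect_text(record, "", items)
        for path in items:
            if path not in columns:
                columns.append(path)
        rows.append(items)
    return columns, [[row.get(c) for c in columns] for row in rows]


# Hypothetical data region: three records, one of them missing its price,
# plus one advertisement node that should be treated as noise.
PAGE = """
<div id="results">
  <div><a><img/></a><h3><a>Book A</a></h3><span>$10</span></div>
  <div><a><img/></a><h3><a>Book B</a></h3><span>$12</span></div>
  <div><p><b>Sponsored</b></p></div>
  <div><a><img/></a><h3><a>Book C</a></h3></div>
</div>
"""

region = ET.fromstring(PAGE)
records = denoise(list(region))
columns, table = align(records)
```

In this toy page the advertisement shares no tag paths with the records, so its average similarity is 0 and it is dropped, while the record without a price still clears the threshold and gets `None` in the `/span` column. A real extractor would also need the visual features and the record location step that the thesis describes, which this sketch leaves out.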
[Degree-granting institution]: Ocean University of China
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: TP393.092
