
Automatic Deep Web Data Extraction Based on Visual Information and the DOM Tree

Published: 2018-08-07 18:50
[Abstract]: With the rapid development of the Internet, it now holds a massive amount of information resources covering every field of the real world. Compared with the Surface Web, the Deep Web contains richer data, receives more traffic, and grows faster. However, Deep Web pages are generated dynamically and are difficult for traditional search engines to index. How to obtain and use the data in Deep Web pages effectively has therefore become an important research direction. Deep Web data is presented through query result pages, but the data in these pages varies in form and lacks structure: it is easy for users to browse but hard to exploit. Based on the visual information of web pages and the DOM tree structure, this thesis studies automatic data extraction from Deep Web query result pages. The main research contents are as follows:

(1) Locating the data region. First, the characteristics of data regions in Deep Web query result pages are analyzed to find visual features that can locate them. Relevant pages are then collected as samples, and the nodes in the samples are labeled manually. A decision tree is trained with Weka, and the rules corresponding to that decision tree are used to locate the data region.

(2) Extracting data records. This process has two steps: locating data records and denoising. In the first step, a data record location algorithm is proposed based on the structural characteristics of the data records' DOM trees and their visual features; the nodes it returns, however, include a small amount of noise in addition to data record nodes. In the second step, the similarity of data records is defined through xpath, and noise is removed by similarity comparison, yielding the data record nodes.

(3) Aligning data items. Data records are first divided into data items; a data structure is then designed to facilitate alignment, and an xpath-based algorithm for aligning data items is given.

(4) Templates. Templates are proposed for data regions, data records, and data items according to their respective characteristics. Using templates avoids a large amount of repeated computation during extraction, which improves extraction speed, and makes it convenient to extract data items from consecutive pages.

The innovations of this thesis are as follows: (1) The concept of xpath is introduced; the similarity of data records is defined through xpath in order to denoise the data records, and data items are aligned by comparing xpaths. (2) The concept of data item granularity is proposed, together with a corresponding method for dividing data records into data items.

Building on this research, an automatic data extraction system for Deep Web query result pages was designed and developed, and other problems encountered during extraction, such as extracting AJAX asynchronous data, were solved. Experiments show that the method can extract data from Deep Web query result pages quickly and accurately.
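The xpath-based denoising in step (2) can be illustrated with a small sketch. The abstract does not state how the thesis defines record similarity, so everything below is an assumption made for illustration: each candidate record is described by the set of tag-only paths to its descendants (an XPath-like form without positional indices), similarity is the Jaccard overlap of two such sets, and a candidate is kept when its average similarity to the other candidates reaches a threshold. The names `tag_paths`, `similarity`, `denoise`, and the threshold value 0.5 are hypothetical. The sample markup is well-formed, so the standard-library `xml.etree.ElementTree` is enough here; real result pages would need an HTML parser.

```python
import xml.etree.ElementTree as ET


def tag_paths(node):
    """Collect tag-only paths (XPath-like, no indices) from node to each descendant."""
    paths = set()

    def walk(elem, prefix):
        path = f"{prefix}/{elem.tag}"
        paths.add(path)
        for child in elem:
            walk(child, path)

    for child in node:
        walk(child, "")
    return paths


def similarity(a, b):
    """Jaccard similarity of two nodes' path sets (an assumed measure, not the thesis's)."""
    pa, pb = tag_paths(a), tag_paths(b)
    if not pa and not pb:
        return 1.0
    return len(pa & pb) / len(pa | pb)


def denoise(candidates, threshold=0.5):
    """Keep candidates whose average similarity to the other candidates is >= threshold."""
    kept = []
    for i, node in enumerate(candidates):
        others = [c for j, c in enumerate(candidates) if j != i]
        if not others:
            kept.append(node)
            continue
        avg = sum(similarity(node, other) for other in others) / len(others)
        if avg >= threshold:
            kept.append(node)
    return kept


# A toy data region: three records with the same structure and one noise node.
region_markup = """
<ul>
  <li><a>Book A</a><span>12.00</span></li>
  <li><a>Book B</a><span>15.50</span></li>
  <li><a>Book C</a><span>9.90</span></li>
  <li><div><img/></div></li>
</ul>
""".strip()

region = ET.fromstring(region_markup)
records = denoise(list(region))
titles = [record.find("a").text for record in records]
print(titles)
```

In this toy region the three book entries share the paths `/a` and `/span`, while the last `li` has only `/div` and `/div/img`, so its average similarity to the others is zero and it is dropped. Comparing paths at the same position across the kept records is also one plausible starting point for the data item alignment in step (3), though the abstract does not describe that algorithm in detail.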
[Degree-granting institution]: Ocean University of China
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.092





Article link: http://sikaile.net/guanlilunwen/ydhl/2171026.html

