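Research on Deep Web Information Extraction Based on Visual Segmentation and Semantic DOM
Published: 2018-06-23 07:27
Topics: data extraction + DOM tree; Source: Shanghai Normal University, master's thesis, 2016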
[Abstract]: Information that is hidden behind ordinary search engines, and that can be obtained only when a user submits a form query and a result page is returned from a back-end database, is called the Deep Web. Deep Web data extraction is currently an active research topic. As page structures grow more complex and dynamic web technologies are introduced, Deep Web pages have become heterogeneous and semi-structured. Extracting the data users care about from these semi-structured result pages quickly and effectively, in order to provide specific services, is difficult. The main open problems are: (1) how to identify noise quickly and effectively, so that a page is cleaned as far as possible before the original page is analyzed; (2) how to quickly locate the main data region of a page from the DOM tree structure and the page's visual information; (3) how to extract page data as automatically as possible regardless of differences in page structure. Traditional page analysis methods based solely on the DOM tree can no longer meet these needs. Such methods rely mainly on the structural features of the DOM tree: they must parse every tag of the page to build the tree, they ignore useful visual features of the page, and whenever the page structure changes, the structure must be analyzed again before extraction. Microsoft Research Asia proposed a different approach to page data extraction, the VIPS (Vision-based Page Segmentation) algorithm. VIPS departs from traditional DOM-tree-based extraction: starting from human visual perception, it segments a page into visual blocks and reorganizes these blocks semantically into a visual block tree. The algorithm thus bridges the DOM tree structure and the semantics of the page.
Based on an analysis of the characteristics of Deep Web result pages and of human visual perception, this thesis proposes a Deep Web information extraction method, built on the VIPS algorithm, that is based on a reference visual block. The method first analyzes the page's tags and, before the parser turns the Web document into a syntax tree, removes information unrelated to the page's topic (for example, navigation bars and advertisements). The VIPS algorithm then partitions the optimized DOM tree into semantic blocks. After segmentation, a reference visual block is located from its coordinates; taking that block as the center, the DOM tree is traversed in both reverse and forward order, and linear feature-vector discrimination is used to find all similar visual blocks, which are then extracted. Experimental results indicate that the proposed method is feasible and that it improves extraction accuracy to some extent compared with traditional methods.
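The reference-block step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the abstract does not say how the reference block is chosen from its coordinates or how the feature-vector discrimination is defined. The `VisualBlock` fields, the "closest to the page center" rule in `find_reference_block`, the relative-difference `similarity` score, the 0.8 threshold, and the choice to stop scanning at the first dissimilar block are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class VisualBlock:
    """One visual block in document order. All fields are illustrative assumptions."""
    x: float
    y: float
    width: float
    height: float
    text_len: int
    link_count: int
    image_count: int

    def features(self):
        # Hypothetical feature vector; the thesis's actual features are not given.
        return [self.width, self.height, float(self.text_len),
                float(self.link_count), float(self.image_count)]


def similarity(a, b):
    """Return 1 minus the mean relative difference of the two feature vectors.

    A simple stand-in for the unspecified feature-vector discrimination.
    """
    diffs = [abs(p - q) / max(p, q, 1.0)
             for p, q in zip(a.features(), b.features())]
    return 1.0 - sum(diffs) / len(diffs)


def find_reference_block(blocks, page_width, page_height):
    """Assumed rule: pick the block whose center is nearest the page center."""
    cx, cy = page_width / 2, page_height / 2

    def distance(i):
        b = blocks[i]
        return hypot(b.x + b.width / 2 - cx, b.y + b.height / 2 - cy)

    return min(range(len(blocks)), key=distance)


def collect_similar(blocks, ref_index, threshold=0.8):
    """Scan backward, then forward, from the reference block.

    Assumption: data records are contiguous, so each scan stops at the
    first block whose similarity to the reference falls below threshold.
    """
    ref = blocks[ref_index]
    selected = [ref_index]
    for i in range(ref_index - 1, -1, -1):      # reverse-order traversal
        if similarity(blocks[i], ref) < threshold:
            break
        selected.append(i)
    for i in range(ref_index + 1, len(blocks)):  # forward-order traversal
        if similarity(blocks[i], ref) < threshold:
            break
        selected.append(i)
    return sorted(selected)
```

Calling `find_reference_block` on the ordered block list and passing the resulting index to `collect_similar` yields the indices of the blocks treated as data records. Comparing each candidate with the reference block, rather than with its neighbor, keeps small differences from accumulating along the scan.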
[Degree-granting institution]: Shanghai Normal University
[Degree level]: Master's
[Year degree awarded]: 2016
[Classification number]: TP393.092
Article ID: 2056343
Article link: http://sikaile.net/guanlilunwen/ydhl/2056343.html