Research on Data Extraction and Result Aggregation Techniques for the Deep Web

Published: 2018-06-19 11:19

Topic: deep web. Source: master's thesis, Harbin Engineering University, 2012

[Abstract]: With the rapid development of computer networks, online resources have become increasingly abundant. This widens the channels through which people obtain information, but the disorder of that information also makes it hard for users to find what they need in such a vast volume, so search engines provide retrieval and classification of online information. Some online resources, however, cannot be indexed by traditional search engines. These are called deep web resources: online web databases that are accessible but that traditional search engines cannot index. Deep web resources are increasingly popular because they are rich in content, highly specialized, quick to update automatically, massive in volume, and broad in domain coverage. Studying how to extract the data returned through deep web query interfaces, and how to aggregate the extracted results, therefore has significant theoretical and practical value.

This thesis studies data extraction and result aggregation for deep web resources. For data extraction, it first briefly introduces MDR and summarizes the efficiency problems MDR encounters when extracting information from deep web pages. Drawing on the MDR data extraction algorithm, it then improves MDR to reduce the time complexity of data extraction. The extraction algorithm represents an HTML page as a tag tree; before extraction, the page is cleaned and normalized and the tag tree is built. Data records are located using the structural similarity of tag trees. The similarity computation avoids the high time complexity of tree edit distance algorithms and the inability of element comparison methods to reflect the true tree structure, and it performs well in deep web data extraction. Some data records, however, have low similarity to one another, and for these the tag-tree similarity approach extracts poorly. To identify data records with such tag structures, the thesis builds on the improved tag-tree structural similarity test and proposes a data record extraction algorithm based on incomplete subtree matching.

Result aggregation focuses on deduplicating the extracted results. Sorting records by attribute weight before deduplication reduces the number of comparisons and makes deduplication fast and effective.

Experiments show that the data record extraction algorithm based on structural similarity of tag tree paths is more efficient than MDR, and that the data record discovery algorithm based on incomplete subtree matching extracts better than both MDR and the tag-tree-path similarity algorithm. Deduplication after sorting by attribute weight is more efficient than direct deduplication.
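The abstract does not give the similarity formula, so the following is only a minimal sketch of the general idea of comparing sibling subtrees by their tag paths. It assumes the page has already been cleaned into well-nested HTML, and it uses a multiset Jaccard measure over root-to-node tag paths as a stand-in for the thesis's own measure. The names `Node`, `TagTreeBuilder`, `tag_paths`, and `path_similarity` are hypothetical and chosen for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so the builder must not descend into them.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """One element of the tag tree (text content is ignored)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TagTreeBuilder(HTMLParser):
    """Builds a tag tree; assumes the input is already cleaned and well nested."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Only move up when the closing tag matches the open element.
        if tag not in VOID_TAGS and self.current.tag == tag and self.current.parent:
            self.current = self.current.parent


def tag_paths(node, prefix=""):
    """Return a multiset of tag paths from `node` down to every descendant."""
    path = f"{prefix}/{node.tag}"
    paths = Counter([path])
    for child in node.children:
        paths += tag_paths(child, path)
    return paths


def path_similarity(a, b):
    """Multiset Jaccard similarity of two subtrees' tag paths, in [0, 1].

    Assumption: this measure is illustrative, not the thesis's formula.
    """
    pa, pb = tag_paths(a), tag_paths(b)
    shared = sum((pa & pb).values())
    total = sum((pa | pb).values())
    return shared / total


html = """
<ul>
  <li><a>Title A</a><span>10</span></li>
  <li><a>Title B</a><span>12</span></li>
  <li><b>Sponsored</b></li>
</ul>
"""
builder = TagTreeBuilder()
builder.feed(html)
items = builder.root.children[0].children  # the three <li> subtrees

print(path_similarity(items[0], items[1]))  # identical structure
print(path_similarity(items[0], items[2]))  # different structure
```

In this sketch, sibling subtrees whose similarity exceeds some threshold would be treated as candidate data records. Collecting paths takes one pass over each subtree, which is the kind of saving over tree edit distance that the abstract alludes to, while still reflecting nesting in a way a flat tag count would not.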
[Degree-granting institution]: Harbin Engineering University
[Degree level]: Master's
[Year conferred]: 2012
[Classification number]: TP393.09


Article ID: 2039713

Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2039713.html
