

Research on Data Extraction and Result Aggregation Techniques for the Deep Web

Published: 2018-06-19 11:19

  Topic: deep web; Source: Harbin Engineering University, master's thesis, 2012


[Abstract]: With the rapid development of computer networks, online resources have become increasingly abundant. This has broadened the channels through which people obtain information, but the disorder of that information also makes it hard for users to find what they need in the vast volume available. Search engines address this by providing retrieval and classification of online information. One class of resources, however, cannot be indexed by traditional search engines: deep web resources, which are online web databases that users can still access. Deep web resources are increasingly favored because they are rich, highly specialized, quickly and automatically updated, massive in volume, and broad in domain coverage. Studying how to extract the data returned through deep web query interfaces, and how to aggregate the extracted results, therefore has significant theoretical and practical value.

This thesis studies data extraction and result aggregation for deep web resources. For data extraction, it first briefly introduces MDR and summarizes the efficiency problems MDR encounters when extracting information from deep web pages; drawing on the MDR algorithm, it then improves MDR to reduce the time complexity of extraction. The extraction algorithm represents an HTML page as a tag tree: before extraction, the page is cleaned and normalized and the tag tree is constructed. Data records are located using the structural similarity of tag trees. The similarity computation avoids the high time complexity of the tree edit distance algorithm and overcomes the inability of the element comparison method to reflect tree structure faithfully, and it extracts well on deep web pages. Some data records, however, have low similarity to one another, and for these the extraction algorithm based on tag tree similarity performs poorly. To recognize data records with such tag structures, the thesis builds on the improved structural-similarity approach and proposes a data record extraction algorithm based on incomplete subtree matching. Result aggregation focuses on deduplicating the extracted results: records are sorted by attribute weight before deduplication, which reduces the number of comparisons and makes deduplication fast and effective.

Experiments show that the data record extraction algorithm based on the structural similarity of tag tree paths is more efficient than MDR, and that the data record discovery algorithm based on incomplete subtree matching extracts better than both MDR and the tag-tree-path algorithm. Deduplication after sorting by attribute weight is more efficient than direct deduplication.
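The abstract's idea of locating data records by the structural similarity of tag tree paths can be illustrated with a small sketch. The thesis's actual similarity formula is not given on this page, so the Dice-style overlap of root-to-leaf tag paths used here is an assumption, as are all names (`Node`, `TagTreeBuilder`, `tag_paths`, `path_similarity`). The parser is Python's standard `html.parser`, and the page cleaning and normalization step is omitted.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never have children; the parser must not descend into them.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TagTreeBuilder(HTMLParser):
    """Builds a tag tree from HTML, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray closes.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def tag_paths(node, prefix=""):
    """Multiset of tag paths from `node` down to each leaf of its subtree."""
    path = f"{prefix}/{node.tag}"
    if not node.children:
        return Counter([path])
    paths = Counter()
    for child in node.children:
        paths += tag_paths(child, path)
    return paths


def path_similarity(a, b):
    """Assumed measure: Dice overlap of the two subtrees' tag-path multisets."""
    pa, pb = tag_paths(a), tag_paths(b)
    shared = sum((pa & pb).values())
    return 2 * shared / (sum(pa.values()) + sum(pb.values()))


# Made-up result page: two structurally identical records and one outlier.
html = """
<ul>
  <li><a>Title A</a><span>10</span></li>
  <li><a>Title B</a><span>12</span></li>
  <li><b>Sponsored</b></li>
</ul>
"""
builder = TagTreeBuilder()
builder.feed(html)
first, second, third = builder.root.children[0].children
print(path_similarity(first, second))  # 1.0
print(path_similarity(first, third))   # 0.0
```

Comparing path multisets takes time roughly linear in the size of the two subtrees, which is the kind of saving over tree edit distance the abstract refers to. Siblings whose similarity exceeds a chosen threshold would then be grouped as candidate data records.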
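The deduplication step, sorting by attribute weight first so that fewer record pairs need comparing, might look like the sketch below. The abstract does not say how the sorted order is exploited, so the sliding-window comparison (in the style of sorted-neighborhood methods), the weighted `difflib` string similarity, the threshold, and the sample records are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def record_similarity(a, b, weights):
    """Weighted average of per-attribute string similarity (assumed measure)."""
    total = sum(weights.values())
    score = 0.0
    for attr, weight in weights.items():
        ratio = SequenceMatcher(None, a.get(attr, ""), b.get(attr, "")).ratio()
        score += weight * ratio
    return score / total


def deduplicate(records, weights, threshold=0.9, window=2):
    """Sort by attributes in descending weight, then compare near neighbors only."""
    order = sorted(weights, key=weights.get, reverse=True)
    ranked = sorted(
        records, key=lambda r: tuple(r.get(attr, "").lower() for attr in order)
    )
    kept = []
    for record in ranked:
        # Likely duplicates sort next to each other, so only the last
        # `window` kept records are checked instead of all of them.
        recent = kept[-window:]
        if not any(
            record_similarity(record, other, weights) >= threshold
            for other in recent
        ):
            kept.append(record)
    return kept


# Made-up extracted records; the third is a near-duplicate of the first.
records = [
    {"title": "Web Data Mining", "author": "A. Author", "year": "2010"},
    {"title": "Graph Algorithms", "author": "B. Writer", "year": "2005"},
    {"title": "Web Data Mining.", "author": "A. Author", "year": "2010"},
]
weights = {"title": 0.6, "author": 0.3, "year": 0.1}

unique = deduplicate(records, weights)
print([r["title"] for r in unique])  # ['Graph Algorithms', 'Web Data Mining']
```

With n records, direct deduplication compares up to n(n-1)/2 pairs, while the windowed pass after sorting makes at most n × window comparisons. The trade-off is that duplicates which differ in the highest-weight attribute may sort far apart and be missed.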
[Degree-granting institution]: Harbin Engineering University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP393.09




Article ID: 2039713


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2039713.html

