Research on Data Extraction and Result Aggregation Techniques for the Deep Web
Topic: deep web; Source: Harbin Engineering University, master's thesis, 2012
[Abstract]: With the rapid development of computer networks, online resources have become increasingly abundant. This has broadened the channels through which people obtain information, but the disorder of that information also makes it hard for users to find what they need among the vast amount available. Search engines address this by providing retrieval and classification of online information. Some resources, however, cannot be indexed by traditional search engines. These deep web resources are online web databases that can nonetheless be accessed. Because they are rich in content, highly specialized, quickly and automatically updated, massive in volume, and broad in domain coverage, deep web resources are increasingly popular. Studying how to extract the data returned through deep web query interfaces, and how to aggregate the extracted results, therefore has significant theoretical and practical value.

This thesis studies data extraction and result aggregation for deep web resources. For data extraction, it first briefly introduces MDR and summarizes the efficiency problems MDR encounters when extracting information from deep web pages. Drawing on the MDR algorithm, it then improves MDR to reduce the time complexity of extraction. The extraction algorithm represents an HTML page as a tag tree: before extraction, the page is cleaned and normalized and its tag tree is constructed. Data records are then located by the structural similarity of tag trees. The similarity measure avoids the high time complexity of tree edit distance and the failure of element-comparison methods to reflect tree structure faithfully, and it performs well on deep web data extraction. Some data records, however, are only weakly similar to one another, and extraction based on tag-tree similarity then performs poorly. To recognize records with such tag structures, the thesis builds on the improved tag-tree similarity test and proposes a data record extraction algorithm based on incomplete (partial) subtree matching. Result aggregation focuses on deduplicating the extracted results: sorting the records by attribute weight before deduplication reduces the number of comparisons and makes deduplication fast and effective.

Experiments show that the data record extraction algorithm based on the structural similarity of tag-tree paths is more efficient than MDR, and that the data record discovery algorithm based on incomplete subtree matching extracts better than both MDR and the tag-tree-path algorithm. Deduplication after sorting by attribute weight is more efficient than direct deduplication.
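The idea of locating data records by the structural similarity of tag-tree paths can be illustrated with a small sketch. The abstract does not give the thesis's actual similarity formula, so everything below is an assumption made for illustration: the tuple representation of a tag tree, the function names `tag_paths`, `path_similarity`, and `find_record_runs`, the weighted Jaccard measure over tag-path counts, and the 0.7 threshold. The sketch only shows why a path-based measure can be computed in one pass over each subtree, unlike tree edit distance.

```python
from collections import Counter

# Assumed representation (not from the thesis): a tag-tree node is a tuple
# (tag, [children]). Text content is ignored; only structure is compared.

def tag_paths(node, prefix=""):
    """Count every root-to-node tag path in the subtree rooted at `node`."""
    tag, children = node
    path = f"{prefix}/{tag}"
    counts = Counter([path])
    for child in children:
        counts += tag_paths(child, path)
    return counts

def path_similarity(a, b):
    """Weighted Jaccard similarity of two subtrees' tag-path multisets."""
    pa, pb = tag_paths(a), tag_paths(b)
    shared = sum((pa & pb).values())  # per-path minimum counts
    total = sum((pa | pb).values())   # per-path maximum counts
    return shared / total if total else 1.0

def find_record_runs(parent, threshold=0.7):
    """Group adjacent children of `parent` whose structures are similar."""
    _, children = parent
    runs, current = [], []
    for child in children:
        if current and path_similarity(current[-1], child) >= threshold:
            current.append(child)
        else:
            if len(current) > 1:      # keep only runs of two or more siblings
                runs.append(current)
            current = [child]
    if len(current) > 1:
        runs.append(current)
    return runs

# Hypothetical result page: one banner followed by three result items.
banner = ("div", [("img", [])])
record = ("li", [("a", []), ("span", [])])
longer_record = ("li", [("a", []), ("span", []), ("span", [])])
page = ("div", [banner, record, record, longer_record])

runs = find_record_runs(page, threshold=0.7)
print(len(runs), [len(run) for run in runs])
```

Here the banner shares no tag path with the result items and is left out, while the three `li` subtrees form one run of candidate data records even though the last one carries an extra `span`.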
[Degree-granting institution]: Harbin Engineering University
[Degree level]: Master's
[Year degree conferred]: 2012
[Classification number]: TP393.09