Research on Sampling-Based Deep Web Data Source Selection Methods
Published: 2018-09-13 15:36
[Abstract]: With the rapid growth of information on the Internet, the Web holds a vast amount of information for people to use. Deep Web databases, however, are invisible to users: the information they contain can only be retrieved through specific query interfaces. To make full use of the rich and valuable information in the Deep Web and to query it more efficiently, building Deep Web data integration systems has become an active research topic. Selecting Deep Web databases is an important part of the query processing module in such an integration system. This thesis studies Deep Web data source selection from three aspects: obtaining data source features through sampling, evaluating sampling quality, and ranking and selecting data sources by an overall score computed from chosen evaluation indicators. First, building on the random walk sampling method, and addressing the lack of research on keyword attributes, the thesis analyzes the problem of attribute classification during sampling and proposes an extended method that introduces keyword attributes and classifies them. Because existing work also lacks treatment of categorical attributes with tree-shaped structure, the thesis proposes the concept of tree-structured categorical attributes and gives a method for handling them during sampling. Second, on the basis of the original random walk sampling method, sampling paths are saved so that each newly generated path can be scanned and compared against the existing ones. From this the thesis derives an improved random walk algorithm that avoids resubmitting queries for attribute values that share part of a path, and uses it to sample data sources, further improving sampling efficiency.
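The path-saving idea in the second point can be illustrated with a minimal sketch. The abstract does not give the thesis's algorithm, so everything here is an assumption made for illustration: the toy in-memory data source, the attribute order, the `TOP_K` result limit, and the names `CachedSource`, `random_walk_sample`, and `draw_samples`. The sketch only shows how remembering submitted paths keeps walks that share a prefix from sending the same query twice; the bias correction that a real random walk sampler needs is omitted.

```python
import random

# Hypothetical toy data source: each record is a dict of categorical attributes.
RECORDS = [
    {"make": m, "color": c, "year": y}
    for m in ("audi", "bmw", "fiat")
    for c in ("red", "blue")
    for y in ("2013", "2014")
]
ATTRS = ["make", "color", "year"]          # assumed fixed drill-down order
DOMAINS = {a: sorted({r[a] for r in RECORDS}) for a in ATTRS}
TOP_K = 2  # assumed interface limit: a query is usable only if it matches <= TOP_K records


class CachedSource:
    """Wraps the query interface and remembers every path already submitted."""

    def __init__(self, records):
        self.records = records
        self.path_cache = {}   # saved paths: tuple of (attr, value) -> matching records
        self.submitted = 0     # number of queries actually sent to the source

    def query(self, path):
        if path not in self.path_cache:    # compare the new path with the saved ones
            self.submitted += 1
            self.path_cache[path] = [
                r for r in self.records if all(r[a] == v for a, v in path)
            ]
        return self.path_cache[path]


def random_walk_sample(source, rng):
    """One walk: add one random attribute value per step until the result fits."""
    path = ()
    for attr in ATTRS:
        path += ((attr, rng.choice(DOMAINS[attr])),)
        hits = source.query(path)
        if not hits:
            return None                    # dead end: the caller starts a new walk
        if len(hits) <= TOP_K:
            return rng.choice(hits)        # usable result page: pick one record
    return None                            # still too many matches after all attributes


def draw_samples(source, n, seed=0):
    """Repeat walks until n records have been collected."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < n:
        record = random_walk_sample(source, rng)
        if record is not None:
            samples.append(record)
    return samples
```

With this toy data every walk stops at depth two, so at most nine distinct paths exist (three one-attribute paths and six two-attribute paths). Drawing 50 samples through `CachedSource` therefore submits at most nine queries, whereas the same walks without the saved paths would submit two queries each.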
Third, the sampling evaluation scheme takes into account the consistency of information content between the sample and the data source: a text similarity measure over textual content is introduced into the sampling quality evaluation and combined with the sample-set-to-data-source ratio method for measuring sample bias, giving a more complete evaluation of sampling quality. Fourth, data source quality is evaluated on the basis of the sample set obtained from sampling, using five indicators: authority, domain relevance, accuracy, redundancy, and timeliness. Quantification methods and formulas are given for all five. In computing the accuracy indicator, the semantic similarity calculation is improved by incorporating semantic similarity into the Hamming-distance similarity measure. A comprehensive evaluation over the five indicators yields an overall score for each data source, and the sources are ranked and selected by that score. Experiments show that the proposed methods substantially improve on the problems of earlier methods, improve both sampling quality and efficiency, and make the quality evaluation of the sample set more reliable and effective.
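The scoring and ranking step in the fourth point can be sketched as follows. The abstract does not state the thesis's formulas, so this is only an illustration under stated assumptions: the weights in `WEIGHTS`, the normalisation of every indicator to [0, 1] with higher meaning better, and the small `SYNONYMS` table standing in for a semantic resource are all hypothetical. The similarity function is one possible way to let semantic agreement soften a Hamming-distance comparison, not the thesis's method.

```python
# Assumed: every indicator is already normalised to [0, 1], higher is better
# (e.g. redundancy stored as 1 - duplicate ratio). The weights are hypothetical.
WEIGHTS = {
    "authority": 0.2,
    "relevance": 0.3,
    "accuracy": 0.2,
    "redundancy": 0.1,
    "timeliness": 0.2,
}

# Hypothetical synonym table standing in for a real semantic resource.
SYNONYMS = {frozenset({"car", "automobile"}), frozenset({"cost", "price"})}


def tokens_match(a, b):
    """Two tokens agree if they are identical or listed as synonyms."""
    return a == b or frozenset({a, b}) in SYNONYMS


def hamming_similarity(xs, ys):
    """1 - (Hamming distance / length) over equal-length token sequences.

    A position counts as a mismatch only if the two tokens still disagree
    after the synonym check.
    """
    if len(xs) != len(ys):
        raise ValueError("Hamming distance needs sequences of equal length")
    if not xs:
        return 1.0
    mismatches = sum(not tokens_match(a, b) for a, b in zip(xs, ys))
    return 1.0 - mismatches / len(xs)


def overall_score(indicators):
    """Weighted sum of the five indicator values."""
    return sum(WEIGHTS[name] * indicators[name] for name in WEIGHTS)


def rank_sources(sources):
    """Return source names sorted by overall score, best first."""
    return sorted(sources, key=lambda name: overall_score(sources[name]), reverse=True)
```

A plain Hamming comparison would treat `["cheap", "car", "price"]` and `["cheap", "automobile", "cost"]` as differing in two of three positions; with the synonym check the two sequences count as fully similar. `rank_sources` takes a dict that maps each source name to its five indicator values and returns the names ordered by descending overall score.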
[Degree-granting institution]: Shanghai Normal University
[Degree level]: Master's
[Year degree granted]: 2015
[Classification number]: TP393.09; TP311.13
Article ID: 2241591
Article link: http://sikaile.net/guanlilunwen/ydhl/2241591.html