Research on Data Fusion for Deep Web Data Integration
Published: 2018-02-11 18:49
Keywords: Deep Web data integration; Deep Web data source quality assessment; data fusion. Source: Shandong University, 2012 master's thesis. Document type: degree thesis
[Abstract]: As Internet technology has advanced, the Web has come to hold an ever larger and richer body of information, making it a huge, widely distributed, global online information source. In recent years in particular, large databases of every kind have been built to meet personal and business needs, and the Web has become an indispensable part of daily life. Data on the Web is disorganized and its information types are complex and varied. By the way data is accessed, the Web can be divided into the Surface Web and the Deep Web. The Surface Web is the collection of static pages that are reachable through hyperlinks and can be indexed by traditional search engines. The Deep Web is the set of online databases accessible on the Web whose content cannot be indexed by traditional search engines and is instead hidden behind query interfaces. Research shows that the Deep Web is characterized by large data volume, comprehensive domain coverage, strong topical focus, and a high degree of information structure. To make full use of these valuable resources for further analysis and mining, data integration for the Deep Web is urgently needed.
In every domain, the amount of Deep Web information is growing explosively, and the kinds of data sources and types of information are increasingly diverse. However, this information is not always trustworthy, and different data sources often provide heterogeneous, conflicting data. Obtaining the correct information that people actually need from this mass of data has become a major challenge for information integration. Data fusion is therefore needed to separate true values from false ones and to obtain high-quality data that supports analysis and decision-making.
Data fusion has received growing attention, and many researchers have contributed to the field. The following problems remain open: (1) Deep Web data sources vary widely in quality, and so do the values they provide; values from higher-quality sources tend to have higher confidence. The quality of each data source should therefore be assessed before data fusion, and the assessment results applied during truth discovery. (2) There is not yet a reasonably complete, standard method for data fusion, so several factors, including the accuracy of data sources, the dependence between data sources, and the implication between values, must be considered together to resolve data conflicts and discover true values.
This thesis targets data integration for the Deep Web and studies Deep Web data source quality assessment and truth discovery methods. The main work and contributions are summarized as follows.
1. This thesis proposes a Deep Web data source quality assessment model. Deep Web data sources differ greatly from one another, and sources of different quality tend to provide data of different quality. However, most current data fusion research does not assess data source quality separately. Instead, it assigns every source the same initial quality value at the start of the computation and refines that value through an iterative algorithm. To improve data fusion, we propose a method that assesses Deep Web data source quality before fusion begins. Tailored to the characteristics of data fusion, the method selects multiple factors across three dimensions (data quality, interface page quality, and service quality) as evaluation criteria, quantifies each factor, and then computes a unified score for each data source. The resulting quality assessments are applied in the subsequent data fusion. Experiments show that the model assesses data source quality fairly accurately, and that applying the assessment results in the data fusion process clearly improves fusion.
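The unified scoring step described above can be sketched as follows. This is a minimal illustration only, not the thesis's model: the abstract names the three dimensions but not the individual factors, how they are quantified, or how they are combined. The factor names, the equal weighting within a dimension, and the values in `DIMENSION_WEIGHTS` are all assumptions made here for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical dimension weights; the abstract does not state the real ones.
DIMENSION_WEIGHTS = {"data": 0.5, "interface": 0.2, "service": 0.3}


@dataclass
class SourceFactors:
    """Quantified factors in [0, 1], grouped by the three dimensions."""
    data: Dict[str, float]        # e.g. completeness, freshness (assumed names)
    interface: Dict[str, float]   # e.g. usability (assumed name)
    service: Dict[str, float]     # e.g. availability, speed (assumed names)


def dimension_score(factors: Dict[str, float]) -> float:
    """Average the factors of one dimension (equal weighting is an assumption)."""
    if not factors:
        return 0.0
    return sum(factors.values()) / len(factors)


def quality_score(src: SourceFactors) -> float:
    """Combine the three dimension scores into a single score in [0, 1]."""
    return (
        DIMENSION_WEIGHTS["data"] * dimension_score(src.data)
        + DIMENSION_WEIGHTS["interface"] * dimension_score(src.interface)
        + DIMENSION_WEIGHTS["service"] * dimension_score(src.service)
    )


def rank_sources(sources: Dict[str, SourceFactors]) -> List[Tuple[str, float]]:
    """Return (source name, score) pairs, best source first."""
    scored = [(name, quality_score(src)) for name, src in sources.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The scores produced this way could then serve as per-source starting values for fusion, in place of the identical initial values that the abstract says most existing methods use.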
2. This thesis proposes a truth discovery method for Deep Web data integration. In every domain, the volume of Deep Web data is surging and large amounts of conflicting data exist, so finding the correct values that people need among the conflicting data is crucial. Drawing on our research background (data integration for market intelligence), we propose a data fusion computation model for Deep Web data integration. The model jointly considers the accuracy of data sources, the dependence between data sources, and the implication between different values in order to find true values among conflicting data. Because these factors interact, we compute them iteratively, refining their values until the results converge. We also incorporate the data source quality assessment results into the model. Experimental data show that the proposed truth discovery model is more effective.
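The iterate-until-convergence idea can be illustrated with a deliberately simplified sketch. It is not the thesis's model: it uses plain accuracy-weighted voting, seeds source accuracy from prior quality scores, and leaves out both the dependence between sources and the implication between values. The function name `discover_truth`, its parameters, and the default accuracy of 0.5 for sources without a prior are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Tuple

Claims = Dict[str, Dict[str, str]]  # source -> {object: claimed value}


def discover_truth(
    claims: Claims,
    prior_quality: Dict[str, float],
    max_iter: int = 100,
    tol: float = 1e-6,
) -> Tuple[Dict[str, str], Dict[str, float]]:
    """Alternate between value confidence and source accuracy until stable."""
    # Seed accuracy from the prior quality scores (0.5 if a source has none).
    accuracy = {s: prior_quality.get(s, 0.5) for s in claims}
    truths: Dict[str, str] = {}
    for _ in range(max_iter):
        # Step 1: each source votes for its claimed values, weighted by accuracy.
        votes = defaultdict(lambda: defaultdict(float))
        for source, facts in claims.items():
            for obj, value in facts.items():
                votes[obj][value] += accuracy[source]
        # Step 2: normalise the votes per object and pick the top value.
        confidence: Dict[str, Dict[str, float]] = {}
        for obj, tally in votes.items():
            total = sum(tally.values())
            confidence[obj] = {
                v: (w / total if total > 0 else 0.0) for v, w in tally.items()
            }
            truths[obj] = max(tally, key=tally.get)
        # Step 3: a source's accuracy is the mean confidence of its claims.
        new_accuracy = {}
        for source, facts in claims.items():
            if facts:
                total_conf = sum(confidence[o][v] for o, v in facts.items())
                new_accuracy[source] = total_conf / len(facts)
            else:
                new_accuracy[source] = accuracy[source]
        delta = max(
            (abs(new_accuracy[s] - accuracy[s]) for s in claims), default=0.0
        )
        accuracy = new_accuracy
        if delta < tol:  # accuracies have stopped changing
            break
    return truths, accuracy
```

When two sources disagree one-to-one, the prior quality scores decide the outcome in this sketch, which is one way assessment results obtained before fusion could influence truth discovery.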
[Degree-granting institution]: Shandong University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP202; TP311.13
Article ID: 1503748
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1503748.html