Research on the Theory and Algorithms of Data Currency

Published: 2018-06-21 21:43

  Topics: data quality + data usability; Source: Harbin Institute of Technology, doctoral dissertation, 2016


[Abstract]: With the arrival of the big data era, data usability has attracted wide attention. The real world changes quickly over time, so data stored in databases becomes outdated and invalid. Existing statistics show that outdated data harms enterprise decision-making and people's daily lives, and that it degrades other dimensions of usability as well, for example by making data inconsistent, inaccurate, or incomplete. Ensuring data currency (timeliness) is therefore essential. Research on currency within the data usability field is still unsystematic and faces major challenges. First, many databases have no precise, usable timestamps, which makes the currency of a data set at a given moment, that is, its absolute currency, hard to determine. Second, different queries and application scenarios place different requirements on currency, and in some settings absolute currency cannot be determined at all, so determining the currency of data relative to a query or a user becomes especially important. Third, once the currency of a database has been determined, a method for repairing it is needed, but current research on data usability offers no repair method that can be applied directly to currency errors. Fourth, fully repairing a database from a single data source is very difficult and may be infeasible. Because different sources hold different data, the complete latest values of a target table can often be obtained only by using existing knowledge to combine data from other sources with the latest values in the target source. To address these challenges, this thesis presents a series of theories and algorithms that solve several key problems of data currency. The main contributions are summarized as follows.

(1) The thesis studies the principles for expressing absolute data currency and the algorithms for determining it. To overcome the limitations of the two existing families of methods, timestamp-based and rule-based, it formally defines uncertain currency rules and a corresponding data currency model. The rules and model can express uncertain domain knowledge, determine data currency quantitatively, and decide whether data is outdated at a specific moment. On this basis, the thesis first studies the fundamental problems of uncertain currency rules, such as axiomatization, satisfiability, and implication. It then gives a model for determining currency quantitatively, defining the currency of a data item, a tuple, and a data set in turn. Next, it organizes the temporal order relations among data items into a temporal order graph and, based on that graph, gives a polynomial-time currency determination algorithm. Finally, experiments on real data verify the effectiveness of the algorithm.

(2) The thesis studies the principles for expressing relative data currency and the algorithms for determining it. When absolute currency cannot be determined, or when the result does not adequately reflect user needs, redundant records and currency rules can be used to determine relative currency. Using these, the thesis establishes a model for determining relative currency and proposes algorithms to solve it. It first defines query-relative currency and groups queries into two classes, latest-value queries and currency-sequence queries. For each class it designs a method for determining the currency of query results and, treating the class as a whole, a method for determining the average currency of a data set relative to that class. It then divides users into three categories by query preference and studies user-relative currency. Finally, experiments on real and synthetic data verify the effectiveness of the algorithms and analyze how each parameter affects them.

(3) The thesis studies a rule-based model and algorithms for repairing currency errors. Repairing outdated data in a database to its latest values is a key step in improving data quality. Existing repair methods are mainly rule-based or statistics-based: rule-based methods struggle to express some complex associations in the data, while statistics-based methods must learn fairly complex conditional probability distributions and cannot directly use domain knowledge about data semantics. To overcome the shortcomings of both, the thesis proposes a new class of repair rules that combines rules with statistics. Such a rule expresses domain knowledge through a rule pattern and describes statistical information about how data changes over time through its own distribution table. First, the thesis studies the minimum rule pattern generation problem on static data, proves that it is NP-hard, and gives two polynomial-time approximation algorithms for it. Next, it studies the same problem on dynamic data and gives an algorithm that quickly updates the existing set of rule patterns as the data changes, completing the update in O(1) time in the best case. It also gives an algorithm for learning distribution tables on static data and an algorithm for updating them as the data changes. Then, it studies the problem of generating an optimal repair plan under different repair cost constraints, proves that the problem is solvable in polynomial time when the repair budget is positive infinity and NP-hard otherwise, and gives a solution method for each case. Finally, experiments on real and synthetic data sets demonstrate the effectiveness of these methods.

(4) The thesis studies query-based repair of currency errors. In data integration and Web settings, many tables are stored in different places. These tables often overlap in part, but different sources are updated at different frequencies. When a table is requested from a source, or a query is issued to it, the latest data of the target table is often unavailable because the source has not been updated in time. To repair the target table to its latest values, a conjunctive query must be constructed from the temporal order constraints and referential integrity constraints in the database such that its result consists exactly of the latest values of the target table. Such a query is called a currency-preserving query. The thesis studies how to construct currency-preserving queries given the temporal order relations and referential integrity constraints of a database. First, it gives a formal definition of the currency-preserving query, which yields the latest values of the target table. Next, it defines the schema currency graph, which expresses the temporal order constraints and referential integrity constraints among the tables of a database, and shows that a currency-preserving query is equivalent to a terminal tree in this graph. Then, it formalizes the minimum currency-preserving query generation problem, proves that minimizing a currency-preserving query is NP-hard, and gives minimization algorithms for different cases. Finally, experiments verify the effectiveness of the proposed models and algorithms.
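As a rough illustration of the graph-based determination mentioned in contribution (1), the sketch below builds a directed graph from pairwise "older than" facts between candidate values and uses Kahn's topological sort to check consistency and pick out the most current values. This is not the thesis's algorithm: the abstract does not describe the graph construction or the uncertain rules, so the function name `latest_values`, the input format, and the restriction to deterministic pairwise facts are all assumptions made for illustration.

```python
from collections import defaultdict, deque


def latest_values(values, older_than):
    """Hypothetical helper: find the most current candidate values.

    values     -- candidate values of one attribute of one entity
    older_than -- pairs (a, b) meaning "a is older than b"
                  (assumed to be already derived from currency rules)

    Returns (consistent, latest). `consistent` is False when the order
    facts contain a cycle; `latest` lists values with no newer value.
    """
    # Collect every node once, preserving first-seen order.
    nodes = list(dict.fromkeys(
        list(values) + [v for pair in older_than for v in pair]))
    succ = defaultdict(set)          # a -> set of values newer than a
    indeg = {v: 0 for v in nodes}    # number of older values per node
    for a, b in older_than:
        if b not in succ[a]:
            succ[a].add(b)
            indeg[b] += 1

    # Kahn's algorithm: repeatedly remove nodes with no older value.
    queue = deque(v for v in nodes if indeg[v] == 0)
    visited = 0
    while queue:
        v = queue.popleft()
        visited += 1
        for w in succ[v]:
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)

    consistent = visited == len(nodes)   # all nodes removed => no cycle
    latest = [v for v in nodes if not succ[v]] if consistent else []
    return consistent, latest


if __name__ == "__main__":
    facts = [("single", "married"), ("married", "divorced")]
    print(latest_values(["single", "married", "divorced"], facts))
    # (True, ['divorced'])
```

The sketch runs in time linear in the number of values and order facts, which is consistent with the polynomial-time bound stated in the abstract. A value left unordered against the others is reported as a possible latest value, which is one simple way to surface undetermined currency.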
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Doctoral
[Year conferred]: 2016
[Classification number]: TP311.13

Article ID: 2050107

Link to this article: http://sikaile.net/shoufeilunwen/xxkjbs/2050107.html

