Top-k Query Processing over Uncertain Data Streams
Published: 2019-03-16 10:13
[Abstract]: Uncertain data is pervasive across the information society, in fields such as finance, the military, location-based services, health care, and meteorology. With the rapid spread of the mobile Internet and the steady arrival of new data acquisition technologies, the volume of uncertain data is growing sharply, and uncertain data management has therefore attracted attention from researchers in both academia and industry. Data uncertainty arises in relational data, semi-structured data, data streams, and multidimensional data. This thesis studies Top-k query processing over uncertain data streams. An uncertain data stream is a massive sequence of uncertain tuples arriving at high speed. The main processing difficulties are: (1) the stream arrives very quickly and must be processed in time; (2) the data is potentially unbounded, so it is often impossible to hold all of it in memory; (3) because probabilities are involved, efficient optimization algorithms are needed to reduce the computation cost. Although the research community has produced many results, existing methods still have limitations in specific scenarios, so new techniques for managing uncertain data streams are needed. This thesis proposes a new approximate query algorithm for uncertain data streams that handles ER-Topk and TTk queries. In addition, to improve both stream throughput and query response, we design a general query processing framework for uncertain data streams. The work of this thesis covers the following aspects:
Approximate query algorithm for massive data streams: addresses the excessive storage consumption that arises when ER-Topk and TTk queries are processed over uncertain data streams. The algorithm filters arriving uncertain tuples effectively, reducing the processing load while keeping data precision under control, and thereby improves overall system performance.
Real-time uncertain data stream processing framework: building on the approximate algorithm, a batch processing framework for data streams is proposed for ER-Topk and TTk queries. The framework uses parallel processing to achieve high throughput on continuously and rapidly arriving data.
Data stream error detection: uncertain data streams often contain erroneous information caused by a variety of factors. To keep erroneous data from seriously affecting query results, this thesis proposes an error detection method that identifies anomalies by analyzing the characteristics of the data.
Validation of the framework: the proposed approximate algorithm and framework target ER-Topk and TTk queries over uncertain data streams. To verify their data throughput, reliability, and query response rate, the thesis designs several experimental strategies and evaluates actual performance on both synthetic and real data.
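The abstract does not define ER-Topk. As a minimal sketch, assuming it refers to the expected-rank Top-k semantics commonly used in the uncertain-data literature, the following computes expected ranks exactly over a small in-memory batch. The tuple layout `(id, score, probability)`, the independence of tuples, and the function names are assumptions made for illustration; this is not the thesis's approximate streaming algorithm, which the abstract does not describe in detail.

```python
from typing import List, Tuple

# Hypothetical layout: (identifier, score, existence probability).
UncertainTuple = Tuple[str, float, float]


def expected_ranks(tuples: List[UncertainTuple]) -> List[Tuple[str, float]]:
    """Expected rank of each tuple, assuming tuples exist independently.

    Assumed semantics: in a possible world, a present tuple's rank is the
    number of present tuples with a strictly higher score; an absent
    tuple's rank is the number of tuples present in that world.
    """
    total_mass = sum(p for _, _, p in tuples)
    ranks = []
    for ident, score, prob in tuples:
        # Expected number of present tuples that outscore this one.
        higher_mass = sum(p for _, s, p in tuples if s > score)
        # Expected number of other tuples present (rank when this one is absent).
        others_mass = total_mass - prob
        ranks.append((ident, prob * higher_mass + (1.0 - prob) * others_mass))
    return ranks


def er_topk(tuples: List[UncertainTuple], k: int) -> List[Tuple[str, float]]:
    """Return the k tuples with the smallest expected rank."""
    return sorted(expected_ranks(tuples), key=lambda r: r[1])[:k]


sample = [("a", 100.0, 0.5), ("b", 90.0, 1.0), ("c", 80.0, 0.9)]
print([ident for ident, _ in er_topk(sample, 2)])  # prints ['b', 'a']
```

In this example the highest-scored tuple `a` is not ranked first, because it is absent from half of the possible worlds. The quadratic loop is kept for readability and recomputes everything per batch, so it illustrates the query semantics rather than a way to meet the memory and throughput constraints listed in the abstract.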
[Degree-granting institution]: East China Normal University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP311.13
Article ID: 2441163
Article link: http://sikaile.net/shoufeilunwen/xixikjs/2441163.html