Research and Implementation of a Big Data Analysis and Processing System Based on Fuzzy Queries

Published: 2018-05-07 06:54

Topic: online aggregation + sample…; Source: Zhejiang University, 2017 master's thesis

[Abstract]: As big data analysis technology matures, the value contained in big data is receiving more and more attention. Because the data volume is so large, analyzing big data is usually time-consuming. In many cases, however, users do not need exact query results; a rough profile of the data satisfies most analysis needs. This thesis studies and implements a big data analysis and processing system based on fuzzy queries. The system defines a set of query interfaces that support a variety of aggregate (Group By) queries, and it returns a fuzzy result for each user query. It can return fuzzy query results over hundreds of gigabytes of data within seconds.

Online aggregation can generate a profile of the data quickly, so this thesis applies it in the system. In addition, the result sets of adjacent queries in the system overlap: if the samples collected and the intermediate results computed for queries already processed are saved, later queries can be processed faster. On this basis, the thesis optimizes online aggregation. First, the data set is randomized to produce a random data set, so that scanning the random data set sequentially has the effect of sampling randomly from the original data set. Then, user queries are processed with online aggregation. While generating query results, online aggregation stores the samples already obtained and the intermediate results produced in a sample management tree. Correspondingly, each user query is processed against this tree first; only when the result obtained from the tree cannot meet the user's requirements does the system read data from the data source. In this way, the samples and intermediate results gathered during online aggregation can be used effectively by multiple queries. The thesis also provides a method for integrating multiple intermediate results to produce the final query result. Finally, experimental results on the TPC-H benchmark verify the effectiveness of the system designed and implemented in this thesis.
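
The pipeline described in the abstract (randomize once, scan sequentially, reuse what earlier queries already sampled, merge intermediate results) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: `OnlineAggregator`, `avg_by_group`, and `merge_stats` are hypothetical, the flat per-group cache is a simple stand-in for the thesis's sample management tree (whose structure the abstract does not describe), and the normal-approximation confidence interval is a common choice for online aggregation rather than a method the thesis is known to use.

```python
import math
import random
from collections import defaultdict


class OnlineAggregator:
    """Running per-group AVG estimates over a pre-shuffled table.

    Hypothetical sketch: the flat cache below is NOT the thesis's sample
    management tree, only a simple stand-in for the same idea of reuse.
    """

    def __init__(self, rows, seed=0):
        # Shuffle once up front; a sequential scan of the shuffled copy
        # then behaves like random sampling without replacement.
        self.rows = list(rows)
        random.Random(seed).shuffle(self.rows)
        self.pos = 0  # number of shuffled rows consumed so far
        # Cached intermediate results per group: [count, sum, sum of squares].
        self.stats = defaultdict(lambda: [0, 0.0, 0.0])

    def _consume(self, n):
        """Read up to n more rows from the data source into the cache."""
        end = min(self.pos + n, len(self.rows))
        for key, value in self.rows[self.pos:end]:
            s = self.stats[key]
            s[0] += 1
            s[1] += value
            s[2] += value * value
        self.pos = end

    def avg_by_group(self, max_half_width, batch=1000, z=1.96):
        """Return {group: (mean, half_width)}, scanning more rows only if
        the cached samples cannot meet the requested half-width."""
        while True:
            result = {}
            for key, (n, total, sq) in self.stats.items():
                mean = total / n
                if n < 2:
                    half_width = math.inf  # one sample gives no variance
                else:
                    var = max((sq - n * mean * mean) / (n - 1), 0.0)
                    half_width = z * math.sqrt(var / n)
                result[key] = (mean, half_width)
            good = bool(result) and all(
                hw <= max_half_width for _, hw in result.values()
            )
            if good or self.pos >= len(self.rows):
                return result
            self._consume(batch)


def merge_stats(a, b):
    """Combine two caches built from disjoint samples by adding fields."""
    merged = defaultdict(lambda: [0, 0.0, 0.0])
    for part in (a, b):
        for key, (n, total, sq) in part.items():
            m = merged[key]
            m[0] += n
            m[1] += total
            m[2] += sq
    return dict(merged)
```

Under these assumptions, a first call such as `agg.avg_by_group(max_half_width=0.5)` scans only as many shuffled rows as the tolerance requires; a later call with a looser tolerance is answered from the cached statistics without touching the data source, and a tighter one resumes the scan from where the previous query stopped. Keeping count, sum, and sum of squares per group is what makes partial results mergeable by simple addition, provided the partial samples are disjoint.
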
[Degree-granting institution]: Zhejiang University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP311.13

