Research on Data Mining Algorithms Based on Compressed Databases

Published: 2018-09-13 14:23
[Abstract]: With economic growth and advances in science and technology, every industry has accumulated large volumes of data. Scientific and statistical databases store important data of many kinds, such as the results of scientific experiments, geographic surveys, censuses, and records of economic activity. These data are usually static: once entered, they rarely change and are retained permanently. As a result, such databases tend to be massive, and applying the query, computation, and analysis methods of traditional databases to them incurs I/O costs that are unacceptably high. Compressing massive databases has therefore become an important research direction. Database researchers have proposed many database compression algorithms, but little work addresses data mining and analysis performed on compressed databases. This thesis studies how to mine data efficiently on compressed databases, and makes four contributions.

First, based on the static, sparse, clustered, and repetitive nature of scientific and statistical databases, the thesis proposes a new Block-based database compression algorithm and analyzes its compression ratio theoretically. Comparative experiments against other database compression algorithms show that it achieves a high compression ratio on scientific and statistical databases.

Second, for association rule mining, the thesis proposes the CApriori algorithm, which runs directly on databases compressed with the Block-based method. The thesis analyzes theoretically the running-time improvement of CApriori over Apriori, and experiments confirm that CApriori has better time performance on compressed scientific and statistical databases.

Third, for clustering, the thesis proposes C-Kmeans, an improved K-means algorithm that runs directly on the compressed database. Because the running time of K-means is linear in the number of data records, most of that time is spent on I/O; by reading the compressed database directly and mining it, C-Kmeans saves a great deal of time.

Fourth, existing frequent pattern mining algorithms for the vertical data layout of transaction databases perform many tidset intersections, which generate a large number of intermediate results and require frequent reads and writes to external storage. To address this, the thesis proposes the CONVTV compression algorithm, which stores vertical data in two different formats and achieves a high compression ratio on most data sets.
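For readers unfamiliar with the vertical data layout mentioned in the abstract, the following is a minimal Python sketch of tidset intersection in general. It is not the CONVTV algorithm, whose two storage formats the abstract does not describe. The function names `build_tidsets` and `frequent_pairs` and the sample transactions are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def build_tidsets(transactions):
    """Vertical layout: map each item to the set of transaction ids containing it."""
    tidsets = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets


def frequent_pairs(tidsets, min_support):
    """Find frequent 2-itemsets by intersecting the tidsets of frequent items.

    The support of an itemset equals the size of the intersection of its
    items' tidsets. Every intersection kept here is an intermediate result
    that a full miner would reuse for longer itemsets.
    """
    frequent_items = sorted(
        item for item, tids in tidsets.items() if len(tids) >= min_support
    )
    result = {}
    for a, b in combinations(frequent_items, 2):
        shared = tidsets[a] & tidsets[b]  # tidset intersection
        if len(shared) >= min_support:
            result[(a, b)] = shared
    return result


# Hypothetical sample data (horizontal layout: one set of items per transaction).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]
tidsets = build_tidsets(transactions)
pairs = frequent_pairs(tidsets, min_support=2)
```

Each stored intersection can be as large as the smaller of its two input tidsets, which is why the intermediate results grow quickly on large transaction databases and why a compact representation of tidsets matters.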
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP311.13


Article ID: 2241424

Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2241424.html
