Research on Parallel Distributed Association Rule Mining Algorithms Based on the Hadoop Platform

Published: 2018-02-04 21:11

Keywords: association rule mining algorithm; parallelization of data mining; Apriori algorithm; Hadoop. Source: Jilin University, 2017 master's thesis. Document type: degree thesis.

[Abstract]: With the rapid development of science and technology in recent years, everyday activities carried out on computers, mobile phones, and other terminal platforms generate large amounts of data, and the ways of producing and acquiring data keep multiplying. In today's data era, data of all kinds is growing rapidly, and large Internet companies that produce hundreds of terabytes or even petabytes of data per day are common. How to extract information from such massive databases quickly, efficiently, and accurately is one of the current hot topics in computer science research. Parallel distributed mining algorithms are an important means of analyzing massive data that may be spread across regions, and they have significant research and practical value. Association rule mining is one of the classical data mining techniques and has strong value for study and reference. The traditional association rule mining algorithm buffers and outputs candidate sets one by one, and in a parallel setting these must also be exchanged over the network. With large data volumes, however, the number of generated candidate itemsets explodes, which burdens machine memory and degrades the algorithm's efficiency. To address these shortcomings, this thesis proposes an optimized algorithm, Y-IDA, which completes the merge-and-count step directly in memory instead of outputting candidate sets one by one. It also modifies the Hadoop interface to change the MapReduce read-in mode, and uses the first frequent itemset generated to clean the database. Together these changes reduce memory consumption and CPU time and improve the algorithm's execution efficiency.
The main work of this thesis includes: 1) implementing the basic serial Apriori algorithm as a foundation for subsequent parallelization; 2) proposing the optimized algorithm Y-IDA for parallel Apriori, which completes the merge-and-count step in memory instead of outputting candidate sets one by one, changes the traditional MapReduce read-in mode to reduce communication during execution, and cleans the data after the candidate 1-itemsets are generated to remove invalid records; 3) implementing the parallelized association rule algorithm on the Hadoop platform, proposing an experimental plan under the available experimental conditions, verifying that Y-IDA produces the same results as the classical algorithm, and comparing the two in detail in terms of time efficiency, memory consumption, disk reads and writes, and CPU usage. Experiments on a fully distributed Hadoop platform with discrete data mining test data show that the improved algorithm shortens execution time and performs well in memory consumption, CPU usage, and disk I/O, leading to the conclusion that the improved algorithm is feasible and generally applicable.
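The abstract gives no implementation details for Y-IDA, so the following is only an illustrative sketch of the two ideas it names: merging candidate counts in memory before emitting them, and cleaning the data with the frequent 1-itemsets. It simulates the map and reduce steps in plain Python rather than using any Hadoop API. All function names, the toy `splits` data, and `min_support` are hypothetical, and candidates are enumerated as all k-subsets of each transaction, a simplification of Apriori candidate generation.

```python
from collections import Counter
from itertools import combinations

def count_emit_each(split, k):
    """Baseline mapper: emit one (candidate, 1) pair per occurrence."""
    return [(c, 1) for t in split for c in combinations(sorted(t), k)]

def count_in_memory(split, k):
    """Mapper that merges counts in memory, emitting one pair per distinct candidate."""
    counts = Counter()
    for t in split:
        counts.update(combinations(sorted(t), k))
    return list(counts.items())

def reduce_counts(pairs, min_support):
    """Reducer: sum partial counts and keep itemsets that meet min_support."""
    totals = Counter()
    for itemset, n in pairs:
        totals[itemset] += n
    return {s: n for s, n in totals.items() if n >= min_support}

def prune(split, frequent_items):
    """Drop infrequent items, then drop transactions too short to hold a 2-itemset."""
    cleaned = (t & frequent_items for t in split)
    return [t for t in cleaned if len(t) >= 2]

# Hypothetical toy data: two input splits, each a list of transactions.
splits = [
    [{"a", "b", "c"}, {"a", "b"}, {"a", "d"}],
    [{"a", "b", "c"}, {"b", "c"}, {"e"}],
]
min_support = 3

# Pass 1: frequent 1-itemsets, then clean every split with them.
f1 = reduce_counts([p for s in splits for p in count_in_memory(s, 1)], min_support)
frequent_items = {itemset[0] for itemset in f1}
cleaned = [prune(s, frequent_items) for s in splits]

# Pass 2: both mappers give the same frequent 2-itemsets,
# but the in-memory version hands fewer pairs to the shuffle.
naive = [p for s in cleaned for p in count_emit_each(s, 2)]
merged = [p for s in cleaned for p in count_in_memory(s, 2)]
f2_naive = reduce_counts(naive, min_support)
f2 = reduce_counts(merged, min_support)

print(len(naive), len(merged))
print(f2)
```

On this toy input the two mappers agree on the result, while the merged version emits fewer intermediate pairs. That is the kind of communication saving the abstract attributes to in-memory counting, although the thesis's actual measurements are not reproduced here.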
Degree-granting institution: Jilin University
Degree level: Master's
Year conferred: 2017
Classification number: TP311.13

