
Research on Parallel Distributed Association Rule Mining Algorithms on the Hadoop Platform

Published: 2018-02-04 21:11

  Keywords: association rule mining algorithm; parallelization of data mining; Apriori algorithm; Hadoop. Source: Jilin University, 2017 master's thesis. Document type: degree thesis


[Abstract]: With the rapid development of science and technology in recent years, people's everyday activities on computers, mobile phones, and other terminal platforms generate large volumes of data, and the ways of producing and acquiring data keep multiplying. In today's data era, data of all kinds is growing rapidly, and large internet companies that produce hundreds of terabytes or even petabytes of data per day are common. How to extract information quickly, efficiently, and accurately from databases of this size is one of the current research hotspots in computer science. Parallel distributed mining algorithms are an important means of analyzing massive data that may be spread across regions, and they have significant research and practical value.

Association rule mining is one of the classical data mining techniques and is valuable both for study and as a reference. Traditional association rule mining algorithms buffer and output candidate sets one by one, and under parallelization those candidates must also be exchanged over the network. With large data volumes, however, the number of generated candidate itemsets explodes, which burdens machine memory and degrades the efficiency of the algorithm. To address this defect, the thesis proposes an optimized algorithm, Y-IDA, which completes the merge-and-count step directly in memory in place of the traditional one-by-one output of candidate sets. It also modifies the Hadoop interface to change how MapReduce reads its input, and uses the first generated frequent itemset to clean the database. Together these changes reduce memory consumption and CPU time and improve execution efficiency.

The main work of the thesis is as follows: 1) it implements the basic serial Apriori algorithm, laying the foundation for the subsequent parallelization; 2) for the parallel Apriori algorithm it proposes the optimized algorithm Y-IDA, which completes the merge-and-count step in memory instead of outputting candidate sets one by one, changes the traditional MapReduce read-in mode to reduce communication during execution, and cleans the data after the candidate 1-itemsets are generated to remove invalid data; 3) it implements the parallel association rule algorithm on the Hadoop platform, proposes an experimental plan under the available experimental conditions, verifies that Y-IDA produces the same results as the classical algorithm, and compares the two in detail in terms of time efficiency, memory consumption, disk reads and writes, and CPU usage.

Experiments on a fully distributed Hadoop platform using discrete data mining test data show that the improved algorithm shortens execution time and performs well in memory consumption, CPU usage, and disk I/O, leading to the conclusion that the improved algorithm is feasible and generally applicable.
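The abstract names three ideas: level-wise Apriori mining, merging counts in memory instead of emitting candidates one by one, and cleaning the data once the frequent 1-itemsets are known. The sketch below illustrates those ideas in plain Python under stated assumptions. It is not the thesis's Y-IDA implementation, whose details the abstract does not give, and it does not use Hadoop. The map and reduce phases are simulated with ordinary functions, and all names (`local_counts`, `merge_counts`, `clean_splits`, `mine`) are hypothetical.

```python
from collections import Counter
from itertools import combinations


def local_counts(split, k, allowed=None):
    """Simulated map task: count k-item candidates of one split in memory.

    Returns one Counter per split instead of emitting a (candidate, 1)
    pair for every occurrence.
    """
    counts = Counter()
    for transaction in split:
        items = sorted(set(transaction))
        for candidate in combinations(items, k):
            # Apriori pruning: every (k-1)-subset must already be frequent.
            if allowed is None or all(
                sub in allowed for sub in combinations(candidate, k - 1)
            ):
                counts[candidate] += 1
    return counts


def merge_counts(partials, min_support):
    """Simulated reduce task: sum partial counts, keep frequent itemsets."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return {itemset: n for itemset, n in total.items() if n >= min_support}


def clean_splits(splits, frequent_items):
    """Drop infrequent items, then drop transactions left empty."""
    cleaned = []
    for split in splits:
        kept = [[i for i in t if i in frequent_items] for t in split]
        cleaned.append([t for t in kept if t])
    return cleaned


def mine(splits, min_support):
    """Level-wise mining over pre-partitioned transaction splits."""
    level = merge_counts([local_counts(s, 1) for s in splits], min_support)
    result = dict(level)
    # Clean the data once, using the frequent 1-itemsets.
    splits = clean_splits(splits, {itemset[0] for itemset in level})
    k = 2
    while level:
        allowed = set(level)
        level = merge_counts(
            [local_counts(s, k, allowed) for s in splits], min_support
        )
        result.update(level)
        k += 1
    return result
```

Calling `mine(splits, min_support)` on a list of splits, each a list of transactions, returns a dictionary mapping each frequent itemset (a sorted tuple) to its absolute support count. In this sketch, one `Counter` per split plays the role that in-memory aggregation would play in a real map task: the data handed to the merge step grows with the number of distinct candidates, not with the number of occurrences.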
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13


Article ID: 1491151



Article link: http://sikaile.net/shoufeilunwen/xixikjs/1491151.html

