Research on Parallel Association Rule Mining Algorithms Based on the Hadoop Platform
Topics: big data + association rules; Source: master's thesis, Xi'an University of Science and Technology, 2017
[Abstract]: The explosive growth in data volume challenges traditional computing techniques and serial algorithms while also creating new opportunities, and the term "big data" emerged in response. At this scale, serial association rule algorithms must be rewritten; parallelizing them is urgent, and parallel computing on big data platforms is a sound solution. Association rule mining, which discovers relationships among items of information, is an important data mining task. The classic association rule algorithms, Apriori and FP-Growth, run out of memory on a single machine when processing big data. Hadoop lowers the programming difficulty and splits the data into shards, so parallel association rule algorithms on Hadoop are an important research topic. This thesis makes the following contributions.

(1) It studies and improves H-Apriori (Apriori algorithm based on Hadoop). In a big data environment the serial Apriori algorithm cannot handle massive data, and the intermediate stage of H-Apriori emits a large number of key/value pairs whose value is 1 and reads every transaction, which produces many candidates and wastes running time. The thesis removes infrequent items to cut redundant data, rebuilds the database, and optimizes the transaction-reading step, yielding an improved Hadoop-based algorithm. The improved algorithm effectively reduces the transaction database and uses hash-tree counting to shorten counting time, which improves efficiency.

(2) It proposes an improved FP-Growth algorithm with load-balanced data partitioning on the Hadoop platform. In a big data environment the serial FP-Growth algorithm cannot handle massive data, and PFP (Parallel FP-Growth) struggles once the data reaches a certain volume. The improved algorithm uses load estimation and an improved balanced grouping method to balance the groups, overcoming PFP's inability to cope with growing data volumes and its unbalanced load. It effectively balances the load across cluster nodes and shortens the overall running time on the cluster.

A Hadoop big data platform was set up and comparative experiments were run, validating the algorithms on authoritative datasets. The experiments show that the improved algorithms adapt better to big data and run more efficiently.
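The two ideas summarized in the abstract, removing infrequent items to shrink the transaction database and grouping items so that estimated load is even, can be illustrated with a minimal single-machine sketch. The abstract does not give the thesis's actual reduction rule, load estimate, or grouping method, so everything here is an assumption: the names `prune_transactions`, `balanced_groups`, `min_support`, and `item_loads` are hypothetical, the "at least two frequent items" filter is a common Apriori transaction-reduction heuristic, and the grouping is a generic greedy heaviest-first assignment standing in for the improved balanced grouping.

```python
from collections import Counter
import heapq


def prune_transactions(transactions, min_support):
    """Remove infrequent items and rebuild a smaller transaction database.

    Assumption: support is an absolute count, and a transaction left with
    fewer than two frequent items is dropped because it cannot contribute
    to any candidate of size two or more.
    """
    # First pass: count how many transactions contain each item.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item for item, c in counts.items() if c >= min_support}

    # Second pass: keep only frequent items in each transaction.
    reduced = []
    for t in transactions:
        kept = sorted(set(t) & frequent)
        if len(kept) >= 2:
            reduced.append(kept)
    return frequent, reduced


def balanced_groups(item_loads, n_groups):
    """Assign items to groups so that the summed estimated load is even.

    Assumption: item_loads maps each item to a precomputed load estimate.
    Items are taken heaviest first and placed in the currently lightest
    group (a greedy heuristic, not guaranteed optimal).
    """
    heap = [(0, g) for g in range(n_groups)]  # (current load, group index)
    heapq.heapify(heap)
    groups = [[] for _ in range(n_groups)]
    for item, load in sorted(item_loads.items(), key=lambda kv: (-kv[1], kv[0])):
        total, g = heapq.heappop(heap)
        groups[g].append(item)
        heapq.heappush(heap, (total + load, g))
    return groups
```

In a Hadoop setting, the first pass of `prune_transactions` would correspond to an item-counting MapReduce job and the second pass to a map-only rewrite of the input, while each group returned by `balanced_groups` would be handled by one reducer; that mapping is likewise a sketch, not the thesis's stated design.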
[Degree-granting institution]: Xi'an University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP311.13
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2098020.html