An Improved Parallel Fp-Growth Algorithm Based on Hadoop
[Abstract]: Frequent pattern mining is an important task in the field of data mining and is widely used in research on transaction databases, time series databases, and many other kinds of databases. However, the traditional Frequent-pattern Growth algorithm (Fp-Growth for short) runs into both storage and computation bottlenecks when dealing with large-scale data, so the algorithm needs to be parallelized. Existing parallel Fp-Growth algorithms have solved the problem of how to partition the set of database transactions while ensuring that the resulting partitions are independent of each other. However, these algorithms and their partitioning schemes do not take load balancing into account. A load-balanced parallel Fp-Growth algorithm is therefore the main problem addressed in this paper. Hadoop is an open source distributed parallel programming framework under the Apache Foundation that allows clusters of computers to process large distributed data sets using simple programming models. Hadoop handles the scheduling, distributed storage, fault tolerance, network communication, and other system-level problems of parallel computing, so developers only need to pay attention to the algorithm itself. For these reasons, this paper uses the Hadoop framework to implement the parallel Fp-Growth algorithm. The main work of this paper has two parts: the first is to improve the existing parallel Fp-Growth algorithm, and the second is to apply the improved parallel algorithm to mining frequent user access sequences. First, based on a survey of research on parallel Fp-Growth algorithms at home and abroad, this paper improves the grouping strategy of the existing parallel Fp-Growth algorithm by estimating the load of each frequent item.
Experiments show that the improved parallel Fp-Growth algorithm outperforms the existing parallel Fp-Growth algorithm, with better load balancing and execution efficiency. Second, Web server logs store a large amount of user access information, and valuable hidden information about user behavior can be discovered from this massive data. The proposed algorithm is therefore applied to Web log mining, where it is used to mine frequent user access sequences. The results can provide guidance and reference for the websites from which the logs originate, and thus have practical and commercial value.
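The abstract does not give the load estimate or the grouping procedure, so the following is only a minimal sketch of what a load-aware grouping of frequent items could look like. Everything in it is an assumption made for illustration: the function names `estimate_load` and `balanced_groups`, the estimate itself (support count multiplied by the item's position in the frequency-descending list), and the greedy rule that assigns the heaviest remaining item to the currently lightest group. It is not the thesis's actual method.

```python
import heapq


def estimate_load(support, rank):
    """Hypothetical load estimate for one frequent item (an assumption).

    support: number of transactions containing the item.
    rank: 0-based position of the item in the frequency-descending list.
    Items further down the list can have longer prefix paths, so the
    estimate here grows with both support and rank.
    """
    return support * (rank + 1)


def balanced_groups(supports, num_groups):
    """Greedily assign frequent items to groups by estimated load.

    supports: dict mapping item -> support count.
    Returns (groups, totals): the items in each group and each group's
    total estimated load.
    """
    # Frequency-descending item list; ties are broken by item name.
    f_list = sorted(supports, key=lambda item: (-supports[item], item))
    loads = {item: estimate_load(supports[item], rank)
             for rank, item in enumerate(f_list)}

    # Min-heap of (current total load, group index).
    heap = [(0, g) for g in range(num_groups)]
    heapq.heapify(heap)
    groups = [[] for _ in range(num_groups)]

    # Place the heaviest remaining item into the lightest group.
    for item in sorted(loads, key=lambda i: (-loads[i], i)):
        total, g = heapq.heappop(heap)
        groups[g].append(item)
        heapq.heappush(heap, (total + loads[item], g))

    totals = [sum(loads[i] for i in group) for group in groups]
    return groups, totals


if __name__ == "__main__":
    # Illustrative support counts, not data from the thesis.
    example = {"a": 8, "b": 6, "c": 5, "d": 3, "e": 2}
    groups, totals = balanced_groups(example, 2)
    print(groups, totals)
```

In a MapReduce setting, each resulting group would correspond to the set of items handled by one reducer. The greedy rule is a common heuristic for this kind of assignment and is used here only to make the idea concrete.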
[Degree-granting institution]: Shandong University
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP338.6
Link to this article: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/2390874.html