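Research on the Optimization and Application of Association Rule Algorithms Based on a Cloud Platform
Topic: cloud computing + data mining; Source: Henan University of Technology, master's thesis, 2017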
[Abstract]: With the rapid development of the Internet, networks have reached into every aspect of daily life. The Internet has enriched and simplified everyday life and has, to some extent, changed the way people work. As Internet technology has spread, the volume of data generated by back-end systems has grown to a massive scale, and extracting valuable information from big data has drawn attention across industries. Mining association rules between items from large, noisy data sets is one of the more widely used applications of data mining. Traditional single-machine data mining, however, cannot analyze massive data comprehensively, and the emergence of cloud computing has offered the data mining field a new approach. The Hadoop cloud platform, developed by the Apache Foundation, has lowered the technical barrier to cloud computing development. Combining the parallel computing capability of a cloud platform with an improved association rule algorithm makes it possible to mine massive data more effectively and to uncover the patterns contained in a data set, which in turn supports better decisions in commercial applications. This thesis takes the traditional Apriori algorithm as its theoretical basis, analyzes the algorithm's execution flow to identify the key points that can be optimized, and improves the algorithm accordingly. The improved Apriori algorithm is then combined with the Hadoop platform: it is deployed on the cloud platform so that it runs in parallel and can process massive data. The thesis reviews in detail the current state and development of cloud computing and data mining research and, within Hadoop, focuses on its two core technologies, HDFS and MapReduce. Chapter 3 analyzes the traditional Apriori association algorithm, uses a worked example to show the shortcomings of its execution, surveys existing optimization methods, and compares their performance. Chapters 4 and 5 form the core of the research. Chapter 4 proposes an improvement to the traditional Apriori algorithm that reduces its time complexity and raises its execution efficiency. It then introduces an interestingness threshold to further filter the rules the algorithm produces, improving the validity and usability of the strong association rules; the experimental results are presented as line charts and compared to draw conclusions. Chapter 5 describes the process of setting up a Hadoop platform and its usual configuration, explains the idea behind parallelizing the algorithm, and outlines the retail industry's need for cloud-based association analysis. The optimized Apriori algorithm is deployed on Hadoop and its execution efficiency is compared with that of an ordinary serial algorithm; the experimental results are used to discuss the feasibility and advantages of parallelization.
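For readers unfamiliar with the baseline, the following is a minimal single-machine sketch of classic level-wise Apriori followed by a rule filter. It is not the thesis's improved algorithm, which the abstract does not specify. The function names, the sample transactions, and the use of lift as the interestingness measure are all assumptions made for illustration; the thesis may define its interest threshold differently.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Classic level-wise Apriori: return {itemset: support} for frequent itemsets."""
    sets = [frozenset(t) for t in transactions]
    n = len(sets)
    # Level 1 candidates: every distinct single item.
    candidates = {frozenset([item]) for t in sets for item in t}
    frequent = {}
    k = 1
    while candidates:
        # One full scan of the data per level (the cost that optimizations target).
        counts = {c: sum(1 for t in sets if c <= t) for c in candidates}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Join step: unite pairs of frequent k-itemsets into (k+1)-itemsets.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all of its (k-1)-subsets are frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


def filter_rules(frequent, min_confidence, min_lift):
    """Generate rules lhs -> rhs and keep those passing both thresholds.

    Lift is used here as an ASSUMED interestingness measure.
    """
    kept = []
    for itemset, support in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                rhs = itemset - lhs
                confidence = support / frequent[lhs]
                lift = confidence / frequent[rhs]
                if confidence >= min_confidence and lift >= min_lift:
                    kept.append((lhs, rhs, support, confidence, lift))
    return kept


# Hypothetical retail baskets, invented for this example.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
freq = frequent_itemsets(transactions, min_support=0.4)
kept = filter_rules(freq, min_confidence=0.6, min_lift=1.0)
for lhs, rhs, support, confidence, lift in kept:
    print(sorted(lhs), "->", sorted(rhs), round(confidence, 3), round(lift, 3))
```

On this sample data, the rule bread -> milk has confidence 0.75 but lift below 1, so the second threshold discards it even though it passes the confidence test. This illustrates why an extra interestingness filter can remove rules that look strong but carry little information. In a MapReduce setting, the per-level counting scan is the natural part to distribute, with mappers emitting candidate counts per data split and reducers summing them.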
[Degree-granting institution]: Henan University of Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13; TP393.09