Research on a Parallel Apriori Data Mining Algorithm
Published: 2018-08-21 08:26
[Abstract]: Data is valuable because analyzing it reveals the patterns behind it, and those patterns can guide future production and work. With the rapid progress of the Internet and information technology, every industry has accumulated large amounts of data as it has developed. In China, the first to apply big data were a few large Internet companies. They have hundreds of millions of customers, whose online behavior generates large volumes of data. By analyzing customers' purchasing or reading habits, these companies can selectively push products and information to them. Big data is also very valuable in traditional industries. A power company, for example, can use data analysis to predict line loads and then optimize the storage and dispatch of electricity more precisely. Traditional manufacturers plan the research and development of next-generation products based on feedback from usage data. In short, using data analysis to guide future work has become a trend, so using data effectively and mining the patterns behind it has become especially important. Data mining technology emerged against this background.
Data mining algorithms fall into six main categories: association, classification, regression, clustering, prediction, and diagnosis. This thesis focuses on association algorithms. One of the classic algorithms for mining association rules is Apriori, which can accurately find items in the data that are associated with one another. A typical application is product placement in supermarkets, where retailers shelve together the goods that customers tend to buy together. The original algorithm was not designed with data scale fully in mind, so it can be inefficient on very large data sets.
The approach of this thesis is therefore to optimize the Apriori algorithm to some extent and to port it to the Hadoop platform using MapReduce. The traditional Apriori algorithm thereby becomes a distributed algorithm: tasks and data can be spread across a cluster, which improves mining efficiency. Hadoop is a cloud computing platform. Its advantage is that it can store and process data on large numbers of inexpensive machines that are not highly reliable, and its programming model makes it convenient to turn some serial algorithms into concurrent ones. This thesis introduces the background of Hadoop and association algorithms in detail, discusses the feasibility of implementing Apriori with the MapReduce programming framework and deploying it on Hadoop, and demonstrates the effect of this approach on efficiency. It is intended as a reference for researchers who port algorithms to cloud platforms in the future.
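The idea described in the abstract, level-wise Apriori with support counting split into a map phase and a reduce phase, can be sketched in plain Python. This is only an illustrative, single-process simulation and not the thesis's implementation: it uses no Hadoop API, the partitions stand in for input splits, and every name here (`map_count`, `reduce_counts`, `generate_candidates`, `apriori`, the sample baskets, and the `min_support` value) is a hypothetical choice made for the example.

```python
from collections import Counter
from itertools import combinations


def map_count(partition, candidates):
    """Map step: count candidate itemsets within one partition of transactions."""
    counts = Counter()
    for transaction in partition:
        for candidate in candidates:
            if candidate <= transaction:  # candidate is a subset of the basket
                counts[candidate] += 1
    return counts


def reduce_counts(partial_counts, min_support):
    """Reduce step: sum the per-partition counts, keep itemsets meeting min_support."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return {itemset: n for itemset, n in total.items() if n >= min_support}


def generate_candidates(frequent, k):
    """Join frequent (k-1)-itemsets into k-itemsets, pruning with the Apriori
    property: every (k-1)-subset of a candidate must itself be frequent."""
    frequent = set(frequent)
    candidates = set()
    for a in frequent:
        for b in frequent:
            union = a | b
            if len(union) == k and all(
                frozenset(subset) in frequent
                for subset in combinations(union, k - 1)
            ):
                candidates.add(union)
    return candidates


def apriori(partitions, min_support):
    """Return {frequent itemset: support count} over all partitions."""
    items = {item for part in partitions for basket in part for item in basket}
    candidates = {frozenset([item]) for item in items}
    frequent_all = {}
    k = 1
    while candidates:
        # On a cluster, each map_count call would run on a different node.
        partials = [map_count(part, candidates) for part in partitions]
        frequent = reduce_counts(partials, min_support)
        frequent_all.update(frequent)
        k += 1
        candidates = generate_candidates(frequent, k)
    return frequent_all


# Hypothetical supermarket baskets, split into two partitions.
partitions = [
    [{"bread", "milk"}, {"bread", "diapers", "beer"}],
    [{"milk", "diapers", "beer"}, {"bread", "milk", "diapers"},
     {"bread", "milk", "beer"}],
]
result = apriori(partitions, min_support=3)
```

With these baskets and a minimum support count of 3, the only frequent pair is bread and milk. Each pass over the data is one map and reduce round, which is why the number of candidate itemsets per level largely determines the cost on large data sets.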
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP311.13
Article ID: 2195130
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2195130.html