Research on a Genetic-Algorithm-Based MapReduce Architecture for Distributed Data Mining

Published: 2018-09-08 17:26

[Abstract]: In recent years, the rapid development of information technology has, directly or indirectly, produced massive volumes of data, which poses new challenges for traditional data mining algorithms. How to improve the generality and performance of these algorithms on massive data has become a current research focus. To address this problem, researchers have combined traditional data mining algorithms with emerging technologies such as cloud computing platforms, using distributed computing to improve algorithm performance, with good results. However, data mining algorithms are numerous and each one needs its own implementation pattern, so there is no general architecture that accommodates this diversity while also improving performance. Building on previous work, this thesis proposes a distributed data mining MapReduce architecture based on genetic algorithms, intended to help users handle data mining algorithms in a more general way and to improve their performance. Of the two elements of the architecture, MapReduce provides the distributed computing capability, while the genetic algorithm provides global search and optimization, finding the optimal solution by simulating the evolution of a population. Users therefore only need to implement the genetic algorithm and do not have to worry about parallelizing it. The main contribution of this thesis is the architecture itself, which is divided into a core layer and a user layer. The core layer encapsulates the MapReduce operations, and the user layer exposes extension interfaces through which users implement a genetic algorithm for their specific problem, so that data mining algorithms can be applied effectively to massive data.
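The split between the core layer and the user layer can be illustrated with a minimal sketch. Everything named here (`GeneticProblem`, `map_phase`, `reduce_phase`, `evaluate`, and the toy `OneMax` problem) is a hypothetical illustration, not the thesis's actual API, and the map and reduce steps run in-process rather than on a MapReduce cluster.

```python
from abc import ABC, abstractmethod
from functools import reduce
from typing import Generic, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


class GeneticProblem(ABC, Generic[T]):
    """User layer (hypothetical interface): problem-specific genetic operators."""

    @abstractmethod
    def fitness(self, individual: T) -> float: ...

    @abstractmethod
    def crossover(self, a: T, b: T) -> T: ...

    @abstractmethod
    def mutate(self, individual: T) -> T: ...


def map_phase(problem: GeneticProblem, partition: Sequence) -> List[Tuple[float, object]]:
    """Core layer, map step: score every individual in one partition."""
    return [(problem.fitness(ind), ind) for ind in partition]


def reduce_phase(scored_partitions: List[List[Tuple[float, object]]]):
    """Core layer, reduce step: merge the partitions and rank by fitness, best first."""
    merged = reduce(lambda acc, part: acc + part, scored_partitions, [])
    return sorted(merged, key=lambda pair: pair[0], reverse=True)


def evaluate(problem: GeneticProblem, population: Sequence, n_partitions: int = 2):
    """Split the population, map each split, then reduce.

    Assumption: this runs in one process; on a real cluster each partition
    would be scored by a separate map task.
    """
    partitions = [population[i::n_partitions] for i in range(n_partitions)]
    return reduce_phase([map_phase(problem, p) for p in partitions])


class OneMax(GeneticProblem[List[int]]):
    """Toy user-layer problem (assumption): maximise the number of 1 bits."""

    def fitness(self, individual):
        return float(sum(individual))

    def crossover(self, a, b):
        return a[: len(a) // 2] + b[len(b) // 2:]

    def mutate(self, individual):
        return [1 - individual[0]] + individual[1:]


ranked = evaluate(OneMax(), [[0, 0, 1], [1, 1, 1], [1, 0, 0], [1, 1, 0]])
```

In this sketch the user writes only `OneMax`; the partitioning, mapping, and reducing live in functions the user never touches, which is the separation the abstract describes.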
The architecture consists of six components. The Driver component is the main part of the framework; it handles user interaction and launches Jobs on the cluster. The Generator component calls the genetic algorithm implemented in the user layer and works with the Driver to launch the Jobs that evolve the population. The Terminator component checks, during the Generator process, whether the termination condition has been met. The Initialiser component initializes the population and is optional. The Migrator component implements the population migration strategy and is supplied by the user layer. Finally, the SolutionFilter component selects the individuals that satisfy the required conditions. The components cooperate to deliver the architecture's functionality. Three algorithms are used to validate the architecture. First, a genetic algorithm for K-Medoids is designed and implemented, with clustering accuracy as the individual fitness value and MapReduce used to strengthen the clustering computation; the experiments show good clustering results. Second, a genetic algorithm for the Traveling Salesman Problem (TSP) is designed and implemented, with the reciprocal of the distance an individual travels through the cities as the fitness function, so that a shorter distance gives a higher fitness value. The experimental results show that the TSP algorithm running in the architecture handles big data effectively and finds the optimal solution faster than comparable algorithms. Finally, a genetic algorithm for the Feature Subset Selection (FSS) problem is designed and implemented, with the classification accuracy of the selected features as the fitness value. The experimental results show that the FSS algorithm running in the architecture converges faster and achieves higher accuracy.
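The TSP fitness function described above, the reciprocal of the distance travelled, can be sketched as follows. The abstract does not say whether the tour returns to its starting city or which distance metric is used, so the closed tour and the Euclidean distance here are assumptions, as are the names `tour_length` and `tsp_fitness`.

```python
import math
from typing import Sequence, Tuple

City = Tuple[float, float]


def tour_length(tour: Sequence[int], cities: Sequence[City]) -> float:
    """Total Euclidean length of a closed tour (assumption: it returns to the start)."""
    total = 0.0
    for i, city in enumerate(tour):
        next_city = tour[(i + 1) % len(tour)]
        total += math.dist(cities[city], cities[next_city])
    return total


def tsp_fitness(tour: Sequence[int], cities: Sequence[City]) -> float:
    """Fitness as the reciprocal of the tour length: shorter tours score higher."""
    return 1.0 / tour_length(tour, cities)


# Four cities on the corners of a unit square.
cities = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
perimeter_tour = [0, 1, 2, 3]  # walks the perimeter, length 4
crossing_tour = [0, 2, 1, 3]   # uses both diagonals, length 2 + 2*sqrt(2)
```

On this example the perimeter tour has length 4 and fitness 0.25, while the crossing tour is longer and therefore has a lower fitness.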
In summary, the genetic-algorithm-based distributed data mining MapReduce architecture proposed in this thesis performs well when running data mining algorithms on massive data. Through a problem-specific genetic algorithm implementation, it uses distributed computing to improve algorithm performance and uses the global search and optimization capability of genetic algorithms to find the optimal solution quickly. The study shows that the architecture improves both the effectiveness and the performance of data mining algorithms on massive data.
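As a recap of how the six components named in the abstract might fit together, here is a single-process sketch. The function names mirror the component names, but their signatures, the island-style subpopulations, the ring migration, the elitism, and the bit-counting fitness are all assumptions made for illustration; the thesis's real components launch MapReduce Jobs on a cluster.

```python
import random
from typing import List

Individual = List[int]
Population = List[Individual]


def fitness(ind: Individual) -> int:
    """Toy fitness (assumption): number of 1 bits."""
    return sum(ind)


def initialiser(rng: random.Random, islands: int, size: int, length: int) -> List[Population]:
    """Initialiser: random bit strings, one subpopulation per island."""
    return [[[rng.randint(0, 1) for _ in range(length)] for _ in range(size)]
            for _ in range(islands)]


def generator(rng: random.Random, pop: Population) -> Population:
    """Generator: one generation on one island."""
    def pick() -> Individual:
        a, b = rng.sample(pop, 2)  # binary tournament
        return a if fitness(a) >= fitness(b) else b

    children = [max(pop, key=fitness)]  # elitism: keep the best parent unchanged
    while len(children) < len(pop):
        p1, p2 = pick(), pick()
        cut = rng.randrange(1, len(p1))
        child = p1[:cut] + p2[cut:]      # one-point crossover
        if rng.random() < 0.1:           # occasional bit-flip mutation
            i = rng.randrange(len(child))
            child[i] = 1 - child[i]
        children.append(child)
    return children


def migrator(islands: List[Population]) -> List[Population]:
    """Migrator: ring migration, each island's best replaces its neighbour's worst."""
    bests = [max(pop, key=fitness) for pop in islands]
    for i, pop in enumerate(islands):
        worst = min(range(len(pop)), key=lambda j: fitness(pop[j]))
        pop[worst] = list(bests[i - 1])
    return islands


def terminator(islands: List[Population], generation: int,
               max_generations: int, length: int) -> bool:
    """Terminator: stop on a perfect individual or when the generation budget is spent."""
    solved = any(fitness(ind) == length for pop in islands for ind in pop)
    return solved or generation >= max_generations


def solution_filter(islands: List[Population], min_fitness: int) -> List[Individual]:
    """SolutionFilter: keep only individuals that reach the required fitness."""
    return [ind for pop in islands for ind in pop if fitness(ind) >= min_fitness]


def driver(seed=0, islands=3, size=8, length=12, max_generations=50,
           migrate_every=5, min_fitness=10):
    """Driver: wires the components together; each generator call stands in for a Job."""
    rng = random.Random(seed)
    pops = initialiser(rng, islands, size, length)
    generation = 0
    while not terminator(pops, generation, max_generations, length):
        pops = [generator(rng, pop) for pop in pops]
        generation += 1
        if generation % migrate_every == 0:
            pops = migrator(pops)
    best = max(fitness(ind) for pop in pops for ind in pop)
    return solution_filter(pops, min_fitness), generation, best
```

Calling `driver(seed=0)` returns the filtered individuals, the number of generations run, and the best fitness found. Because the sketch keeps the best parent in every generation, the best fitness never drops below that of the initial population.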
【Degree-granting institution】: Tianjin University
【Degree level】: Master's
【Year degree awarded】: 2016
【Classification number】: TP311.13

【Citation】: Han Laiming (韓來明); Research on a Genetic-Algorithm-Based MapReduce Architecture for Distributed Data Mining [D]; Tianjin University; 2016


Article ID: 2231191


Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2231191.html
