Design and Implementation of GPU-Based Parallel Computation for the Dirichlet Algorithm
Published: 2018-10-05 18:11
[Abstract]: In recent years, the spread of information technology and rapid advances in hardware have made it possible to generate and store data at very large scale. Businesses, research institutions, and government departments all hold large volumes of data. How to extract useful information from these large data sets has become a topic of growing concern, and data mining has attracted attention and developed rapidly against this background. Clustering, an important data mining tool, is the process of placing similar objects in the same group and dissimilar objects in different groups, and it is widely used in many fields.
This thesis first introduces the basic theory of data mining and cluster analysis, with a focus on Dirichlet mixture model clustering. It then studies the Dirichlet process mixture model algorithm and its concrete implementation in the Apache Mahout machine learning library. This mixture model is a Bayesian mixture model that uses a Dirichlet process as its prior. Mahout provides both a single-machine implementation and a MapReduce implementation; this thesis mainly studies the latter. The Dirichlet process clustering algorithm is first run on several data sets, and analysis of the results shows that most of the algorithm's cost lies in the processing done by the map function.
The thesis also studies the GPU (graphics processing unit) and proposes an improved scheme that uses GPU parallelism to raise the algorithm's efficiency. It examines the GPU architecture and its advantages, as well as parallel programming with CUDA. Building on the source code of Mahout's Dirichlet process mixture model algorithm, it implements an improved scheme in which a native CUDA program is called through JNI and the CUDA program processes the map function in parallel. Finally, the same data are used as input and the results are analyzed. Comparing the performance of the original and improved programs shows that the improved program raises the algorithm's efficiency, and that the gain is more pronounced when the data volume is large. These results provide a useful reference for research on the performance of data mining algorithms.
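The abstract attributes most of the running time to the map function. As a rough illustration of why such a step lends itself to GPU parallelism, the sketch below assumes that the map step assigns each point to one of the current candidate models by sampling from normalized probabilities (mixture weight times model density). The isotropic Gaussian model and every function name here are illustrative assumptions, not taken from Mahout or from the thesis, and plain Python stands in for the JNI/CUDA code.

```python
import math
import random


def gaussian_pdf(point, mean, std):
    """Density of an isotropic Gaussian (assumed model type, for illustration)."""
    d = len(point)
    sq_dist = sum((x - m) ** 2 for x, m in zip(point, mean))
    norm = (2.0 * math.pi * std * std) ** (d / 2.0)
    return math.exp(-sq_dist / (2.0 * std * std)) / norm


def assignment_probabilities(point, models, mixture):
    """Normalized probability that `point` belongs to each candidate model."""
    weights = [w * gaussian_pdf(point, mean, std)
               for w, (mean, std) in zip(mixture, models)]
    total = sum(weights)
    if total == 0.0:
        # Every density underflowed to zero; fall back to the mixture weights.
        mix_total = sum(mixture)
        return [w / mix_total for w in mixture]
    return [w / total for w in weights]


def map_point(point, models, mixture, rng):
    """Hypothetical map step for one point: sample a model index."""
    probs = assignment_probabilities(point, models, mixture)
    return rng.choices(range(len(models)), weights=probs, k=1)[0]


def map_batch(points, models, mixture, seed=0):
    """Apply the map step to every point.

    No point reads another point's result, so each loop iteration could in
    principle run as its own GPU thread, given a per-thread random stream.
    """
    rng = random.Random(seed)
    return [map_point(p, models, mixture, rng) for p in points]


if __name__ == "__main__":
    # Two well-separated candidate models, given as (mean, std) pairs.
    models = [((0.0, 0.0), 1.0), ((10.0, 10.0), 1.0)]
    mixture = [0.5, 0.5]
    points = [(0.2, -0.1), (9.7, 10.4), (0.5, 0.3)]
    print(map_batch(points, models, mixture, seed=42))
```

In this sketch the work per point is a loop over the candidate models, so the cost grows with both the number of points and the number of models. That is consistent with the abstract's observation that the benefit of parallelizing the map step grows with the data volume, although the sketch itself measures nothing.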
[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP311.13; TP338.6
Article ID: 2254368
Article link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/2254368.html