Research and Application of Parallelized Machine Learning Algorithms on a Cloud Platform

Published: 2018-03-07 11:34

  Topic: cloud computing  Focus: Spark  Source: Inner Mongolia Normal University, 2016 master's thesis  Type: degree thesis


[Abstract]: With the arrival of the information age, data has become the most valuable resource, and the volume of data that every industry must process is growing exponentially. Examples include the commercial data of e-commerce websites, the business data of banks, and the genomic data of living organisms. Existing platforms struggle to process this explosive growth effectively. At present, the Hadoop platform, with its MapReduce (MR) programming framework, is a relatively efficient parallel technology for mining useful information from big data, and its advantages become more pronounced as the data volume grows. Mahout is an open-source machine learning (ML) algorithm library from the Apache community. It is built on the MR computing framework of the Hadoop platform and provides developers with efficient algorithm implementations. Most machine learning algorithms are iterative, however, and MR writes intermediate data to the Hadoop Distributed File System (HDFS), so this approach is limited by high I/O resource consumption. The Spark computing framework emerged in response to these shortcomings of the Mahout machine learning library. Spark is built mainly on resilient distributed datasets (RDDs), an abstraction of distributed memory that reduces both I/O resource consumption and the overhead of fault tolerance. Spark can also be deployed on the Hadoop YARN platform, with data stored in a distributed manner. The arrival of Spark MLlib brought a qualitative improvement to research on parallelizing machine learning algorithms. This thesis studies the K-means clustering algorithm and the decision tree classification algorithm, together with its ensemble form, the random forest, all based on Spark MLlib, in order to handle genomic data that a single machine cannot process. K-means serves as the first step of data processing and is used to find the best number of categories. In the second step, the random forest classifier trains a model on the resulting categories, which is then used for subsequent category prediction. The algorithms are applied mainly to the analysis of genomic data, but they are not limited to it: machine learning algorithms based on a cloud platform and Spark scale well. Experiments show that Spark-based machine learning algorithms can effectively improve the analysis of large-scale genomic data and thereby help advance scientific research on genomic data.
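The first step described in the abstract, using K-means to find a suitable number of categories, can be illustrated with a small sketch. This is not the thesis's code: the thesis runs Spark MLlib on a cluster, whereas the sketch below is a single-machine stand-in that uses only the Python standard library. The toy data, the function names, the farthest-first initialization, and the use of the within-cluster sum of squared errors as the selection criterion are all assumptions made for illustration, since the abstract does not say how the best number of categories was chosen.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first_centers(points, k):
    """Deterministic initialization (an assumption, not the thesis's method):
    start from the first point, then repeatedly add the point that is
    farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centers))
        )
    return centers


def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm; returns the centers and the within-cluster
    sum of squared errors."""
    centers = farthest_first_centers(points, k)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    cost = sum(min(squared_distance(p, c) for c in centers) for p in points)
    return centers, cost


def make_toy_data(seed=1):
    """Hypothetical data: three well-separated 2-D blobs of 30 points each."""
    rng = random.Random(seed)
    blobs = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
    return [
        (cx + rng.uniform(-1, 1), cy + rng.uniform(-1, 1))
        for cx, cy in blobs
        for _ in range(30)
    ]


points = make_toy_data()
# Fit K-means for several candidate values of k and record the cost of each.
costs = {k: kmeans(points, k)[1] for k in range(1, 7)}
for k, cost in costs.items():
    print(k, round(cost, 2))
```

On this toy data the cost drops sharply up to k = 3, the true number of blobs, which is the kind of signal that could be used to pick the number of categories. In the pipeline the abstract describes, the resulting cluster labels would then serve as the classes on which the random forest is trained.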
[Degree-granting institution]: Inner Mongolia Normal University
[Degree level]: Master's
[Year degree awarded]: 2016
[Classification number]: TP311.13; TP181

Article No.: 1579113

Link to this article: http://sikaile.net/jingjilunwen/dianzishangwulunwen/1579113.html

