Research on Big Data K-Nearest Neighbors Based on YARN and Hashing Techniques
Published: 2018-12-16 21:15
[Abstract]: Big data has been one of the most active research directions in machine learning in recent years, and it poses great challenges to traditional machine learning. K-nearest neighbor (K-NN) is a well-known classification algorithm. Because it is simple and easy to implement, it is widely used in many fields, such as face recognition, gene classification, and decision support. In a big data setting, however, the K-NN algorithm becomes very inefficient, or even infeasible. To address this problem, this thesis proposes two solutions based on YARN and hashing techniques: one uses MapReduce and SimHash to implement K-NN classification for big data sets on a cloud computing platform; the other uses Spark and SimHash to do the same. The two solutions share the same basic idea, which consists of three steps: (1) apply a hash transformation to the big data set, mapping it into Hamming space; (2) in Hamming space, use the big data computing frameworks MapReduce and Spark on the YARN cloud computing platform to find the training samples that fall into the same bucket as the test sample x; (3) within that bucket, find the K exact nearest neighbors of x and use them to classify x. Experimental results show that the proposed solutions are feasible and can greatly improve the efficiency of the K-NN algorithm while preserving its classification ability.
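The three steps in the abstract can be illustrated with a minimal single-machine sketch. It is not the thesis's implementation: it assumes numeric feature vectors, uses the random-hyperplane (sign-bit) form of SimHash, and leaves out MapReduce, Spark, and YARN entirely. The function names (`simhash_codes`, `build_buckets`, `knn_in_bucket`), the Euclidean distance, the majority vote, and the fallback to a full scan when a bucket is empty are all assumptions made for illustration.

```python
import numpy as np
from collections import Counter, defaultdict


def simhash_codes(X, planes):
    """Step 1: map each row of X into Hamming space.

    Each random hyperplane contributes one bit: 1 if the point lies on
    its non-negative side, 0 otherwise (assumed SimHash variant).
    """
    bits = (X @ planes.T) >= 0
    return [tuple(row) for row in bits.astype(np.uint8).tolist()]


def build_buckets(X_train, planes):
    """Step 2: group training indices by their bit signature (bucket)."""
    buckets = defaultdict(list)
    for i, code in enumerate(simhash_codes(X_train, planes)):
        buckets[code].append(i)
    return buckets


def knn_in_bucket(x, X_train, y_train, buckets, planes, k=3):
    """Step 3: exact K-NN restricted to the bucket that x hashes into."""
    code = simhash_codes(x[None, :], planes)[0]
    candidates = buckets.get(code)
    if not candidates:
        # Assumption, not from the source: scan all training samples
        # when no training sample shares the bucket of x.
        candidates = range(len(X_train))
    idx = np.fromiter(candidates, dtype=int)
    dists = np.linalg.norm(X_train[idx] - x, axis=1)
    nearest = idx[np.argsort(dists)[:k]]
    # Majority vote among the (at most k) nearest neighbors in the bucket.
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]
```

In a distributed setting, the bucket signature would be a natural grouping key (for example, the key emitted by a map step), so that the exact neighbor search in step 3 only touches samples sharing that key. How the thesis actually partitions the work is not stated in the abstract.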
[Degree-granting institution]: Hebei University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP311.13; TP181
Article ID: 2383061
Article link: http://sikaile.net/kejilunwen/zidonghuakongzhilunwen/2383061.html