Research and Implementation of a Distributed Vector Computation Framework Based on MPI and MapReduce
Keywords: distributed computing framework; machine learning; vector; MPI; MapReduce. Source: Master's thesis, Zhejiang University, 2013. Document type: degree thesis.
【Abstract】: Machine learning is an interdisciplinary field that has emerged over the past twenty years, drawing on probability theory, statistics, approximation theory, convex analysis, and other disciplines. Its algorithms are already widely applied in data mining, natural language processing, search engines, and other areas. Open-source single-machine implementations of many machine learning algorithms exist, but with the rapid growth of the Internet the volume of user data has increased sharply, and single-machine implementations can no longer meet industrial needs. To obtain high-performance implementations, developers turn to parallel computing frameworks such as MPI and Hadoop/MapReduce.
MPI is efficient, flexible to program, and scalable, which makes it well suited to high-performance computing, but it also has shortcomings: it exposes a large number of interfaces, so the learning cost is high; writing a high-performance MPI program usually means handling data partitioning, network communication, and similar concerns by hand, since MPI offers no computing model comparable to MapReduce, which adds to the programmer's burden; implementations tend to be algorithm-specific, which hinders code reuse and leaves no unified abstract distributed data structure; and program fault tolerance is poor.
To address these shortcomings, this thesis surveys MPI fault-tolerance schemes and the applications and improvements of MapReduce and, combining them with the design of an abstract vector interface, proposes a distributed computing framework on top of MPI based on vectors and MapReduce. The framework abstracts the matrix operations of machine learning algorithms into operations on distributed vectors, and uses asynchronous sends and receives to improve network transmission efficiency, overlapping CPU computation with communication as far as possible. On this basis, a checkpoint mechanism is introduced to improve the fault tolerance of multi-round iterative algorithms in the MPI environment.
To verify the efficiency and correctness of the implementation, the PageRank algorithm was chosen for comparative experiments. The experiments show that the proposed framework is well suited to, and can effectively solve, the distributed implementation of machine learning algorithms that fit the MapReduce model.
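To make the abstract's central idea concrete, the sketch below expresses one PageRank iteration, the validation algorithm named above, as a map-style local scatter followed by a reduce over a block-distributed vector, written directly against MPI. It is only an illustration of the pattern under stated assumptions, not the thesis framework's API: the toy ring-shaped link structure, the block partitioning, and all names are invented for this example, and a blocking MPI_Allreduce stands in for the framework's reduce step.

#include <mpi.h>
#include <algorithm>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, nprocs = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int N = 8;            // total number of pages (toy size)
    const double d = 0.85;      // damping factor
    // Toy link structure, known to every process: page i links only to (i + 1) % N.
    // Each process owns a contiguous block of source pages, i.e. its slice of the vector.
    int block = (N + nprocs - 1) / nprocs;
    int lo = std::min(N, rank * block);
    int hi = std::min(N, lo + block);

    std::vector<double> pr(N, 1.0 / N);    // current PageRank vector, replicated on each rank
    std::vector<double> contrib(N, 0.0);   // this rank's partial contributions ("map" output)
    std::vector<double> total(N, 0.0);     // globally summed contributions ("reduce" output)

    for (int iter = 0; iter < 20; ++iter) {
        std::fill(contrib.begin(), contrib.end(), 0.0);
        // "Map": scatter rank mass from the pages this process owns.
        for (int i = lo; i < hi; ++i) {
            int target = (i + 1) % N;      // the single out-link in this toy graph
            contrib[target] += pr[i];      // out-degree is 1, so no division needed
        }
        // "Reduce": sum the partial contributions of all processes.
        MPI_Allreduce(contrib.data(), total.data(), N, MPI_DOUBLE,
                      MPI_SUM, MPI_COMM_WORLD);
        for (int j = 0; j < N; ++j)
            pr[j] = (1.0 - d) / N + d * total[j];
    }

    if (rank == 0)
        for (int j = 0; j < N; ++j)
            std::printf("page %d: %.4f\n", j, pr[j]);

    MPI_Finalize();
    return 0;
}

In the framework the abstract describes, the blocking reduce above would presumably be handled by the framework's distributed-vector operations with asynchronous communication, and each rank's vector slice would be checkpointed periodically so that a multi-round iterative run can recover from failures.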
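The abstract also highlights overlapping CPU computation with network transfers via asynchronous sends and receives. The following self-contained sketch shows that general pattern with standard non-blocking MPI calls; it is not code from the thesis, and the ring exchange, buffer size, and variable names are assumptions made for illustration.

#include <mpi.h>
#include <cstdio>
#include <numeric>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, nprocs = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int n = 1 << 16;
    std::vector<double> local(n, rank + 1.0);  // this rank's slice of a distributed vector
    std::vector<double> incoming(n, 0.0);      // slice arriving from the left neighbour

    int right = (rank + 1) % nprocs;
    int left  = (rank - 1 + nprocs) % nprocs;

    MPI_Request reqs[2];
    // Start the non-blocking exchange first ...
    MPI_Irecv(incoming.data(), n, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(local.data(),    n, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    // ... then do computation that needs only local data while the transfer is in flight.
    double local_sum = std::accumulate(local.begin(), local.end(), 0.0);

    // Only wait once the received buffer is actually needed.
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    double recv_sum = std::accumulate(incoming.begin(), incoming.end(), 0.0);

    std::printf("rank %d: local sum %.1f, received sum %.1f\n", rank, local_sum, recv_sum);
    MPI_Finalize();
    return 0;
}

The point of the pattern is simply that useful local work sits between posting the requests and MPI_Waitall, so transfer time is hidden behind computation rather than added to it.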
【Degree-granting institution】: Zhejiang University
【Degree level】: Master's
【Year conferred】: 2013
【CLC number】: TP181
Article ID: 1550450
Link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1550450.html