Research on Interval Join Methods under MapReduce
Published: 2018-05-08 20:16
Topics: interval join + set classification; Source: Huazhong University of Science and Technology, 2016 master's thesis
[Abstract]: With the rapid development of network technology, the volume of data worldwide is multiplying, which makes big data analysis and processing difficult. MapReduce, an emerging programming model for data-intensive computing, plays an important role in big data analysis and processing. An interval join is a join in which the attribute values range over an interval, and it is an important operation in big data analysis and processing, so improving the efficiency of interval joins on the MapReduce platform is of considerable significance. Building on the concepts of interval tuples and interval-tuple relations proposed by Allen, this thesis designs set-classification-based algorithms for two-way and multi-way interval joins. First, the interval tuples taking part in the join are divided evenly into several partitions according to the interval range. Each tuple is mapped to the partition sets it intersects, and its position within each partition is classified. Four types of set classification are defined, and the share of each partition's total data accounted for by each of the four types is analyzed. Second, the two-way and multi-way interval join algorithms are implemented in the MapReduce distributed programming framework. The key-value pairs built from the four set classifications filter out tuples that do not need to take part in the join, which reduces the volume of data transferred on the Map side and the amount of computation on the Reduce side and so improves join efficiency. Finally, based on the share of each partition's total data held by each set classification, load-balancing strategies are formulated separately for two-way and multi-way interval joins: the set classifications of the partitions are recombined to generate new key-value pairs, balancing the data received by each Reduce node and further improving the completion time of interval join jobs.
The two-way and multi-way interval join methods were each validated on a purpose-built distributed Hadoop platform. The experimental results show that the set-classification-based interval join method is applicable to a variety of situations, has certain advantages over existing two-way and multi-way interval join methods, and that the load-balancing strategies further improve efficiency.
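The partition, classify, and join steps described in the abstract can be illustrated with a small in-memory sketch. This is not the thesis's implementation: the abstract does not define its four set classifications, so the four labels below (an interval lies inside a partition, crosses its right boundary, crosses its left boundary, or spans it) are an assumption, as are the function names, the integer partition `width`, and the rule used to avoid duplicate results. The map, shuffle, and reduce phases are simulated with plain Python rather than run on Hadoop.

```python
from collections import defaultdict
from itertools import product

# Hypothetical labels for an interval's position relative to one partition.
INSIDE = "inside"        # starts and ends within the partition
RIGHT = "crosses_right"  # starts within, ends beyond the right boundary
LEFT = "crosses_left"    # starts before the partition, ends within it
SPAN = "spans"           # starts before and ends beyond the partition


def classify(start, end, lo, hi):
    """Classify the closed interval [start, end] against partition [lo, hi)."""
    starts_here = start >= lo
    ends_here = end < hi
    if starts_here and ends_here:
        return INSIDE
    if starts_here:
        return RIGHT
    if ends_here:
        return LEFT
    return SPAN


def map_phase(relation, tuples, width):
    """Emit (partition, record) for every partition the interval intersects."""
    for tid, start, end in tuples:
        for p in range(start // width, end // width + 1):
            lo, hi = p * width, (p + 1) * width
            yield p, (relation, classify(start, end, lo, hi), tid, start, end)


def reduce_phase(records):
    """Join the R and S records that were routed to one partition."""
    r_side = [rec for rec in records if rec[0] == "R"]
    s_side = [rec for rec in records if rec[0] == "S"]
    for r, s in product(r_side, s_side):
        _, r_cls, r_id, r_start, r_end = r
        _, s_cls, s_id, s_start, s_end = s
        # Assumed filtering rule: if both intervals started before this
        # partition, the pair was already reported by an earlier partition.
        if r_cls in (LEFT, SPAN) and s_cls in (LEFT, SPAN):
            continue
        if r_start <= s_end and s_start <= r_end:  # closed intervals overlap
            yield r_id, s_id


def interval_join(r_tuples, s_tuples, width):
    """Simulate map, shuffle (group by partition), and reduce in memory."""
    groups = defaultdict(list)
    for relation, tuples in (("R", r_tuples), ("S", s_tuples)):
        for key, value in map_phase(relation, tuples, width):
            groups[key].append(value)
    result = []
    for key in sorted(groups):
        result.extend(reduce_phase(groups[key]))
    return sorted(result)


if __name__ == "__main__":
    R = [("r1", 0, 4), ("r2", 8, 25), ("r3", 30, 31)]
    S = [("s1", 3, 12), ("s2", 15, 18), ("s3", 26, 29)]
    print(interval_join(R, S, 10))
    # [('r1', 's1'), ('r2', 's1'), ('r2', 's2')]
```

In this sketch the classification does the filtering: a pair is checked only in the partition that contains the later of the two start points, so an interval that merely passes through a partition is never compared with another interval that also started earlier. A load-balancing step of the kind the abstract mentions would regroup these per-partition classes across reducers, which is not shown here.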
[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP311.13
Article ID: 1862908
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1862908.html