Optimizing HADOOP Job Startup Performance in Practice

Published: 2018-04-26 10:14

Topic: Hadoop + Split; Source: Beijing Jiaotong University, master's thesis, 2012

[Abstract]: This thesis describes a project I carried out while working in the distributed computing group at Baidu: reducing the submission time of HADOOP jobs. The project focuses on the time and memory consumed by the split process during job submission. Split is the most time-consuming step of job submission and also the most important of the preparatory steps before submission, because it determines how the input data is partitioned: it fixes the number of map tasks in the job, how much data each map task processes, and which node's TaskTracker is preferred for each map task.

Neither Baidu's earlier HADOOP versions nor the current community HADOOP version has made any major modification or optimization to the split process. As Baidu's HADOOP clusters have grown and large jobs have become more numerous, both the volume of input data and the number of input files per job have kept increasing. As a result, running split over these inputs before submission exposed two problems: high memory consumption and long running time. Both have seriously reduced the efficiency with which Baidu's HADOOP clusters handle large jobs and have drawn complaints from users in Baidu's data mining, log analysis, and other departments. To improve cluster efficiency and the user experience, the split process had to be optimized.

The optimization work, which I completed independently, has four parts: (1) optimizing the retrieval of blockLocations; (2) optimizing the ls step for the case where an intermediate component of the input path's regular expression matches a file; (3) reducing the excessive memory use of getSplits; and (4) moving the getSplit process to the TaskTracker. The first part speeds up the retrieval of blockLocation information. The second speeds up path traversal when a file is matched at an intermediate level. The third sharply reduces memory use throughout the split process and makes it depend on a user-specified parameter rather than on the volume of the job's input data. The fourth relieves pressure on the client and exploits network transfer within the same cluster to cut split time further.

The optimized split process has been deployed on Baidu's HADOOP clusters with very good results. Submission time for large jobs dropped from hours to minutes, the split process runs 30 to 60 times faster on average, and memory use for the whole split process stays stable at about 200 MB, whereas previously it grew with the job's input data volume and could exceed 3 GB. The project was well received by colleagues in the department and by users.
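To make concrete why split fixes the number of map tasks and the data each one handles, here is a minimal Python sketch. It models, in simplified form, the split-size rule of stock Hadoop's `FileInputFormat` (old `mapred` API) as I understand it: split size is `max(minSize, min(goalSize, blockSize))`, and the final split may run up to 10% over. It is not the thesis's code. The names `Split`, `split_size`, `iter_splits`, and `batches` are hypothetical, block locations and zero-length files are ignored, and the batching step is only an assumed way to keep memory tied to a user-chosen parameter rather than to input volume.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple

SPLIT_SLOP = 1.1  # the last split of a file may be up to 10% larger than size


@dataclass(frozen=True)
class Split:
    path: str
    offset: int
    length: int


def split_size(total_size: int, requested_maps: int, block_size: int,
               min_size: int = 1) -> int:
    """Split size = max(min_size, min(goal_size, block_size))."""
    goal = total_size // max(requested_maps, 1)
    return max(min_size, min(goal, block_size))


def iter_splits(files: Iterable[Tuple[str, int]], size: int) -> Iterator[Split]:
    """Yield splits lazily, so the full split list never sits in memory."""
    for path, length in files:
        remaining = length
        while remaining / size > SPLIT_SLOP:
            yield Split(path, length - remaining, size)
            remaining -= size
        if remaining > 0:  # tail split (simplification: empty files are skipped)
            yield Split(path, length - remaining, remaining)


def batches(splits: Iterator[Split], batch_size: int) -> Iterator[List[Split]]:
    """Group splits into batches of at most batch_size (hypothetical knob)."""
    batch: List[Split] = []
    for s in splits:
        batch.append(s)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


if __name__ == "__main__":
    files = [("a", 250), ("b", 100)]  # (path, length in bytes), made-up input
    size = split_size(total_size=350, requested_maps=1, block_size=100)
    for batch in batches(iter_splits(files, size), batch_size=3):
        print(batch)  # each batch could be written out, then discarded
```

Each yielded split corresponds to one map task, so the count of splits is the map task count. Because the splits are produced lazily and handled in fixed-size batches, peak memory in this sketch scales with `batch_size`, not with the number of input files.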
[Degree-granting institution]: Beijing Jiaotong University
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP338.8

Cite as: 王謙 (Wang Qian). Optimizing HADOOP Job Startup Performance in Practice [D]. Beijing Jiaotong University, 2012.


Article ID: 1805609


Link to this article: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/1805609.html
