Research and Optimization of Hadoop Small-File Processing Technology

Published: 2018-12-13 18:28

[Abstract]: With the rapid development of the Internet, digital information is growing exponentially, and humanity has entered the era of big data. Traditional methods of data storage and computation are steadily losing their advantage, and how to store and process massive volumes of data efficiently and sensibly has become a focus of concern across industries at home and abroad. The high demands placed on data computation and storage gave rise to the concept of cloud computing, and with the rapid development of cloud computing technology, storage and computation have become the most active research areas. Hadoop is an open-source project of the Apache Foundation. Its outstanding performance in distributed storage and distributed computing has attracted wide attention at home and abroad, and a growing number of universities and enterprises now use Hadoop to support their business and needs. Although Hadoop is designed specifically for storing and processing big data, storing small files in Hadoop places heavy memory pressure on the master node, reduces file access efficiency, and indirectly reduces the computational efficiency of the MapReduce programming model. Building on the two core components of Hadoop, the MapReduce computing model and the HDFS distributed file system, this thesis focuses on general-purpose optimization of small-file processing in Hadoop. To address the NameNode memory waste, inefficient file reads, and low MapReduce computational efficiency that arise when Hadoop stores and processes small files, the thesis first studies the small-file processing techniques that ship with Hadoop and analyzes their strengths and weaknesses in depth. It then studies and optimizes Hadoop at both the MapReduce level and the HDFS level to improve the efficiency with which Hadoop stores and processes small files.
At the MapReduce level, the thesis studies the MapReduce execution flow and the InputFormat architecture in depth, and analyzes the MapReduce source code and the concrete implementation of its internal methods in detail. By studying and implementing the CombineFileInputFormat abstract class, it merges small-file inputs at the MapReduce level, which improves the efficiency with which Hadoop processes small files. At the HDFS level, the thesis proposes a distributed file system with an independent small-file processing module. The module does not depend on HDFS, so it is well decoupled from the Hadoop cluster and neither affects the other. The module merges small files, maintains an index mapping for them, and reads them back, and it adds a small-file cache module to improve file access efficiency, which indirectly improves the efficiency of MapReduce when it processes small files. Finally, experiments verify that the custom CombineFileInputFormat achieves higher MapReduce processing efficiency than the other input formats. The independent small-file processing module also speeds up file access and reduces the memory pressure on the master node.
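The abstract does not give the internal design of the small-file module, so the following is only a minimal sketch of the general idea it names: merge small files into one large object, keep an index mapping from file name to position, and serve repeated reads from a cache. The class `SmallFileStore`, its methods, and the choice of an LRU eviction policy are all hypothetical illustrations, not the thesis's implementation, and an in-memory `bytearray` stands in for the merged file that a real system would store on the cluster.

```python
from collections import OrderedDict


class SmallFileStore:
    """Hypothetical sketch: merged blob + index mapping + LRU read cache."""

    def __init__(self, cache_capacity=2):
        self.blob = bytearray()      # stands in for one large merged file
        self.index = {}              # file name -> (offset, length) in the blob
        self.cache = OrderedDict()   # recently read files, oldest first
        self.cache_capacity = cache_capacity
        self.cache_hits = 0

    def put(self, name, data):
        """Append a small file to the merged blob and record its position."""
        self.index[name] = (len(self.blob), len(data))
        self.blob.extend(data)

    def get(self, name):
        """Return a file's bytes, serving from the cache when possible."""
        if name in self.cache:
            self.cache.move_to_end(name)    # mark as most recently used
            self.cache_hits += 1
            return self.cache[name]
        offset, length = self.index[name]   # index lookup instead of a scan
        data = bytes(self.blob[offset:offset + length])
        self.cache[name] = data
        if len(self.cache) > self.cache_capacity:
            self.cache.popitem(last=False)  # evict the least recently used
        return data


store = SmallFileStore(cache_capacity=2)
store.put("a.txt", b"alpha")
store.put("b.txt", b"bravo!")
store.put("c.txt", b"charlie")
```

In this sketch the master node would only need to track the single merged object, while the per-file metadata lives in the module's own index, which is the kind of decoupling the abstract describes.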
[Degree-granting institution]: Guangdong University of Technology
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP311.13

Article ID: 2377017

Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2377017.html
