Research and Optimization of Hadoop Small-File Processing Technology
[Abstract]: With the rapid development of the Internet and the exponential growth of digital information, the world has entered the era of big data. Traditional methods of storing and computing data no longer hold an advantage at this scale, and how to store massive volumes of data efficiently and sensibly has become a focus for industries at home and abroad. The concept of cloud computing emerged in response to this demand for computation and storage, and as cloud computing technology has developed, storage and computing have become some of the most active research fields. Hadoop, an open-source project of the Apache Software Foundation, performs very well in distributed storage and distributed computing, and more and more universities and enterprises use it to support their business needs. Although Hadoop is designed specifically for storing and computing big data, storing many small files places heavy memory pressure on the master node, degrades file access efficiency, and indirectly lowers the computational efficiency of the MapReduce programming model. Building on Hadoop's MapReduce computing model and the HDFS distributed file system, this thesis focuses on general-purpose optimization of Hadoop-based small-file processing. To address the NameNode memory waste, slow file reads, and low MapReduce efficiency that arise when Hadoop stores and processes small files, the thesis first studies the small-file processing techniques that Hadoop itself provides and analyzes their advantages and disadvantages in depth. It then studies and optimizes Hadoop at both the MapReduce level and the HDFS level to improve the efficiency with which Hadoop stores and processes small files.
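To make the master-node memory pressure concrete, the following back-of-the-envelope sketch may help. It is not taken from the thesis: it assumes the commonly quoted rule of thumb of roughly 150 bytes of NameNode heap per file object and per block object, a 128 MiB block size, and a hypothetical workload of ten million 100 KiB files that could instead be packed into 1 GiB container files. Real figures vary with Hadoop version, path lengths, and replication metadata.

```python
import math

BYTES_PER_OBJECT = 150            # assumed rule of thumb, not a figure from the thesis
BLOCK_SIZE = 128 * 1024 * 1024    # assumed HDFS block size (128 MiB)


def namenode_bytes(num_files, file_size, block_size=BLOCK_SIZE):
    """Rough NameNode heap estimate: one object per file plus one per block."""
    blocks_per_file = max(1, math.ceil(file_size / block_size))
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT


# Hypothetical workload: ten million files of 100 KiB each.
num_small = 10_000_000
small_size = 100 * 1024
small_total = namenode_bytes(num_small, small_size)

# The same bytes packed into 1 GiB container files (the last one is
# treated as full, which slightly overestimates the merged case).
merged_size = 1024 ** 3
num_merged = math.ceil(num_small * small_size / merged_size)
merged_total = namenode_bytes(num_merged, merged_size)

print(f"small files : {small_total / 1e9:.2f} GB of NameNode heap")
print(f"merged files: {merged_total / 1e6:.2f} MB of NameNode heap")
```

Under these assumptions the metadata cost is driven by the number of files rather than by the volume of data, which is why merging small files is the common theme of the optimizations described in this abstract.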
At the MapReduce level, the thesis studies the MapReduce execution process and the InputFormat architecture in depth and analyzes the MapReduce source code and the implementation of its internal methods in detail. By studying and implementing the abstract class CombineFileInputFormat, small-file inputs are merged at the MapReduce level, which improves the efficiency of computing over small files in Hadoop. At the HDFS level, the thesis proposes a distributed file system with an independent small-file processing module. The module does not depend on HDFS, so it is decoupled from the Hadoop cluster and the two do not affect each other. It merges small files, maintains an index mapping for them, and serves reads, and it adds a small-file cache to improve file access efficiency, which indirectly improves the efficiency of MapReduce when processing small files. Finally, experimental results show that the custom CombineFileInputFormat yields higher MapReduce processing efficiency than the other input formats, and that the independent small-file processing module speeds up file access and reduces memory pressure on the master node.
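The merge, index-mapping, and cache ideas mentioned above can be illustrated with a minimal in-memory sketch. This is not the thesis's module: the class name `SmallFileStore`, its methods, and the LRU eviction policy are hypothetical choices made only for illustration, and a real implementation would write the merged data to distributed storage rather than to a `bytearray`.

```python
from collections import OrderedDict


class SmallFileStore:
    """Hypothetical sketch: merge small files into one blob, index them, cache reads."""

    def __init__(self, cache_size=2):
        self.blob = bytearray()        # merged contents of all small files
        self.index = {}                # file name -> (offset, length) in the blob
        self.cache = OrderedDict()     # LRU cache of recently read files
        self.cache_size = cache_size
        self.hits = 0
        self.misses = 0

    def put(self, name, data):
        """Append one small file to the merged blob and record its position."""
        if name in self.index:
            raise ValueError(f"{name} is already stored")
        self.index[name] = (len(self.blob), len(data))
        self.blob.extend(data)

    def get(self, name):
        """Read a small file, serving it from the cache when possible."""
        if name in self.cache:
            self.cache.move_to_end(name)       # mark as most recently used
            self.hits += 1
            return self.cache[name]
        offset, length = self.index[name]      # raises KeyError if unknown
        data = bytes(self.blob[offset:offset + length])
        self.misses += 1
        self.cache[name] = data
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)     # evict the least recently used entry
        return data


store = SmallFileStore(cache_size=2)
store.put("a.log", b"alpha")
store.put("b.log", b"bravo!")
first = store.get("b.log")     # read from the merged blob via the index
second = store.get("b.log")    # served from the cache
print(first, store.index["b.log"], store.hits, store.misses)
```

The point of the sketch is that the master node only has to track one merged object, while per-file lookups go through the index and repeated reads are absorbed by the cache.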
[Degree-granting institution]: Guangdong University of Technology
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP311.13
Zhao Fei; Research and Optimization of Hadoop Small-File Processing Technology [D]; Guangdong University of Technology; 2016
Article ID: 2377017
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2377017.html