HDFS-Based Small-File Handling and Performance Optimization and Improvement of the Associated MapReduce Computing Model
Topic: Hadoop + HDFS; Source: Master's thesis, Jilin University, 2012
[Abstract]: With the rapid development of the Internet, data volumes have grown explosively, and traditional technical architectures are increasingly unable to meet the demands of today's massive data sets. The processing and storage of massive data has therefore become a major focus of current research. Drawing on Google's papers, Doug Cutting and others developed Hadoop, a distributed computing platform, to carry out index computation for large-scale search engines. Although Hadoop itself was designed for streaming access to large files, its adoption has spread across many industries and fields, and the demands placed on it have broadened accordingly. Handling small files has become a bottleneck of the Hadoop platform.
This thesis studies small-file handling on the Hadoop platform and, after reviewing existing solutions, proposes its own approach. A small file is one whose size is smaller than the HDFS block size (typically 64 MB). Large numbers of small files severely degrade Hadoop's performance and scalability, mainly for the following reasons. First, in HDFS every block, file, and directory is kept in the namenode's memory as an object occupying roughly 150 bytes; with ten million small files the namenode needs about 2 GB of space (storing two copies), and if the count grows to one hundred million it needs about 20 GB. Small files thus consume a large share of namenode memory, and the namenode's memory capacity severely constrains cluster scaling and its applications. Second, accessing many small files is far slower than accessing a few large files: HDFS was originally developed for streaming access to large files, and reading many small files requires constantly hopping from one datanode to another, which seriously hurts performance. Finally, processing many small files is far slower than processing the same volume of data stored in large files: each small file occupies a slot, and starting a task is expensive, so a large share of the time, even most of it, is spent launching and releasing tasks.
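To make the scaling above easier to see, the following is a tiny illustrative calculator, not something taken from the thesis: it simply multiplies out the roughly 150-bytes-per-object figure quoted above, while the number of namenode objects per file and the number of metadata copies are assumptions the abstract does not pin down, so the totals it prints should be read as order-of-magnitude only.

```java
/** Order-of-magnitude namenode memory estimate for many small files (illustrative only). */
public class NamenodeMemoryEstimate {

    private static final long BYTES_PER_OBJECT = 150L;   // per-object figure quoted in the text

    /** fileCount small files, objectsPerFile namenode objects each, copies copies of the metadata. */
    static double estimateGB(long fileCount, int objectsPerFile, int copies) {
        return fileCount * objectsPerFile * copies * (double) BYTES_PER_OBJECT / 1e9;
    }

    public static void main(String[] args) {
        // Assumed: one namenode object per small file, metadata kept twice (assumption, not from the thesis).
        System.out.printf("10 million small files  : ~%.1f GB%n", estimateGB(10_000_000L, 1, 2));
        System.out.printf("100 million small files : ~%.1f GB%n", estimateGB(100_000_000L, 1, 2));
    }
}
```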
To address small-file storage and management, this thesis builds HSF, a new top-level file system on top of HDFS. HSF classifies small files and applies different strategies to different kinds. For files that are inherently tiny (for example, images), it uses SequenceFile as a container to merge the small files and builds an efficient index mechanism so that users can still randomly access the original small files; files that can be combined are merged directly, again with an index to support random access. The index for random access to small files is a two-level index whose keys are hash values, combined with a caching mechanism that keeps an appropriate portion of the index table in memory. When small files within the same merged file are accessed, this raises the efficiency of random access and thereby mitigates the problems that small files cause for the Hadoop system.
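The merging step described above names SequenceFile as the container. As a rough illustration of that idea only, and not the thesis's HSF code (the class and method names here are invented, and a flat in-memory map stands in for the two-level hash index the thesis describes), a merge pass against the classic Hadoop Java API might look like this:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/**
 * Merge a directory of small files into one SequenceFile on HDFS and
 * remember each file's record offset so it can be re-read individually.
 */
public class SmallFileMerger {

    /** file name -> byte offset of its record inside the merged SequenceFile */
    private final Map<String, Long> index = new HashMap<>();

    public void merge(Configuration conf, Path smallFileDir, Path mergedFile) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, mergedFile, Text.class, BytesWritable.class);
        try {
            for (FileStatus status : fs.listStatus(smallFileDir)) {
                if (status.isDir()) {
                    continue;                                   // only merge plain files
                }
                byte[] content = readFully(fs, status);
                // Offset before append marks where this record will start.
                index.put(status.getPath().getName(), writer.getLength());
                writer.append(new Text(status.getPath().getName()),
                              new BytesWritable(content));
            }
        } finally {
            IOUtils.closeStream(writer);
        }
    }

    private static byte[] readFully(FileSystem fs, FileStatus status) throws IOException {
        byte[] buf = new byte[(int) status.getLen()];
        FSDataInputStream in = fs.open(status.getPath());
        try {
            in.readFully(0, buf);
        } finally {
            IOUtils.closeStream(in);
        }
        return buf;
    }

    public Map<String, Long> getIndex() {
        return index;
    }
}
```

Recording `writer.getLength()` before each append is the detail that later enables random access: `SequenceFile.Reader.seek()` only accepts offsets that were obtained this way while the file was being written.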
The experimental part of the thesis tests the proposed file system systematically, using different kinds of data to build different test cases. Reading small files directly versus reading merged small files is tested separately for binary image files and for text files, verifying that read performance from both the local file system and HDFS grows linearly and that increasing data volume does not disturb normal operation of the system. The WordCount example program shipped with MapReduce is used, on text data, to compare merged files against unmerged small files; this experiment confirms that the file system is well suited to the MapReduce computing model. Random reads of small files are also tested separately for binary image files and text files; this experiment verifies the efficiency of the system's random access to small files, which outperforms Hadoop's own HAR archive file system.
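The random-read experiment relies on jumping straight to one small file inside a merged file rather than scanning the whole container. A minimal sketch of that lookup, reusing the hypothetical offset index from the previous example (again illustrative, not the thesis's implementation):

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

/** Random access to one small file inside a merged SequenceFile via a cached offset index. */
public class SmallFileReader {

    private final Map<String, Long> index;   // file name -> record offset, kept in memory
    private final Path mergedFile;

    public SmallFileReader(Map<String, Long> index, Path mergedFile) {
        this.index = index;
        this.mergedFile = mergedFile;
    }

    public byte[] read(Configuration conf, String fileName) throws IOException {
        Long offset = index.get(fileName);
        if (offset == null) {
            throw new IOException("no such small file: " + fileName);
        }
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, mergedFile, conf);
        try {
            reader.seek(offset);                 // jump straight to the record boundary
            Text key = new Text();
            BytesWritable value = new BytesWritable();
            if (!reader.next(key, value)) {
                throw new IOException("offset points past end of " + mergedFile);
            }
            // getBytes() may be padded, so trim the buffer to the real length.
            return Arrays.copyOf(value.getBytes(), value.getLength());
        } finally {
            IOUtils.closeStream(reader);
        }
    }
}
```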
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree conferred]: 2012
[CLC number]: TP338.8
[Citing literature]
Related master's theses (first 4)
1. Zhao Shaofeng. Research on Key Technologies of Cloud Storage Systems [D]. Zhengzhou University, 2013.
2. Dai Wanneng. Research and Implementation of Inverted Index Technology on the Hadoop Platform [D]. University of Electronic Science and Technology of China, 2013.
3. Zhang Xing. Research and Implementation of a Hadoop-Based Cloud Storage Platform [D]. University of Electronic Science and Technology of China, 2013.
4. Zhang Dan. Research on Techniques for File Storage Optimization in HDFS [D]. Nanjing Normal University, 2013.