Research and Design of a Distributed Cache Strategy for HBase
Published: 2018-01-16 21:34
Source: Beijing Jiaotong University, master's thesis, 2017. Document type: degree thesis
Keywords: HBase; partitioning; read/write performance; consistent hashing algorithm; cache replacement strategy
[Abstract]: With the rapid development of the Internet, the value of big data has received growing attention. Big data storage systems are the infrastructure for big data research and applications, and HBase is a typical non-relational database among them. HBase still suffers from unbalanced partitioning and a single cache replacement strategy, both of which constrain the read/write performance of a cluster. This thesis studies these problems with the aim of optimizing HBase's read/write performance. The work was supported by the National Natural Science Foundation of China (No. 61172072, 61271308), the Beijing Natural Science Foundation (No. 4112045), and the Specialized Research Fund for the Doctoral Program of Higher Education (No. 20100009110002). The main contributions are as follows.
(1) Write cache. Without partitioning, the existing HBase can hardly exploit the advantages of a distributed system. Even with pre-partitioning, there is neither a pre-partitioning method that suits every data table nor a scheme that adjusts the system load adaptively. To address this, the thesis designs a two-stage partitioning method. In the pre-partitioning stage, the RowKey is redesigned using the hashing effect of MD5. In the adaptive partitioning stage, the thesis designs a RegionServer performance evaluation strategy and performs adaptive partitioning based on it. The evaluation strategy combines the analytic hierarchy process (AHP) with TOPSIS; the scheme uses and improves the consistent hashing algorithm, and a new data structure is designed to implement the improved algorithm.
(2) Read cache. The LRU cache replacement strategy of the existing BlockCache is coarse. Although it divides the cache into several layers, all layers use the same policy: blocks are replaced solely in order of the data's last update time. The thesis refines the replacement strategy of each layer. The Single layer additionally takes data hotness into account, the Multi layer additionally weighs Block size, and the threshold parameter for promotion from the Single layer to the Multi layer is redefined, which lowers the probability of a Full GC. In addition, to counter the slower queries on closely related data such as continuous data, a second-level cache is designed using ideas from community detection.
(3) Experiments. The thesis prepares continuous, random, and concentrated data to simulate different experimental scenarios, deploys the redesigned HBase system on homogeneous and heterogeneous clusters, tests read/write performance, and compares and analyzes the results against the original HBase. The experiments show that the proposed scheme improves the read/write performance of the original HBase to a certain extent, and that the improved HBase suits the vast majority of data table types, with good applicability and stability.
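The two ideas behind the write-side partitioning can be illustrated with a minimal sketch. It assumes a RowKey is salted by prepending the first hex characters of its MD5 digest, and that keys are mapped to servers with a plain consistent-hash ring using virtual nodes. The names `salted_row_key`, `HashRing`, `prefix_len`, and `vnodes` are hypothetical, and this is textbook consistent hashing, not the improved algorithm or the new data structure designed in the thesis.

```python
import bisect
import hashlib


def salted_row_key(row_key: str, prefix_len: int = 4) -> str:
    """Prepend the first hex characters of the key's MD5 digest (assumed scheme)."""
    digest = hashlib.md5(row_key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}_{row_key}"


class HashRing:
    """Plain consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, servers, vnodes=100):
        # Each server contributes `vnodes` points on the ring.
        self._ring = sorted(
            (self._hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def server_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around at the end.
        idx = bisect.bisect_right(self._points, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]


if __name__ == "__main__":
    ring = HashRing(["rs1", "rs2", "rs3"])
    for i in range(5):
        key = salted_row_key(f"user{i:06d}")
        print(key, "->", ring.server_for(key))
```

Sequential keys such as `user000001` and `user000002` receive unrelated prefixes, so they no longer sort next to each other. In a ring of this kind, removing one server reassigns only the keys that server held.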
[Degree-granting institution]: Beijing Jiaotong University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP333
Article ID: 1434930
Article link: http://sikaile.net/kejilunwen/jisuanjikexuelunwen/1434930.html