Research on Hadoop-Based Web Log Storage and Preprocessing Optimization
Source: Hebei University of Engineering, master's thesis, 2016. Document type: degree thesis
Keywords: Web log preprocessing; Hadoop; HBase load balancing; MapReduce
[Abstract]: With the growth of the Internet and the mobile Internet, the volume of Web logs on servers has expanded rapidly. Web logs record how users browse Web pages, and they provide important guidance for website construction and for delivering targeted services. However, the data in raw Web log files is usually incomplete, redundant, or even erroneous. Analyzing it directly is difficult and may produce wrong results, so preprocessing the Web log data is necessary. In addition, given the constraints of traditional relational database storage and the limitations of single-node data processing, this thesis uses the Hadoop distributed processing platform to store and preprocess Web log data. The main contents are as follows.

(1) Web log data storage. Faced with the rapid growth of massive Web logs, traditional storage technology suffers from high construction cost, complex operation and maintenance, and limited scalability, whereas today's popular cloud databases offer dynamic extensibility, high scalability, high throughput, and low cost. This work therefore stores Web logs in HBase, the Hadoop database, to take full advantage of the cluster's distributed processing.

(2) HBase load balancing optimization. How data is stored in HBase largely determines the performance of the whole cluster and directly affects the efficiency of subsequent reads. When MapReduce reads Web log data from HBase, access "hot spots" may arise. To address this, the thesis proposes an improved load balancing algorithm, an HBase load balancing algorithm based on sub-table constraints. During sub-table allocation it considers not only the load of each HRegionServer but also how the regions produced by splitting a sub-table are distributed, so as to balance the cluster load as far as possible.

(3) Web log preprocessing with MapReduce. Preprocessing determines the quality of Web mining, and the computing power of a single node increasingly falls short as Web logs grow at scale. MapReduce supports large-scale cluster operation. After analyzing the Web log preprocessing workflow, the thesis reads the data from HBase and carries out the preprocessing with the MapReduce computing model.

Comparative experiments verify that the optimized HBase load balancing algorithm effectively resolves unbalanced access load in a suitable cluster environment, and that MapReduce is efficient for Web log preprocessing. Finally, the thesis optimizes the preprocessing algorithm and verifies the efficiency of the optimized algorithm.
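The load balancing idea described in the abstract, weighing both the overall load of each region server and how one table's regions are spread across servers, can be illustrated with a small sketch. This is not the thesis's algorithm, whose details the abstract does not give. The function name `assign_regions`, the greedy strategy, and the per-table cap of `ceil(regions_of_table / number_of_servers)` are all assumptions made for illustration, and the code is plain Python that does not call any HBase API.

```python
from collections import defaultdict
from math import ceil


def assign_regions(regions, servers):
    """Greedily place (table, region_id) pairs on servers.

    Hypothetical sketch: each region goes to the least-loaded server
    that is still under a per-table cap, so that no single server
    collects most of the regions of one table.
    """
    if not servers:
        raise ValueError("at least one server is required")

    # Assumed cap: a server may hold at most
    # ceil(regions_of_table / number_of_servers) regions of a table.
    per_table_total = defaultdict(int)
    for table, _ in regions:
        per_table_total[table] += 1
    cap = {t: ceil(n / len(servers)) for t, n in per_table_total.items()}

    placement = {s: [] for s in servers}
    per_table_on_server = {s: defaultdict(int) for s in servers}

    for table, region_id in regions:
        # Servers that have not yet reached the cap for this table.
        candidates = [
            s for s in servers if per_table_on_server[s][table] < cap[table]
        ]
        # Among those, choose the one holding the fewest regions overall.
        target = min(candidates, key=lambda s: len(placement[s]))
        placement[target].append((table, region_id))
        per_table_on_server[target][table] += 1

    return placement


if __name__ == "__main__":
    demo_regions = [("weblog", i) for i in range(4)] + [("users", i) for i in range(2)]
    for server, held in assign_regions(demo_regions, ["rs1", "rs2"]).items():
        print(server, held)
```

With four `weblog` regions and two `users` regions on two servers, each server ends up with two `weblog` regions and one `users` region, so a MapReduce scan over `weblog` is split evenly between the servers. A balancer that looked only at total region count could reach the same totals while leaving all `weblog` regions on one server.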
[Degree-granting institution]: Hebei University of Engineering
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP311.13