Design and Implementation of a Web Log Analysis System on a Eucalyptus-Based Hadoop Cluster

Published: 2018-06-18 08:44

Topic: cloud computing + Eucalyptus; Reference: Beijing University of Posts and Telecommunications, 2016 master's thesis

[Abstract]: With the rapid growth of the Internet, the volume of Web logs keeps increasing, and these logs contain a great deal of information; analyzing them can yield information that is valuable to an enterprise. As Web log data volumes grow, traditional single-machine analysis has hit a bottleneck: once the data exceeds a certain size, the computing power of a single node can no longer meet demand. This thesis designs and implements a Web log analysis system on a Eucalyptus-based Hadoop cluster. The system uses cloud computing and distributed technology to analyze and process large-scale Web logs, and test results show that it greatly improves computing capacity and running speed. First, a Eucalyptus private cloud platform is built. To combine the Eucalyptus platform's ability to create virtual machines quickly and conveniently with the distributed processing of a Hadoop cluster, the Hadoop cluster is deployed on the Eucalyptus cloud platform. Second, MapReduce programs analyze the Web logs of an online education website, producing site metrics such as number of visitors, page views, number of IPs, bounce rate, average visit duration, traffic sources, and visited pages; the results are presented to users in visual form. In addition, the thesis applies an improved parallel Apriori algorithm to mine association rules from the Web logs, yielding the correlations between the site's pages. Site administrators and operators can use these metrics to understand the site better and, based on the results, adjust the site structure, implement effective marketing strategies, make personalized recommendations to users, and so on. Finally, log-analysis performance is compared between the distributed and single-machine environments; the results show that the distributed environment performs far better when processing large volumes of Web log data. The improved parallel Apriori algorithm is also compared with single-machine Apriori, and the results show that it performs better in running time and in CPU and memory utilization.
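The metric computation described in the abstract can be illustrated with a small map/reduce-style sketch. This is not the thesis's implementation: it runs in a single process rather than on Hadoop, it assumes the logs are in Common Log Format (the abstract does not state the layout), and it treats each IP as one visit when computing the bounce rate. The names `map_line`, `reduce_metrics`, `analyze`, and `SAMPLE` are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: Common Log Format. The actual log layout is not given in the abstract.
LOG_PATTERN = re.compile(
    r'^(\S+) \S+ \S+ \[([^\]]+)\] "(?:GET|POST) (\S+)[^"]*" (\d{3}) \S+'
)

def map_line(line):
    """Map step: emit (ip, page) for each request answered with a 2xx status."""
    match = LOG_PATTERN.match(line)
    if match and match.group(4).startswith("2"):
        yield match.group(1), match.group(3)

def reduce_metrics(pairs):
    """Reduce step: aggregate page views, distinct IPs, and bounce rate."""
    pages_by_ip = defaultdict(list)
    for ip, page in pairs:
        pages_by_ip[ip].append(page)
    page_views = sum(len(pages) for pages in pages_by_ip.values())
    ip_count = len(pages_by_ip)
    # Simplification: one IP counts as one visit; a bounce is a visit
    # that viewed exactly one page.
    bounces = sum(1 for pages in pages_by_ip.values() if len(pages) == 1)
    bounce_rate = bounces / ip_count if ip_count else 0.0
    return {"page_views": page_views, "ip_count": ip_count, "bounce_rate": bounce_rate}

def analyze(lines):
    """Run the map step over every line, then reduce the emitted pairs."""
    pairs = [pair for line in lines for pair in map_line(line)]
    return reduce_metrics(pairs)

# Made-up sample lines for illustration only.
SAMPLE = [
    '10.0.0.1 - - [18/Jun/2016:08:44:00 +0800] "GET /course/1 HTTP/1.1" 200 512',
    '10.0.0.1 - - [18/Jun/2016:08:45:00 +0800] "GET /course/2 HTTP/1.1" 200 734',
    '10.0.0.2 - - [18/Jun/2016:08:46:00 +0800] "GET /index HTTP/1.1" 200 128',
    '10.0.0.3 - - [18/Jun/2016:08:47:00 +0800] "GET /missing HTTP/1.1" 404 0',
]

if __name__ == "__main__":
    print(analyze(SAMPLE))
```

On a real cluster the map and reduce steps would run as separate distributed tasks keyed by IP or session; the sketch only shows how the aggregates relate to the raw log lines.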
[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP311.13;TP393.09



Article ID: 2034877
