Research and Implementation of a Distributed Web Log Analysis System Based on the Hadoop Platform

Published: 2018-11-04 14:52

[Abstract]: As technology advances and the Internet develops rapidly, the Internet has become ever more closely tied to people's daily lives. Websites generate large volumes of log data every day, and users' access records are stored in web logs. Analyzing this log data has become an important way to understand how a website is operating and how users access it, and mining the valuable information it contains helps enterprises provide better and more convenient services. Most log analysis systems today still run on a single machine, and neither their performance nor their storage capacity can cope with massive web log data. Many data processing solutions have emerged to meet the demands of big data analysis. Cloud computing technology represented by Hadoop, in particular, offers powerful distributed storage and computing capabilities and provides a good platform for storing and analyzing massive web logs. This thesis first introduces the development of distributed technology and describes the background of current web log mining. It then studies the core Hadoop components HDFS and MapReduce together with the Hive data warehouse, examining in depth the data storage principles, data access patterns, and fault-tolerance mechanisms of the HDFS distributed file system, as well as the programming model of the MapReduce parallel computing framework. Next, a suitable business data processing model is established for web log analysis, and an efficient web log analysis system is designed on the Hadoop platform. The system consists of five main modules: log storage, log collection, log preprocessing, key metric statistics, and log mining. Log storage combines HDFS with MySQL, with HDFS holding both the raw logs and the cleaned logs. Log preprocessing uses parallel MapReduce jobs to clean and standardize noisy data. Metric statistics use HQL scripts on the Hive data warehouse to analyze website operation. Log mining applies a K-means algorithm improved on the MapReduce platform to cluster registered users, which improves the algorithm's efficiency on massive data. Finally, system tests show that the Hadoop-based web log analysis system improves considerably on traditional single-machine processing in collection, processing, storage, and mining, reducing developer workload while also increasing system efficiency.
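The abstract mentions clustering registered users with K-means on MapReduce. As a rough illustration of how standard K-means splits into a map step and a reduce step, the sketch below simulates one job per iteration in plain Python. It is not the thesis's improved algorithm, whose details are not given here, and it does not use any Hadoop API. The function names (`map_assign`, `reduce_mean`, `kmeans`) and the sample data are assumptions made for this example.

```python
from collections import defaultdict
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def map_assign(point: Point, centroids: Sequence[Point]) -> Tuple[int, Point]:
    """Map step: emit (index of nearest centroid, point)."""
    nearest = min(range(len(centroids)),
                  key=lambda i: squared_distance(point, centroids[i]))
    return nearest, point


def reduce_mean(points: List[Point]) -> Point:
    """Reduce step: average all points that share one centroid index."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def kmeans_iteration(points: Sequence[Point],
                     centroids: Sequence[Point]) -> List[Point]:
    """One simulated MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for p in points:
        key, value = map_assign(p, centroids)
        groups[key].append(value)  # stands in for the shuffle phase
    # A centroid with no assigned points keeps its previous position.
    return [reduce_mean(groups[i]) if groups[i] else centroids[i]
            for i in range(len(centroids))]


def kmeans(points: Sequence[Point], centroids: Sequence[Point],
           max_iter: int = 20) -> List[Point]:
    """Repeat iterations until the centroids stop moving."""
    current = list(centroids)
    for _ in range(max_iter):
        updated = kmeans_iteration(points, current)
        if updated == current:
            break
        current = updated
    return current


if __name__ == "__main__":
    # Hypothetical two-feature user vectors, for illustration only.
    users = [(1.0, 1.0), (1.0, 3.0), (9.0, 9.0), (11.0, 9.0)]
    print(kmeans(users, [(0.0, 0.0), (10.0, 10.0)]))
```

In a real Hadoop job the grouping by centroid index would be done by the framework's shuffle, and each iteration would typically read the current centroids from shared storage before the map phase starts.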
[Degree-granting institution]: Southwest Petroleum University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP311.13


Article ID: 2310163


Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2310163.html
