Research and Implementation of a Distributed Web Log Analysis System Based on the Hadoop Platform
[Abstract]: With advances in technology and the rapid growth of the Internet, the Internet has become increasingly intertwined with people's daily lives. Websites generate large volumes of log data every day, and users' access records are kept in web logs. Analyzing this log data has become an important way to understand how a website is operating and how users access it, and mining valuable information from it helps enterprises provide users with better and more convenient services. At present, most log analysis systems run on a single machine, and neither their performance nor their storage capacity can cope with massive web log data. To meet the needs of big data analysis, many data processing schemes have emerged, notably the cloud computing technology represented by Hadoop, whose powerful distributed storage and computing capabilities provide a good platform for storing and analyzing massive web logs. This thesis first introduces the development of distributed technology and describes the background of current web log mining. It then studies the core Hadoop components HDFS and MapReduce together with the Hive data warehouse, examining in detail how the HDFS distributed file system stores data, how data is accessed, the system's fault-tolerance mechanism, and the programming model of the MapReduce parallel computing framework. Next, a business data processing model suited to the web log analysis system is established, and an efficient web log analysis system is designed on the Hadoop platform. The system consists of five main modules: log storage, log collection, log preprocessing, key index statistics, and log mining. Log storage combines HDFS and MySQL, with HDFS holding both the original and the cleaned logs. Log preprocessing uses parallel MapReduce jobs to clean and standardize the noisy data. Index statistics uses HQL scripts on the Hive data warehouse to analyze how the site is operating. Log mining runs an improved K-means algorithm on the MapReduce platform to cluster registered users, which improves the algorithm's efficiency on massive data. Finally, system tests show that the Hadoop-based web log analysis system brings substantial improvements in collection, processing, storage, and mining, which both reduces developers' workload and improves the system's efficiency.
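The abstract does not say how the K-means algorithm was improved, so the following is only a minimal sketch of plain K-means written in a map/reduce style, to illustrate how the clustering step can be split into parallel work. All function names (`map_assign`, `reduce_mean`, `kmeans_iteration`, `kmeans`) are hypothetical and not taken from the thesis. The in-memory dictionary stands in for the shuffle phase that a real Hadoop job would handle, and user records are assumed to be numeric feature tuples.

```python
from collections import defaultdict
from math import dist


def map_assign(point, centroids):
    """Map step: emit (index of the nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)


def reduce_mean(pairs):
    """Reduce step: sum points and counts for one centroid, return their mean."""
    total = None
    count = 0
    for point, n in pairs:
        if total is None:
            total = list(point)
        else:
            total = [a + b for a, b in zip(total, point)]
        count += n
    return tuple(v / count for v in total)


def kmeans_iteration(points, centroids):
    """One K-means pass: map every point, group by key, reduce each group."""
    groups = defaultdict(list)  # stands in for the MapReduce shuffle phase
    for p in points:
        key, value = map_assign(p, centroids)
        groups[key].append(value)
    # A centroid that received no points keeps its previous position.
    return [
        reduce_mean(groups[i]) if i in groups else tuple(centroids[i])
        for i in range(len(centroids))
    ]


def kmeans(points, centroids, max_iter=20, tol=1e-9):
    """Repeat iterations until no centroid moves more than tol (assumed stop rule)."""
    for _ in range(max_iter):
        new_centroids = kmeans_iteration(points, centroids)
        if all(dist(a, b) <= tol for a, b in zip(new_centroids, centroids)):
            return new_centroids
        centroids = new_centroids
    return centroids
```

For example, `kmeans([(0, 0), (0, 2), (10, 10), (10, 12)], [(0, 0), (10, 10)])` returns `[(0.0, 1.0), (10.0, 11.0)]`. Because each map call depends only on one point and the current centroids, the assignment work can be spread across nodes, which is the property a MapReduce implementation relies on.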
[Degree-granting institution]: Southwest Petroleum University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13
Article ID: 2310163
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2310163.html