Research and Implementation of a Distributed Web Log Analysis System Based on the Hadoop Platform
[Abstract]: With the advance of technology and the rapid growth of the Internet, the web has become ever more closely tied to people's daily lives. Websites generate large volumes of log data every day, and users' access records are preserved in these web logs. Analyzing this log data has become an important means of understanding site operation, user access patterns, and related information, and mining valuable information from it helps enterprises provide users with better and more convenient services. At present, most log analysis systems run on a single machine and, when faced with massive web log data, fall short in both processing performance and storage capacity. To meet the needs of big data analysis, many data processing schemes have emerged, most notably the cloud computing technology represented by Hadoop, whose powerful distributed storage and computing capabilities provide a good platform for storing and analyzing massive web logs. This paper first reviews the development of distributed technology and the background of current web log mining. It then studies the core components of Hadoop: HDFS, MapReduce, and the Hive data warehouse, examining in detail the data storage principles of the HDFS distributed file system, its data access modes, its fault-tolerance mechanism, and the programming model of the MapReduce parallel computing framework. On this basis, a suitable business data processing model is established for web log analysis, and an efficient web log analysis system is designed on the Hadoop platform. The system comprises five modules: log collection, log storage, log preprocessing, key index statistics, and log mining. Log storage combines HDFS and MySQL, with HDFS holding both the raw logs and the cleaned logs. Log preprocessing uses MapReduce to parallelize the cleaning of noisy data into a standardized form.
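The MapReduce-parallelized log cleaning described above could be realized as a Hadoop Streaming mapper that parses each raw log line and discards noise. The sketch below is illustrative only: the Common Log Format layout and the specific filtering rules (drop malformed lines, static-resource requests, and non-2xx responses) are assumptions, not the thesis's actual cleaning rules.

```python
# Hypothetical Hadoop Streaming mapper for web log cleaning.
# Assumes Common Log Format input; filter rules are illustrative, not the thesis's.
import re
import sys

# Common Log Format: host ident user [time] "method url protocol" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)
STATIC_SUFFIXES = ('.css', '.js', '.png', '.jpg', '.gif', '.ico')

def clean_line(line):
    """Return a tab-separated cleaned record, or None if the line is noise."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None                      # malformed line: drop
    if m.group('url').endswith(STATIC_SUFFIXES):
        return None                      # static resource request: drop
    if not m.group('status').startswith('2'):
        return None                      # keep only successful requests
    return '\t'.join(m.group('host', 'time', 'method', 'url', 'status'))

if __name__ == '__main__':
    for raw in sys.stdin:                # Hadoop Streaming feeds raw lines on stdin
        rec = clean_line(raw.rstrip('\n'))
        if rec is not None:
            print(rec)                   # cleaned record to stdout, one per line
```

Run under Hadoop Streaming, each mapper instance would process one split of the raw log files in parallel, which is how MapReduce scales the cleaning step across the cluster.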
Index statistics are produced by Hive HQL scripts running against the data warehouse to analyze the operation of the site. Log mining applies an improved K-means algorithm on the MapReduce platform to cluster registered users, which improves the algorithm's efficiency on massive data. Finally, system tests show that the Hadoop-based web log analysis system achieves substantial improvements in collection, processing, storage, and mining, reducing developers' workload while improving overall system efficiency.
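Parallelizing K-means on MapReduce, as described above, typically splits each iteration into a map step (assign every point to its nearest centroid, done independently per point) and a reduce step (average the points per centroid). The single-process sketch below illustrates only this general scheme; it is not the thesis's improved algorithm, and the data shapes are assumptions.

```python
# Minimal single-process sketch of one K-means iteration in MapReduce style.
# Illustrates the general map/reduce split only, not the thesis's improvement.
from collections import defaultdict
import math

def kmeans_map(point, centroids):
    """Map step: emit (index of nearest centroid, point)."""
    dists = [math.dist(point, c) for c in centroids]
    return dists.index(min(dists)), point

def kmeans_reduce(assignments):
    """Reduce step: average the points assigned to each centroid index."""
    groups = defaultdict(list)
    for idx, p in assignments:
        groups[idx].append(p)
    return {idx: tuple(sum(coord) / len(pts) for coord in zip(*pts))
            for idx, pts in groups.items()}

def kmeans_iteration(points, centroids):
    """One full iteration: assign all points, then recompute centroids."""
    assignments = [kmeans_map(p, centroids) for p in points]  # parallelizable over points
    new_centroids = kmeans_reduce(assignments)
    # A centroid with no assigned points keeps its old position.
    return [new_centroids.get(i, c) for i, c in enumerate(centroids)]
```

On a real cluster the map step runs across many workers while the reduce step aggregates per centroid, so each iteration scales with the data size; this is the property that makes K-means practical on massive logs.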
[Degree-granting institution]: Southwest Petroleum University (西南石油大學)
[Degree level]: Master's
[Year of award]: 2017
[Classification number]: TP311.13
[References]
Related journal articles (top 10)
1 何非;何克清;;大數(shù)據(jù)及其科學問題與方法的探討[J];武漢大學學報(理學版);2014年01期
2 余琦;凌捷;;基于HDFS的云存儲安全技術(shù)研究[J];計算機工程與設(shè)計;2013年08期
3 高洪;楊慶平;黃震江;;基于Hadoop平臺的大數(shù)據(jù)分析關(guān)鍵技術(shù)標準化探討[J];信息技術(shù)與標準化;2013年05期
4 周婷;張君瑛;羅成;;基于Hadoop的K-means聚類算法的實現(xiàn)[J];計算機技術(shù)與發(fā)展;2013年07期
5 孟小峰;慈祥;;大數(shù)據(jù)管理:概念、技術(shù)與挑戰(zhàn)[J];計算機研究與發(fā)展;2013年01期
6 李國杰;程學旗;;大數(shù)據(jù)研究:未來科技及經(jīng)濟社會發(fā)展的重大戰(zhàn)略領(lǐng)域——大數(shù)據(jù)的研究現(xiàn)狀與科學思考[J];中國科學院院刊;2012年06期
7 李超;梁阿磊;管海兵;李小勇;;海量存儲系統(tǒng)的性能管理與監(jiān)測方法研究[J];計算機應用與軟件;2012年07期
8 李建江;崔健;王聃;嚴林;黃義雙;;MapReduce并行編程模型研究綜述[J];電子學報;2011年11期
9 劉永增;張曉景;李先毅;;基于Hadoop/Hive的web日志分析系統(tǒng)的設(shè)計[J];廣西大學學報(自然科學版);2011年S1期
10 張世樂;魏芳;費仲超;;基于代理的互聯(lián)網(wǎng)用戶行為分析研究[J];計算機應用與軟件;2011年08期
相關(guān)碩士學位論文 前10條
1 蔡大威. Research on parallel algorithms on the Hadoop and Hama platforms [D]. 浙江大學, 2013.
2 李鑫. Extension and performance tuning of the Hadoop framework [D]. 西安建築科技大學, 2012.
3 周津. Research on algorithms for mining massive user behavior on the Internet [D]. 中國科學技術大學, 2011.
4 白雲龍. Research and implementation of data mining algorithms based on Hadoop [D]. 北京郵電大學, 2011.
5 楊宸鑄. Research on data mining based on Hadoop [D]. 重慶大學, 2010.
6 李應安. Research on the parallelization of clustering algorithms based on MapReduce [D]. 中山大學, 2010.
7 曾理. Research and implementation of a duplicate data cleaning model for Hadoop [D]. 南華大學, 2010.
8 張密密. Performance analysis and optimization of the MapReduce model in the Hadoop implementation [D]. 電子科技大學, 2010.
9 李亭楓. Exploration of data mining techniques for discovering network user behavior patterns [D]. 電子科技大學, 2010.
10 鄭韞e,
Article number: 2310163
Link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2310163.html