Design and Implementation of a Hadoop-Based Massive Search Log Analysis Platform
[Abstract]: Since the end of the 20th century, the growth of the Internet industry and the accelerating digitization of human activity have made information exchange ever more frequent, and effective information retrieval has become a problem that everyone faces. Search engine technology helps people find their way through this maze of information and has greatly changed how they work and live. Research on search engines is no longer confined to the engines themselves; the behavior of network users is receiving more and more attention, because systematic, in-depth study of that behavior helps capture users' explicit needs and uncover their hidden ones. A second challenge that comes with networking and informatization is how to handle massive data, which tests the storage model of traditional database servers and severely strains server CPU and I/O performance. Hadoop/Hive is currently a well-suited tool for this kind of problem. Against this background, and drawing on a large body of literature and a detailed analysis of how search engine logs are generated and of the common models, this thesis designs an analysis platform for massive search logs. The platform comprises four parts: a data preprocessing module, a data storage module, a data analysis module, and a cluster management module. A set of algorithms based on user behavior pattern mining is designed to analyze and process the search engine logs, and cluster monitoring and management are implemented in the platform monitoring module.
The work follows the data mining workflow, uses the massive-data analysis tool Hadoop as the experimental platform, adopts the MapReduce (map/reduce) programming model, and processes the massive logs with Hive, which offers a simple and practical SQL interface, and with the HBase massive-data store. The mining patterns are distributed to the individual servers for association matching and the partial results are then merged, which relieves network and server performance bottlenecks and reflects the advantages of asynchronous mining and asynchronous data reduction. Finally, the platform is verified in an experimental environment. The data are three search engine log samples (sample data, one day of data, and one month of data) provided by Sogou Labs. User search behavior is analyzed in detail in terms of user click counts, URL ranking, and user sessions. The platform's performance is also optimized, and the system running time before and after optimization is compared. The experimental results show that the log analysis platform designed in this thesis is stable and effective.
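The map/reduce flow behind the click-count and URL-ranking analysis can be illustrated with a minimal sketch. This is not the thesis's implementation: it simulates the map, shuffle, and reduce phases in a single Python process rather than calling any Hadoop API, and the four-field tab-separated record layout (time, user ID, query, clicked URL), the sample records, and all function names are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical tab-separated records: time, user_id, query, clicked_url.
# The real Sogou Labs layout may differ; this layout is an assumption.
SAMPLE_LOG = [
    "00:00:01\tu1\thadoop\thttp://a.example/",
    "00:00:05\tu1\thadoop\thttp://b.example/",
    "00:00:09\tu2\thive\thttp://a.example/",
    "malformed line",
    "00:01:00\tu3\thbase\thttp://a.example/",
]


def map_clicks(line):
    """Map step: emit (url, 1) for a well-formed record, nothing otherwise."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 4:
        return []  # skip malformed records instead of failing the job
    return [(fields[3], 1)]


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_clicks(url, counts):
    """Reduce step: sum the partial counts for one URL."""
    return url, sum(counts)


def url_click_ranking(lines):
    """Run map, shuffle, and reduce, then sort URLs by descending clicks."""
    mapped = chain.from_iterable(map_clicks(line) for line in lines)
    reduced = [reduce_clicks(url, counts) for url, counts in shuffle(mapped).items()]
    # Ties are broken by URL so the ordering is deterministic.
    return sorted(reduced, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for url, clicks in url_click_ranking(SAMPLE_LOG):
        print(url, clicks)
```

On an actual cluster the same mapper and reducer logic would be executed by Hadoop across nodes, with the framework handling the shuffle; in Hive the equivalent ranking would usually be written as a grouped count ordered by the count.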
[Degree-granting institution]: Dalian University of Technology
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.3