Design and Implementation of a Hadoop-Based Platform for Analyzing Massive Search Logs

Published: 2018-11-06 17:21

[Abstract]: Since the late 20th century, the growth of the Internet industry and the accelerating digitization of human activity have made information exchange ever more frequent, and retrieving information effectively has become one of the difficulties people face. Search engine technology helped people find their way out of this maze of information, made effective retrieval possible, and greatly changed how people work and live. Research on search engines is no longer limited to the engines themselves; the behavior of web users is receiving growing attention, because a systematic, in-depth study of that behavior helps capture users' explicit needs directly and uncover their implicit ones. A second challenge that comes with networking and digitization is the processing of massive data. It strains the storage model of traditional database servers and places heavy demands on server CPU and I/O performance, and Hadoop/Hive is a well-suited set of methods and tools for this class of problem. Against this background, and drawing on an extensive literature review and a detailed analysis of how search engine logs are generated and of their common models, the thesis designs an analysis platform for massive search logs. The platform consists of four parts: a data collection and preprocessing module, a data storage module, a data analysis module, and a cluster management module. A set of algorithms based on user behavior pattern mining is designed to analyze and process the search engine logs, and the platform monitoring module implements monitoring and management of the cluster.
The work follows the data mining process, uses the massive data analysis tool Hadoop as the experimental platform, adopts the MapReduce (map/reduce) programming model, and processes the logs with Hive, which offers a simple and practical SQL-like interface, together with the HBase massive database. Mining patterns are distributed across the servers for association matching, and the partial results are then merged; this relieves the network and server performance bottleneck and reflects the advantages of asynchronous mining and asynchronous data reduction. Finally, the platform is validated in an experimental environment. The data are three search engine log samples provided by Sogou Labs (sample data, single-day data, and monthly data). Using these samples, user search behavior is analyzed in detail from several angles, including user query topics, user click counts and URL rank, and user session analysis. The platform's performance is also optimized, and system running times before and after optimization are compared. The experimental data indicate that the log analysis platform designed in the thesis has good stability and effectiveness.
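The map/reduce style of analysis described in the abstract can be illustrated with a minimal single-process sketch. It is not the thesis's implementation: the record layout, the `LOGS` sample, the function names, and the 1800-second session gap are all assumptions made for illustration, and a real deployment would run the equivalent logic as Hadoop jobs or Hive queries rather than in memory.

```python
from collections import defaultdict

# Hypothetical log records: (user_id, timestamp_seconds, query, clicked_url_rank).
# This field layout is an assumption for illustration, not the Sogou Labs format.
LOGS = [
    ("u1", 0, "hadoop", 1),
    ("u1", 60, "hive", 2),
    ("u1", 4000, "hadoop", 1),
    ("u2", 10, "hadoop", 3),
]

def map_phase(records):
    """Map step: emit one (query, 1) pair per log record."""
    for _user, _ts, query, _rank in records:
        yield query, 1

def reduce_phase(pairs):
    """Shuffle and reduce steps: group pairs by key and sum their counts."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

def count_sessions(records, gap=1800):
    """Count sessions per user; a new session starts after `gap` idle seconds.

    The 1800-second threshold is an assumed default, not a value from the thesis.
    """
    sessions = defaultdict(int)
    last_seen = {}
    # Sort by user and then by time so each user's records are scanned in order.
    for user, ts, _query, _rank in sorted(records, key=lambda r: (r[0], r[1])):
        if user not in last_seen or ts - last_seen[user] > gap:
            sessions[user] += 1
        last_seen[user] = ts
    return dict(sessions)

query_counts = reduce_phase(map_phase(LOGS))
session_counts = count_sessions(LOGS)
```

Here `query_counts` corresponds to a query-topic frequency table and `session_counts` to a simple session analysis. Splitting the work into a per-record map step and a per-key reduce step is what lets the same logic be spread across many servers and the partial results merged afterwards.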
[Degree-granting institution]: Dalian University of Technology
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.3


Article ID: 2314952


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2314952.html
