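Data Mining and User Behavior Analysis Based on Massive Query Logs
Topic: massive logs. Focus: data mining. Source: Beijing University of Posts and Telecommunications, 2013 master's thesis. Document type: degree thesis.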
[Abstract]: With the rapid development of the Internet and search engine technology, the amount of information on the Web keeps growing, and search engines have become the first choice of most users for obtaining online information. Interaction between users and search engines produces massive query logs, and these logs keep growing. Because the logs contain a great deal of user-related information, they have become a key research object for many companies seeking to better understand their users and attract more of them. Using distributed technology to store and process massive logs makes research on query logs more convenient. Major Internet companies now attach increasing importance to their own query logs, hoping that timely and accurate analysis and mining will reveal the user behavior characteristics hidden in the logs, thereby improving user satisfaction with the search engine and strengthening the company's market competitiveness.

This thesis takes massive query logs as its processing object. The main work is as follows:

(1) Research on log preprocessing techniques. The thesis studies data cleaning, user identification, session identification, path completion, and transaction identification, together with the related algorithms. It combines distributed technology with these algorithms to implement a Hadoop-based log preprocessing pipeline in preparation for the subsequent data mining.

(2) Design of a user log mining system. Given the massive volume of logs, traditional data storage and computation methods are hard to apply to search engine user behavior analysis. To address this, the thesis proposes mining massive logs with the MapReduce programming framework. User behavior is modeled from the query terms, clicked URLs, and user-identifying IDs recorded in the logs, and is represented as feature vectors. The thesis gives a formula for computing the similarity between different users, analyzes the feasibility of distributing the K-means algorithm, and gives detailed steps for the distributed implementation. Experiments show that the algorithm clusters users effectively and performs well when processing massive data.

(3) Analysis of user behavior. The analysis covers log volume, user volume, and the relationship between the two; the number, length, and character composition of user query terms and the most common query terms; the total number of clicked URLs, URL depth, and the most common URLs; and the relationship between the order of the results returned by the search engine and the order in which users click them. This multi-angle analysis of the logs yields the characteristics of user behavior, providing a reference for improving the interaction between search engines and users.
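The abstract states that K-means can be distributed with MapReduce but does not reproduce the steps. The following is a minimal single-process sketch of how one K-means iteration is commonly split into a map step and a reduce step. It is an illustration under stated assumptions, not the thesis's implementation: the function names (`kmeans_map`, `kmeans_reduce`, `kmeans_iteration`, `kmeans`) are hypothetical, user feature vectors are assumed to be plain lists of numbers, and squared Euclidean distance stands in for the thesis's user similarity formula, which the abstract does not give.

```python
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance (assumed stand-in for the thesis's similarity)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_map(vector, centroids):
    """Map step: emit (index of the nearest centroid, vector)."""
    nearest = min(range(len(centroids)),
                  key=lambda i: sq_dist(vector, centroids[i]))
    return nearest, vector


def kmeans_reduce(vectors):
    """Reduce step: average all vectors assigned to one centroid."""
    total = [0.0] * len(vectors[0])
    for vec in vectors:
        total = [t + v for t, v in zip(total, vec)]
    return [t / len(vectors) for t in total]


def kmeans_iteration(vectors, centroids):
    """One map/shuffle/reduce round, simulated in memory."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for vec in vectors:
        key, value = kmeans_map(vec, centroids)
        groups[key].append(value)
    updated = list(centroids)  # a centroid with no members keeps its position
    for key, members in groups.items():
        updated[key] = kmeans_reduce(members)
    return updated


def kmeans(vectors, centroids, max_iter=20):
    """Repeat iterations until the centroids stop changing."""
    for _ in range(max_iter):
        updated = kmeans_iteration(vectors, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


if __name__ == "__main__":
    users = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
    print(kmeans(users, [[0.0, 0.0], [10.0, 10.0]]))
```

In an actual Hadoop job the map and reduce functions would run on different nodes, the current centroids would have to be shared with every mapper at the start of each iteration, and the in-memory dictionary used here would be replaced by the framework's shuffle.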
[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.3; TP311.13
Article ID: 1648770
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1648770.html