Research on User Behavior Feature Extraction and Modeling Based on Distributed Processing

Published: 2019-04-24 11:56

[Abstract]: With the rapid growth of the Internet industry and the continuous upgrading of operators' infrastructure and services, the data generated by users accessing the Internet is becoming increasingly rich. Advances in distributed data processing, combined with data mining and machine learning, have made feature extraction and behavior-preference analysis for Internet users an active research area. As the data pipe, operators hold network access traffic records covering the entire network, so processing, mining, and analyzing the DPI (deep packet inspection) data they collect has great potential for building a comprehensive picture of user behavior preferences. Against this background, this thesis studies fixed-line broadband DPI data collected by a domestic operator in one city, and uses distributed processing and data mining methods to extract Internet user behavior features from users' traffic records. Traditional analyses of operator traffic mostly start from the traffic distribution of different services and describe users' habits of using different kinds of applications at different times of day. This thesis instead starts from the URLs in DPI records and extracts user behavior features from three aspects: the categories of websites visited, sequential patterns of website visits, and online product browsing, with modeling and experimental analysis for each.
First, the thesis uses web crawlers to build a website classification tag library from navigation sites and web directory sites, identifies the operating systems running on users' terminals, and studies the interest features of user groups based on website tags through statistical analysis and clustering. Second, it applies sequential pattern mining to users' visits across multiple websites network-wide, builds a sequence model of website visits, and discovers frequent sequential patterns in the temporal order of users' website visits over the course of a day. Finally, it studies separately the traffic generated by visits to e-commerce websites and, with the help of web crawlers, refines users' interest preferences to three levels: product, brand, and category. Frequent itemset mining and association analysis are used to extract users' preferences in online product browsing, which are then examined through modeling and experiments.
[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year conferred]: 2016
[Classification number]: TP311.13;TP393.092


Article ID: 2464425
