Research on User Behavior Feature Extraction and Modeling Based on Distributed Processing
Published: 2019-04-24 11:56
[Abstract]: With the rapid growth of the Internet industry and the continuing build-out and upgrading of operators' infrastructure and services, the data generated when users access the Internet is becoming increasingly rich. Advances in distributed data processing, combined with data mining and machine learning, have made feature extraction and behavior-preference analysis for Internet users an active research area. As the data pipe, operators hold network access traffic records across the entire network, and processing, mining and analyzing the DPI data they collect has great potential for building a comprehensive picture of user behavior preferences. Against this background, this thesis studies fixed-line broadband DPI data collected by a domestic operator in one city, and uses distributed processing and data mining methods to extract Internet user behavior features from users' traffic records. Traditional analyses of operator traffic mostly start from the traffic distribution of different service types and describe how users use different kinds of applications at different times of day. This thesis instead starts from the URLs in the DPI records and extracts behavior features from three angles: the categories of websites visited, sequential patterns of visits, and online product browsing, with modeling and experimental analysis for each. First, it uses web crawlers to build a website category tag library from navigation sites and directory sites, identifies the operating systems running on users' terminals, and studies the interest features of user groups based on website tags through statistical analysis and clustering. Second, it applies sequential pattern mining to users' visits across multiple websites network-wide, builds a sequence model of website visits, and finds the frequent sequential patterns in users' visits over the course of a full day. Finally, it studies the traffic generated by visits to e-commerce websites separately and, with the help of crawlers, refines user interest preferences to three levels: product, brand and category. It extracts features of users' online product-browsing preferences through frequent itemset mining and association analysis, and examines them through modeling and experiments.
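The sequence-mining step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's algorithm: it only counts contiguous host-to-host sequences by the number of users who exhibit them, which is a simplification of full sequential pattern mining. The record layout `(user_id, timestamp, url)`, the function names, and the `*.example` hosts are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def site_sequences(records):
    """Group (user_id, timestamp, url) records into per-user, time-ordered host lists."""
    by_user = defaultdict(list)
    for user_id, _timestamp, url in sorted(records, key=lambda r: (r[0], r[1])):
        host = urlsplit(url).hostname
        if host is None:
            continue  # skip records whose URL has no parsable host
        seq = by_user[user_id]
        # Collapse consecutive hits on the same host into a single visit.
        if not seq or seq[-1] != host:
            seq.append(host)
    return dict(by_user)


def frequent_sequences(sequences, length=2, min_support=2):
    """Return contiguous host sequences of `length` seen for at least `min_support` users."""
    support = Counter()
    for seq in sequences.values():
        # A set counts each pattern at most once per user (support = number of users).
        patterns = {tuple(seq[i:i + length]) for i in range(len(seq) - length + 1)}
        support.update(patterns)
    return {p: n for p, n in support.items() if n >= min_support}


# Hypothetical DPI-style records; hosts and timestamps are made up.
records = [
    ("u1", 1, "http://news.example/a"),
    ("u1", 2, "http://news.example/b"),
    ("u1", 3, "http://shop.example/item/1"),
    ("u2", 1, "http://news.example/c"),
    ("u2", 5, "http://shop.example/item/2"),
    ("u3", 2, "http://video.example/v"),
]
print(frequent_sequences(site_sequences(records)))
```

At the data volumes the abstract implies, the grouping and counting would presumably run as distributed jobs rather than in one process; the sketch shows only the per-user logic.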
[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree awarded]: 2016
[Classification number]: TP311.13; TP393.092
Article ID: 2464425
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2464425.html