User Feature Analysis Based on Spark
Published: 2018-10-29 16:22
[Abstract]: In recent years, the rapid development of the Internet has created a rich and convenient online environment, and people increasingly communicate, trade, and seek entertainment online. Massive volumes of user data now fill the Internet, more and more people recognize the value hidden in big data, and a worldwide wave of big data research has followed. Many researchers in China and abroad have turned to big data mining and have established a body of research on analyzing and mining user network behavior data.

A big data computing platform does not require ultra-high-performance servers. It can be built from ordinary PCs, and such a cluster often delivers better computing performance than an ultra-high-performance server. Distributed computing platforms represented by Spark have emerged and developed rapidly in recent years because they use an in-memory computing model and can provide massive storage and large-scale computing capacity. Handling the analysis and mining of very large data sets with a cloud computing approach can greatly improve computing speed and the efficiency of user classification. Combining a distributed computing platform such as Spark with classification mining of massive user data sets is therefore a research direction with both scientific value and application potential.

This thesis studies user feature analysis based on Spark and an improved TF-IDF algorithm. The main work is as follows:
1. Spark-related technology and the process of building a Spark cluster are studied. Using the naive Bayes classification algorithm together with Spark's in-memory computing framework, the videos each user watched and the number of views are analyzed to build classification models for user gender and age range. The architecture of the whole analysis system is also described.
2. The basic classification algorithm does not consider feature weights, so it cannot reflect the value of each feature. To address this, traditional TF-IDF weights are applied in further experiments, and the classification results are compared with those of the basic algorithm.
3. The shortcoming of traditional TF-IDF weighting is identified: it considers only the value of the feature itself and does not reflect the correlation between features and classes. To address this, a TFC-IDFC weighting method based on the correlation between features and classes is proposed, the process of optimizing the classification model is described in detail, and classification results are obtained experimentally.
4. The improved weighting method is compared with the basic classification algorithm and with traditional TF-IDF weighting. On two metrics, accuracy and F1 score, the results show that the TFC-IDFC weight, which accounts for the correlation between features and classes, gives the classification model better classification ability.
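As a rough illustration of the traditional TF-IDF weighting and the two evaluation metrics mentioned in the abstract, the following minimal Python sketch may help. It is not the thesis's implementation: the abstract does not give the TFC-IDFC formula, so only standard TF-IDF is shown, and the mapping of each user to a "document" and each video to a "term" is an assumption. The function names `tf_idf` and `accuracy_and_f1` and the sample data are hypothetical.

```python
import math
from collections import Counter


def tf_idf(view_counts):
    """Standard TF-IDF over per-user video view counts.

    view_counts: dict mapping user -> dict mapping video -> view count.
    Assumption: a user plays the role of a document, a video the role of a term.
    Returns a dict mapping user -> dict mapping video -> weight.
    """
    n_users = len(view_counts)
    # Document frequency: number of users who watched each video at least once.
    df = Counter()
    for counts in view_counts.values():
        df.update(v for v, c in counts.items() if c > 0)

    weights = {}
    for user, counts in view_counts.items():
        total = sum(counts.values())
        # tf = share of the user's views; idf = log(N / df), unsmoothed.
        weights[user] = {
            v: (c / total) * math.log(n_users / df[v])
            for v, c in counts.items()
            if c > 0
        }
    return weights


def accuracy_and_f1(y_true, y_pred, positive):
    """Accuracy over all labels and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


# Hypothetical sample data: two users, two videos.
weights = tf_idf({"u1": {"a": 3, "b": 1}, "u2": {"a": 2}})
acc, f1 = accuracy_and_f1(["M", "F", "M", "F"], ["M", "M", "M", "F"], positive="M")
```

In this sample, video `a` is watched by every user, so its weight is zero for both users. Like any weight built only from term and document frequencies, it carries no information about which class a user belongs to, which is the limitation that item 3 of the abstract attributes to traditional TF-IDF.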
[Degree-granting institution]: Tianjin Polytechnic University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13
Article ID: 2298200
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2298200.html