User Feature Analysis Based on Spark

Published: 2018-10-29 16:22
[Abstract]: In recent years, the rapid growth of the Internet has created a rich and convenient online environment, and people increasingly communicate, trade, and seek entertainment online. Massive volumes of user network data now fill the Internet, more and more people recognize the value hidden in big data, and a worldwide wave of big data research has followed. This interest has drawn many researchers, in China and abroad, into big data mining and has produced a body of research on analyzing and mining users' network behavior data. A big data computing platform does not require ultra-high-performance servers; it can be built from ordinary PCs, and such a cluster often delivers better computing performance than a single ultra-high-performance server. Distributed computing platforms, represented by Spark, have emerged and developed rapidly in recent years, because this kind of platform uses an in-memory computing model and can provide massive storage and very high computing capacity. Handing the analysis and mining of very large data sets to a cloud computing solution can greatly improve computing speed and the effectiveness of user classification. Combining Spark-style distributed computing platforms with classification mining of massive user data sets is therefore a research direction with both scientific value and application potential. This thesis studies user feature analysis based on Spark and an improved TF-IDF algorithm. The main work is as follows:

1. Spark-related technologies and the process of building a Spark cluster are studied. Using the naive Bayes classification algorithm together with Spark's in-memory computing framework, the videos users watched and their view counts are analyzed to build classification models for user gender and age range. The architecture of the whole analysis system is also described.
2. The basic classification algorithm does not take feature weights into account, so it cannot reflect the value of each feature. To address this, traditional TF-IDF weights are used in a further experiment, and the classification results are compared with those of the basic classification algorithm.
3. The shortcoming of the traditional TF-IDF weighting method is identified: it considers only the value of the feature itself and does not capture the correlation between features and categories. To address this, a TFC-IDFC weighting method based on the correlation between features and categories is proposed, the process of optimizing the classification model is described in detail, and classification results are obtained experimentally.
4. The improved weighting method is compared with the basic classification algorithm and the traditional TF-IDF weighting method. Measured by accuracy and F1 score, the results show that the TFC-IDFC weights, which account for the correlation between features and categories, give the classification model better classification ability.
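The pipeline the abstract describes (weight each user's viewing counts with TF-IDF, then classify with naive Bayes) can be illustrated with a minimal sketch. This is plain Python using only the standard library, not the thesis's Spark implementation, and it uses the classic TF-IDF formula only: the abstract does not give the TFC-IDFC formula, so it is not reproduced here. The function names, the video IDs, and the class labels are all hypothetical and chosen for illustration.

```python
import math
from collections import defaultdict


def idf_weights(docs):
    """Classic IDF: log(N / df), where df counts users who watched the video."""
    n_docs = len(docs)
    df = defaultdict(int)
    for doc in docs:
        for term in doc:
            df[term] += 1
    return {term: math.log(n_docs / count) for term, count in df.items()}


def tf_idf(doc, idf):
    """Weight one user's raw view counts; videos unseen in training are dropped."""
    total = sum(doc.values())
    return {t: (c / total) * idf[t] for t, c in doc.items() if t in idf}


def train_nb(features, labels, alpha=1.0):
    """Multinomial naive Bayes with additive (Laplace) smoothing."""
    vocab = {t for f in features for t in f}
    class_docs = defaultdict(int)
    class_term = defaultdict(lambda: defaultdict(float))
    for f, y in zip(features, labels):
        class_docs[y] += 1
        for t, w in f.items():
            class_term[y][t] += w
    model = {}
    for y in class_docs:
        total = sum(class_term[y].values()) + alpha * len(vocab)
        log_prior = math.log(class_docs[y] / len(labels))
        log_like = {t: math.log((class_term[y][t] + alpha) / total) for t in vocab}
        model[y] = (log_prior, log_like)
    return model


def predict(model, f):
    """Return the class with the highest log posterior score."""
    def score(y):
        log_prior, log_like = model[y]
        return log_prior + sum(w * log_like[t] for t, w in f.items() if t in log_like)
    return max(model, key=score)


# Hypothetical toy data: per-user view counts keyed by made-up video IDs.
users = [
    {"sports_1": 5, "news_1": 1},
    {"sports_1": 3, "sports_2": 2},
    {"drama_1": 4, "news_1": 1},
    {"drama_1": 2, "drama_2": 3},
]
labels = ["class_a", "class_a", "class_b", "class_b"]

idf = idf_weights(users)
model = train_nb([tf_idf(u, idf) for u in users], labels)
print(predict(model, tf_idf({"sports_2": 3}, idf)))
```

Passing the raw count dictionaries straight to `train_nb` gives the unweighted baseline from item 1, so the same functions can be used to compare the two settings on a small scale. A Spark version would distribute the same counting steps across the cluster.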
[Degree-granting institution]: Tianjin Polytechnic University (天津工业大学)
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP311.13
