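Design of a Distributed Network User Behavior Analysis System
Published: 2018-03-30 20:12
Topic: user behavior analysis; entry point: data mining; source: Beijing University of Posts and Telecommunications, 2014 master's thesis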
[Abstract]: With the spread of mobile terminals, the population of network users has once again grown rapidly. Frequent access behavior has accumulated massive amounts of data containing useful information that can guide network services and network security, or help website builders with site structure, so the analysis of network users' access behavior has become a research hotspot. Because individual users differ, the behavior of a single user often shows no characteristic pattern, whereas such patterns emerge once user groups are considered. If the characteristic patterns of user groups can be captured accurately and the users divided into groups accordingly, Internet application and service providers can offer personalized services and high value-added business recommendations tailored to each group, maximizing the benefit to both network users and service providers.
This thesis designs a high-performance distributed network user behavior analysis system for dividing users into groups.
First, web page content is crawled, page keywords are extracted with TF-IDF over the segmented text, and page vectors are built. At the same time, the context information of user visits is obtained from the web server, and a data preprocessing module removes redundancy to produce a data source that is unique and low in redundancy.
Second, clustering methods from data mining are studied in detail and improved, and the algorithm is parallelized on MapReduce, the Hadoop distributed processing framework, making it better suited to the massive data volumes found in practice. The performance gain from MapReduce parallelization is verified.
Then, the framework of the distributed user behavior analysis system is designed, comprising a data collection module, a data preprocessing module, a text clustering module, and a knowledge result set module, and the main functions of each module are implemented. The system is tested and evaluated against existing system performance test metrics. Finally, the thesis summarizes its features and shortcomings and gives an outlook on future work.
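The pipeline described in the abstract (page keywords weighted by TF-IDF, page vectors, then a clustering step parallelized on MapReduce) can be illustrated with a small single-process sketch. This is not the thesis's implementation: the abstract does not name the clustering algorithm, so k-means is used here as an assumed stand-in, the pages are assumed to be already segmented into tokens, and every function name below (`tfidf_vectors`, `kmeans_map`, `kmeans_reduce`, `kmeans_iteration`) is hypothetical. The map and reduce functions only mimic the shape of a MapReduce job in plain Python; no Hadoop API is involved.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(token_docs):
    """Turn tokenized pages into TF-IDF vectors over a shared, sorted vocabulary."""
    vocab = sorted({t for doc in token_docs for t in doc})
    n_docs = len(token_docs)
    # Document frequency: the number of pages in which each term occurs.
    df = Counter(t for doc in token_docs for t in set(doc))
    vectors = []
    for doc in token_docs:
        tf = Counter(doc)
        vectors.append([
            (tf[t] / len(doc)) * math.log(n_docs / df[t]) if doc else 0.0
            for t in vocab
        ])
    return vocab, vectors


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_map(vector, centroids):
    """Map step: emit (index of nearest centroid, (vector, 1))."""
    nearest = min(range(len(centroids)),
                  key=lambda i: squared_distance(vector, centroids[i]))
    return nearest, (vector, 1)


def kmeans_reduce(values):
    """Reduce step: average all vectors that were assigned to one centroid."""
    total = [0.0] * len(values[0][0])
    count = 0
    for vector, n in values:
        total = [s + x for s, x in zip(total, vector)]
        count += n
    return [s / count for s in total]


def kmeans_iteration(vectors, centroids):
    """One k-means pass written as map, shuffle (group by key), reduce."""
    groups = defaultdict(list)
    for vector in vectors:
        key, value = kmeans_map(vector, centroids)
        groups[key].append(value)
    # A centroid that received no vectors keeps its previous position.
    return [kmeans_reduce(groups[i]) if i in groups else centroids[i]
            for i in range(len(centroids))]


# Toy data (invented for illustration): four already-segmented pages, two topics.
pages = [["hadoop", "cluster", "mapreduce"],
         ["hadoop", "mapreduce", "job"],
         ["football", "match", "goal"],
         ["football", "goal", "league"]]
vocab, vectors = tfidf_vectors(pages)
centroids = [vectors[0], vectors[2]]  # assumed initial centroids
for _ in range(5):
    centroids = kmeans_iteration(vectors, centroids)
labels = [kmeans_map(v, centroids)[0] for v in vectors]
print(labels)  # [0, 0, 1, 1]
```

In a real Hadoop job the map calls would run in parallel across input splits and the framework would perform the grouping by key; the per-centroid averaging is what makes the step easy to distribute, since each reducer needs only the vectors assigned to its own centroid.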
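[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2014
[Classification number]: TP393.092;TP391.1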
Article ID: 1687468
Article link: http://sikaile.net/guanlilunwen/ydhl/1687468.html