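A Parallelized Improvement of a Clustering Algorithm and Its Application to Microblog User Clustering
Published: 2018-04-12 19:40
Topics: clustering algorithm + parallelization. Reference: Shanghai Jiao Tong University, master's thesis, 2014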
[Abstract]: Cluster analysis is an important technique in data mining. The K-means algorithm is one of the most widely used clustering algorithms, with applications in computer vision, text mining, customer analysis, and other fields. K-means is simple and efficient, but it is sensitive to the initial cluster centers and requires the number of clusters K to be specified manually. The agglomerative fuzzy K-means algorithm is an improved variant of K-means: it is less sensitive to the initial points and can search for the number of clusters automatically through an agglomerative process. Its drawback is that it requires too many iterations.
This thesis first proposes an improved agglomerative fuzzy K-means algorithm that addresses this drawback. The improved algorithm replaces the random initial value selection used by agglomerative fuzzy K-means with an initial center selection method, which reduces the number of iterations required. It also adopts a distributed implementation based on the MapReduce framework, which increases its capacity to process large data sets, and it is implemented in the Hadoop and Mahout environment. The thesis then studies the methods and problems in cluster analysis of microblog users and introduces a Wikipedia-based topic analysis method for microblog text to extract user features. Finally, the improved algorithm is applied to cluster microblog users, and the clustering results are analyzed. The experimental results show that the improved algorithm reduces the number of iterations required and scales well on a cluster. Analysis of the microblog user clustering results shows that the algorithm produces suitable user clusters.
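To make the MapReduce formulation mentioned in the abstract concrete, here is a minimal single-machine sketch in Python of one iteration of plain K-means written as a map step and a reduce step. It is an illustration only: it is not the thesis's improved agglomerative fuzzy K-means, it does not reproduce its initial center selection method, and it does not use the Hadoop or Mahout APIs. The function names `map_assign`, `reduce_mean`, and `kmeans_iteration` are hypothetical.

```python
from collections import defaultdict
from math import dist


def map_assign(point, centers):
    """Map step: emit (index of the nearest center, (point, 1))."""
    nearest = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
    return nearest, (point, 1)


def reduce_mean(values):
    """Reduce step: average all points emitted under one center index."""
    total = [0.0] * len(values[0][0])
    count = 0
    for point, n in values:
        total = [t + x for t, x in zip(total, point)]
        count += n
    return tuple(t / count for t in total)


def kmeans_iteration(points, centers):
    """One plain K-means iteration expressed as map, shuffle, reduce."""
    groups = defaultdict(list)
    for point in points:
        # Map each point, then group the emitted values by key (the shuffle).
        key, value = map_assign(point, centers)
        groups[key].append(value)
    # Assumption of this sketch: a center that received no points is kept as-is.
    return [
        reduce_mean(groups[i]) if i in groups else tuple(centers[i])
        for i in range(len(centers))
    ]
```

In a real cluster the map calls would run in parallel over data splits and the grouping would be done by the framework's shuffle phase; the driver would repeat `kmeans_iteration` until the centers stop moving.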
[Degree-granting institution]: Shanghai Jiao Tong University
[Degree level]: Master's
[Year degree awarded]: 2014
[Classification number]: TP393.092; TP311.13
Article ID: 1741139
Article link: http://sikaile.net/guanlilunwen/ydhl/1741139.html