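A Parallelized Improvement of a Clustering Algorithm and Its Application to Microblog User Clustering
Published: 2018-04-12 19:40
Topics: clustering algorithm + parallelization; Source: Shanghai Jiao Tong University, master's thesis, 2014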
[Abstract]: Cluster analysis is an important technique in data mining. The K-means algorithm is one of the most widely used clustering algorithms and is applied in computer vision, text mining, customer analysis, and many other fields. K-means is simple and efficient, but it is sensitive to the initial cluster centers and requires the number of clusters K to be specified manually. The agglomerative fuzzy K-means algorithm is an improved variant of K-means: it is not easily affected by the initial points, and it can search for the number of clusters automatically through an agglomerative process. Its drawback is that it requires too many iterations.
This thesis first proposes an improved agglomerative fuzzy K-means algorithm to address that drawback. The improved algorithm replaces the random initial value selection used by agglomerative fuzzy K-means with an initial center selection method, which reduces the number of iterations required. The improved algorithm also adopts a distributed implementation based on the MapReduce framework, which increases its ability to process big data, and it is implemented in the Hadoop and Mahout environment. The thesis then studies the methods and problems involved in cluster analysis of microblog users and introduces a Wikipedia-based topic analysis method for microblog text to extract user features. Finally, the improved algorithm is applied to cluster microblog users, and the clustering results are analyzed. Experimental results show that the improved algorithm reduces the number of iterations needed at run time and scales well on a cluster. Analysis of the microblog user clustering results shows that the algorithm can produce suitable user clusters.
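As a rough illustration of how a single clustering iteration can be split into map and reduce steps, the Python sketch below runs one iteration of plain K-means in memory. It is not the thesis's improved agglomerative fuzzy K-means algorithm, and it does not use Hadoop or Mahout. The function names (`map_assign`, `combine`, `kmeans_iteration`) are hypothetical and chosen only for this example.

```python
from math import dist


def map_assign(point, centers):
    """Map step: emit (index of nearest center, (point, 1))."""
    nearest = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
    return nearest, (point, 1)


def combine(a, b):
    """Reduce step: add two partial (coordinate sum, count) pairs."""
    (sum_a, n_a), (sum_b, n_b) = a, b
    return tuple(x + y for x, y in zip(sum_a, sum_b)), n_a + n_b


def kmeans_iteration(points, centers):
    """Run one K-means iteration and return the updated centers.

    In a real MapReduce job the map calls would run on separate workers
    and the framework would group values by key; here both phases run
    in one loop to keep the sketch self-contained.
    """
    partial = {}
    for p in points:
        key, value = map_assign(p, centers)
        partial[key] = combine(partial[key], value) if key in partial else value

    # A center that received no points keeps its previous position.
    new_centers = list(centers)
    for key, (total, count) in partial.items():
        new_centers[key] = tuple(x / count for x in total)
    return new_centers
```

For example, calling `kmeans_iteration` with the points `(0.0, 0.0)`, `(0.0, 2.0)`, `(10.0, 10.0)`, `(10.0, 12.0)` and the initial centers `(1.0, 1.0)`, `(9.0, 9.0)` moves the centers to `(0.0, 1.0)` and `(10.0, 11.0)`. Because the reduce step only adds coordinate sums and counts, partial results from different workers can be merged in any order, which is what makes this kind of iteration straightforward to distribute.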
[Degree-granting institution]: Shanghai Jiao Tong University
[Degree level]: Master's
[Year degree awarded]: 2014
[Classification number]: TP393.092;TP311.13
Article ID: 1741139
Article link: http://sikaile.net/guanlilunwen/ydhl/1741139.html