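Research on the K-medoids Clustering Algorithm and Its Application in Text Clustering

Published: 2018-05-01 16:37

Topic: K-medoids + text clustering; Source: Chongqing University of Technology, 2017 master's thesis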

[Abstract]: Text clustering divides a given collection of texts into clusters so that documents in the same cluster are highly similar while documents in different clusters are not. As an unsupervised machine learning method, clustering requires no training process and no manual labeling of documents in advance, which gives it a degree of automation and considerable flexibility. It has become an important means of summarizing, navigating, and organizing text information, and has attracted growing attention from researchers. Text clustering mainly represents documents with a vector space model based on TF-IDF statistics, and involves text preprocessing, Chinese word segmentation, feature extraction, feature weight calculation, the clustering algorithm itself, and clustering performance evaluation. Of these, feature weight calculation and the choice of clustering algorithm are two key steps in vector-space-model text clustering and directly affect the clustering result. Traditional feature weighting considers only term frequency and inverse document frequency and ignores the influence of a document's category on feature weights; moreover, practical applications may lack a standard labeled data set. To address this, this thesis proposes a new feature weighting method that combines category information with semantic contribution. The method first introduces a semantic contribution degree and combines it with fuzzy clustering to coarsely cluster the unlabeled text collection, yielding a collection with category information. It then introduces a category information entropy and combines it with the semantic contribution degree to improve the traditional TF-IDF weighting, producing a more effective weight calculation method.
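For orientation, the traditional TF-IDF weighting that the thesis sets out to improve can be sketched as follows. This is only a baseline sketch, assuming pre-segmented documents and the common tf × log(N/df) variant. It does not implement the semantic contribution degree or the category information entropy, whose formulas the abstract does not give. The names `tfidf_vectors` and `cosine` and the toy corpus are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy, already-segmented corpus (hypothetical data).
docs = [
    ["cluster", "text", "medoid"],
    ["cluster", "text", "weight"],
    ["stock", "market", "price"],
]
vecs = tfidf_vectors(docs)
```

In this toy corpus the first two documents share terms and so have positive cosine similarity, while the third shares none with them and has similarity zero. Note that category plays no role in these weights, which is the gap the thesis targets.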
Tests on the Chinese text classification corpus from Fudan University's open Chinese natural language processing platform show that the new feature weighting method outperforms the traditional one. K-medoids clustering is sensitive to the choice of initial center points, and a poor choice can trap the clustering in a local optimum. To address this, this thesis proposes a K-medoids clustering algorithm with radius-adaptive initial center selection. In each iteration the algorithm recomputes the radius from the distribution of the remaining sample points, so that the neighborhood radius and local variance of each sample point are computed dynamically; this yields better initial centers and, in turn, better clustering. The algorithm is tested on synthetic data sets with different proportions of random points and on UCI data sets of various sizes, and evaluated with five common clustering metrics; the results show a clear improvement over comparable algorithms. Finally, the improved text clustering algorithm is built into a text clustering system that demonstrates the whole pipeline, and the system's experimental results are compared.
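To make the role of the initial centers concrete, here is a minimal sketch of plain alternating K-medoids that takes the starting medoids as an argument. It is not the thesis's radius-adaptive selection, which the abstract describes only in outline; the function name `k_medoids`, its parameters, and the toy points are assumptions for illustration.

```python
import math

def k_medoids(points, init_medoids, dist, max_iter=100):
    """Alternating K-medoids over a list of points.

    init_medoids: indices into `points` used as the starting centers.
    Returns (medoid indices, cluster label for each point).
    """
    def assign(medoids):
        # Each point joins the cluster of its nearest medoid.
        return [
            min(range(len(medoids)), key=lambda c: dist(p, points[medoids[c]]))
            for p in points
        ]

    medoids = list(init_medoids)
    labels = assign(medoids)
    for _ in range(max_iter):
        new_medoids = []
        for c, current in enumerate(medoids):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                # Keep the old medoid if a cluster ends up empty.
                new_medoids.append(current)
                continue
            # New medoid: the member with the smallest total distance
            # to all other members of its cluster.
            new_medoids.append(min(
                members,
                key=lambda i: sum(dist(points[i], points[j]) for j in members),
            ))
        if new_medoids == medoids:
            break  # converged: no medoid changed
        medoids = new_medoids
        labels = assign(medoids)
    return medoids, labels

# Two well-separated groups on a line (hypothetical data),
# deliberately started from two medoids in the same group.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
medoids, labels = k_medoids(points, [0, 1], math.dist)
# medoids == [1, 4], labels == [0, 0, 0, 1, 1, 1]
```

On this easy data set the poor start is recovered from, but the procedure only ever moves to a nearby better configuration, so on harder data different values of `init_medoids` can converge to different results. An initialization strategy such as the one the thesis proposes would supply that argument.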
[Degree-granting institution]: Chongqing University of Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1



Article ID: 1830238


Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1830238.html
