Design and Implementation of a Python-Based Uighur Text Clustering System

Published: 2019-04-15 19:52

[Abstract]: With the rapid growth of the Internet, the volume of online data keeps increasing. How to acquire, manage, and use this data quickly and effectively has become an important research topic in data mining. Text clustering, an effective tool for managing and organizing text, has attracted growing attention and research. It can address these problems to a considerable extent, saving time and improving efficiency, and it has important applications in information retrieval, search engines, digital library management, and other fields. This thesis first builds a fairly large text corpus based on the characteristics of the Uighur language. A preliminary stop-word list is constructed from the accumulated corpus, and stemming is applied to reduce the dimensionality of the feature space. Experimental results show that the stemming method reduces the original feature dimensionality by 23%-25%. Second, the strengths and weaknesses of the K-means and GAAC (group-average agglomerative clustering) algorithms are studied in depth. To address the instability of classical K-means, which depends heavily on the initial cluster centers, and the high time complexity of GAAC, an improved K-means algorithm is developed. Experimental results show that the proposed improved K-means algorithm is feasible and effective. Finally, these algorithms are used to implement a Python-based Uighur text clustering system. The system consists of three main modules: a preprocessing module, a text representation module, and a clustering algorithm module. Comparative experiments run on the developed system verify the accuracy, stability, and low time complexity of the improved K-means algorithm. The clustering results show that the system runs stably.
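The abstract names three modules and an improved K-means, but it does not say how the improvement picks its initial centers. The sketch below is therefore only an illustration under stated assumptions: it seeds standard K-means with the centroids that GAAC produces on a random sample of the documents (the Buckshot idea), which may or may not match the thesis's method. The stop-word set, the suffix list, and all function names are hypothetical placeholders; English stand-ins are used because the Uighur stop-word list and stemming rules are not given.

```python
import math
import random
from collections import Counter

# Placeholders only: the thesis's Uighur stop-word list and suffix rules
# are not given in the abstract, so English stand-ins are used here.
STOP_WORDS = {"the", "a", "of", "in", "and", "for"}
SUFFIXES = ("ing", "s")


def preprocess(text):
    """Preprocessing: tokenize, drop stop words, strip at most one suffix."""
    tokens = []
    for word in text.lower().split():
        if word in STOP_WORDS:
            continue
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break
        tokens.append(word)
    return tokens


def tfidf_vectors(token_lists):
    """Text representation: dense, unit-length TF-IDF vectors."""
    vocab = sorted({t for tokens in token_lists for t in tokens})
    df = Counter(t for tokens in token_lists for t in set(tokens))
    n_docs = len(token_lists)
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        vec = [tf[t] * (math.log(n_docs / df[t]) + 1.0) for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def gaac(vectors, k):
    """Group-average agglomerative clustering down to k clusters.

    Assumes unit-length vectors, so the average pairwise similarity of a
    merged cluster of size n with vector sum s is (s.s - n) / (n * (n - 1)).
    """
    clusters = [[v] for v in vectors]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                merged = clusters[i] + clusters[j]
                s = [sum(column) for column in zip(*merged)]
                n = len(merged)
                score = (sum(x * x for x in s) - n) / (n * (n - 1))
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        absorbed = clusters.pop(j)  # j > i, so index i is still valid
        clusters[i].extend(absorbed)
    return clusters


def kmeans(vectors, centers, max_iter=50):
    """Standard K-means (squared Euclidean distance) from given centers."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(v, centers[c])))
            for v in vectors
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = centroid(members)
    return labels


def cluster_texts(texts, k, sample_size=None, seed=0):
    """Run the three modules; seed K-means with GAAC on a random sample."""
    vectors = tfidf_vectors([preprocess(t) for t in texts])
    if sample_size is None:
        sample_size = int(math.sqrt(k * len(vectors)))  # Buckshot's choice
    sample_size = max(k, min(sample_size, len(vectors)))
    sample = random.Random(seed).sample(vectors, sample_size)
    centers = [centroid(cluster) for cluster in gaac(sample, k)]
    return kmeans(vectors, centers)


texts = [
    "the price of cotton in the market",
    "cotton prices and market trading",
    "market traders buying cotton",
    "the football team wins a match",
    "football teams playing a match",
    "a team training for the football match",
]
labels = cluster_texts(texts, k=2, sample_size=len(texts))
print(labels)
```

On this toy input the first three texts share one label and the last three share the other; which cluster is numbered 0 is arbitrary. Running GAAC only on a sample keeps its quadratic pairwise comparisons cheap, while the K-means pass over all documents no longer depends on randomly chosen starting centers. The function needs at least `k` input texts.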
[Degree-granting institution]: Xinjiang University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.1

Article ID: 2458444

Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2458444.html
