Design and Implementation of a Python-Based Uyghur Text Clustering System
[Abstract]: With the rapid growth of the Internet, the volume of online data keeps increasing. How to acquire, manage, and use this data quickly and effectively has become an important research topic in data mining. As an effective tool for managing and organizing text, text clustering has attracted growing attention. It can address these problems to a certain extent, saving time and improving efficiency, and it has important applications in information retrieval, search engines, digital library management, and related fields. This thesis first builds a large-scale text corpus based on the characteristics of the Uyghur language. To reduce the dimension of the feature space, a preliminary stop-word list is constructed from the accumulated text collection, and word stem extraction is applied. Experimental results show that the method can reduce the dimension of the original feature space by 23% and 25%. Second, the advantages and disadvantages of the K-means and GAAC (group-average agglomerative clustering) algorithms are studied in depth. An improved K-means algorithm is proposed to overcome both the instability of classical K-means, which depends heavily on the initial cluster centers, and the high time complexity of GAAC. Experimental results show that the proposed improved K-means algorithm is feasible and effective. Finally, a Python-based Uyghur text clustering system is implemented using these algorithms. The system consists of three main modules: a preprocessing module, a text representation module, and a clustering algorithm module. Comparative experiments on the developed system verify the accuracy, stability, and low time complexity of the improved K-means algorithm. The clustering results show that the system performs stably.
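The three modules named in the abstract (preprocessing, text representation, clustering) can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not describe the stop-word list, the stemming rules, or how the improved K-means chooses its initial centers. The sketch therefore assumes toy English stand-ins for Uyghur text (`STOP_WORDS`, `SUFFIXES`), smoothed TF-IDF weighting, and a deterministic farthest-first seeding as one plausible way to reduce dependence on random initial centers. All function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical placeholders: the thesis's Uyghur stop-word list and stemming
# rules are not reproduced here, so toy English stand-ins are used.
STOP_WORDS = {"the", "a", "of"}
SUFFIXES = ("ing", "s")


def stem(token):
    """Strip the first matching toy suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Preprocessing: whitespace tokenization, stop-word removal, stemming."""
    return [stem(t) for t in text.lower().split() if t not in STOP_WORDS]


def normalize(vec):
    """Scale a sparse vector (dict) to unit length; empty vectors stay empty."""
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}


def tfidf_vectors(docs):
    """Text representation: unit-length TF-IDF vectors with smoothed IDF."""
    tokenized = [preprocess(d) for d in docs]
    df = Counter(t for tokens in tokenized for t in set(tokens))
    n = len(docs)
    return [
        normalize({t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
                   for t, c in Counter(tokens).items()})
        for tokens in tokenized
    ]


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def farthest_first_seeds(vectors, k):
    """Deterministic seeding (an assumption, not the thesis's method):
    start from document 0, then repeatedly add the document that is least
    similar to the seeds chosen so far."""
    seeds = [0]
    while len(seeds) < k:
        rest = [i for i in range(len(vectors)) if i not in seeds]
        seeds.append(min(
            rest,
            key=lambda i: max(cosine(vectors[i], vectors[s]) for s in seeds),
        ))
    return [vectors[i] for i in seeds]


def kmeans(vectors, k, max_iter=20):
    """Clustering: K-means with cosine similarity; requires k <= len(vectors)."""
    centers = farthest_first_seeds(vectors, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [max(range(k), key=lambda c: cosine(v, centers[c]))
                      for v in vectors]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            total = {}
            for v, label in zip(vectors, labels):
                if label == c:
                    for t, w in v.items():
                        total[t] = total.get(t, 0.0) + w
            if total:  # keep the old center if a cluster became empty
                centers[c] = normalize(total)
    return labels


docs = [
    "cats chase mice",
    "the cats chasing mice",
    "stocks rise today",
    "rising stocks today",
]
labels = kmeans(tfidf_vectors(docs), k=2)
print(labels)
```

Because the seeding is deterministic, repeated runs on the same input give the same clusters, which is the kind of stability the abstract attributes to the improved algorithm. A real Uyghur pipeline would need a language-specific tokenizer, stop-word list, and stemmer in place of the toy ones.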
[Degree-granting institution]: Xinjiang University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.1
Article ID: 2458444
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2458444.html