Research on Uygur Text Classification and System Development
Published: 2018-05-31 14:26
Topic: Uygur + text classification; Source: Xinjiang University, master's thesis, 2012
[Abstract]: With the rapid development of computer and network technology, the Internet has come into wide use. The rapid growth of Web information poses a severe challenge for information retrieval: the sheer volume of information makes it harder to find what we need. Text classification plays a key and effective role in organizing disordered information and has important applications in information retrieval, search engines, digital library management, and other fields.
Starting from the characteristics and writing rules of Uygur, this thesis builds a fairly large text corpus (20 categories, 300 texts per category). Based on an in-depth study of the characteristics and grammar rules of Uygur, a fairly complete stop-word list is built through extensive experiments and manual review. The effect of stemming on the accuracy and speed of Uygur text classification is analyzed. Because reducing the dimensionality of the vector space is an important problem in text classification, the thesis applies a stemming method based on the morphological rules of Uygur, which achieves good dimensionality reduction without affecting classification accuracy. With stemming, the dimensionality is reduced by about 25%.
For feature selection, the CHI (chi-square) statistic is used, and the effect of the number of selected features on the results is analyzed experimentally. The results show that selecting 3%-5% of the original features is, relatively speaking, the best choice. Extensive experiments also analyze the influence of Uygur spelling errors on text classification. The results show that spelling errors have little effect on classification accuracy, but they do have some effect on reducing the dimensionality of the vector space.
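The CHI feature selection step mentioned above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names `chi_square` and `select_features` are hypothetical, the toy English tokens stand in for stemmed Uygur words, and scoring each term by its maximum chi-square value over all classes is an assumption (the abstract does not say how per-class scores are combined). For a term t and class c, the statistic is N(AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D)), where A, B, C, D are the document counts of the 2x2 term/class contingency table.

```python
from collections import Counter, defaultdict


def chi_square(a, b, c, d):
    """Chi-square statistic for one (term, class) 2x2 contingency table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def select_features(docs, labels, keep_fraction=0.05):
    """Rank terms by their maximum chi-square over classes; keep the top fraction."""
    n_docs = len(docs)
    class_sizes = Counter(labels)
    df = Counter()                      # number of documents containing each term
    df_in_class = defaultdict(Counter)  # same count, split by class
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_in_class[label][term] += 1

    scores = {}
    for term, total in df.items():
        best = 0.0
        for label, size in class_sizes.items():
            a = df_in_class[label][term]
            b = total - a
            c = size - a
            d = n_docs - size - b
            best = max(best, chi_square(a, b, c, d))
        scores[term] = best

    n_keep = max(1, round(len(scores) * keep_fraction))
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:n_keep], scores
```

Calling `select_features(docs, labels, keep_fraction=0.05)` on tokenized documents returns the retained terms and the full score table; the 3%-5% range reported in the abstract corresponds to `keep_fraction` values between 0.03 and 0.05.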
The thesis also studies in some depth the classification algorithms widely used at home and abroad, namely KNN, naive Bayes (NB), and SVM, applies them to Uygur text, and analyzes the performance of each algorithm on Uygur text. Finally, combining the characteristics of Uygur with text classification techniques, an experimental platform for Uygur text classification (a Uygur text classification system) is built.
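Of the three classifiers named above, naive Bayes is the simplest to show in a few lines. The sketch below is a generic multinomial naive Bayes with add-one smoothing, written in plain Python under stated assumptions: the class name `NaiveBayes`, the smoothing choice, and the rule of ignoring unseen terms are illustrative and are not taken from the thesis, which gives no implementation details in its abstract. Inputs are lists of tokens, for example stemmed words with stop words removed.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.term_counts = defaultdict(Counter)  # term frequencies per class
        for tokens, label in zip(docs, labels):
            self.term_counts[label].update(tokens)
        self.vocab = {t for counts in self.term_counts.values() for t in counts}
        return self

    def predict(self, tokens):
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_logp = None, -math.inf
        for label, n_c in self.class_counts.items():
            counts = self.term_counts[label]
            total = sum(counts.values())
            logp = math.log(n_c / n_docs)        # log prior
            for t in tokens:
                if t in self.vocab:              # skip terms never seen in training
                    logp += math.log((counts[t] + 1) / (total + vocab_size))
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label
```

A typical use is `NaiveBayes().fit(train_docs, train_labels).predict(tokens)`. Restricting the training tokens to a selected feature subset shrinks `vocab`, which is where dimensionality reduction pays off in both memory and prediction time.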
[Degree-granting institution]: Xinjiang University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.1
Article ID: 1960077
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1960077.html