Research on Text Keyword Extraction Techniques and Their Applications
Published: 2018-06-05 21:16
Topic of this thesis: Uyghur + keyword extraction; Source: Xinjiang University, master's thesis, 2014
[Abstract]: With the arrival of the Internet era, online documents have proliferated and their number is still growing sharply every day. Faced with information resources on this scale, it is important to extract the key content of that information effectively. Keyword extraction is of great significance to research on automatic text summarization, text classification, text clustering, and information retrieval.
First, this thesis builds a text corpus for training and testing, 1000 documents in total (500 belong to the health category; the other 500 are non-health documents covering computing, education, economics, real estate, history, geography, and other topics). Second, it applies a keyword extraction method based on TextRank. The experimental results show that the highest document classification accuracy obtained with this method is 75.5%, and that increasing the number of keywords further makes no noticeable contribution to the classification results. To improve classification accuracy further, we propose a discriminative keyword extraction method based on TF/IDF, which obtains discriminative keywords by computing the between-class differences of the same word under different combinations of statistics. The experimental results show that the highest document classification accuracy obtained with the discriminative keyword extraction method is 98.5% (with 100 keywords). Although the TF/IDF-based discriminative method is very effective for document classification, it depends on collecting a large number of keywords and lacks a theoretical foundation, so it has certain limitations. The thesis therefore also adopts SDA (sparse discriminant analysis), a method commonly used in biotechnology. The experimental results show that the SDA-based method reaches a document classification accuracy of 98% (with 90 keywords), achieving relatively high classification performance on a small data set. To improve accuracy on small data sets further, we also study a keyword extraction method based on SparseSVM. With 10, 20, and 30 keywords, the SDA-based method obtains document classification accuracies of 88.5%, 90.5%, and 91.5% respectively, while the SparseSVM-based method obtains 90%, 92%, and 95.5%. These results indicate that the SparseSVM method is more effective on small data sets.
To verify the performance stability of the above techniques, the thesis finally presents the results of Uyghur text sentiment identification experiments based on the four methods above; the results are satisfactory.
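The TextRank baseline named in the abstract can be illustrated with a minimal sketch. The abstract does not describe how the thesis builds its graph, which window size or damping factor it uses, or how Uyghur text is tokenized, so the function name `textrank_keywords`, its parameters, and their defaults are assumptions made for illustration and are not the author's implementation.

```python
from collections import defaultdict


def textrank_keywords(tokens, top_k=5, window=2, damping=0.85, iterations=50):
    """Rank words by PageRank over an undirected co-occurrence graph.

    All parameter names and defaults are illustrative assumptions.
    """
    # Link each word to the distinct words among the next `window` tokens.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Start from uniform scores and apply the PageRank update a fixed
    # number of times (no convergence check, to keep the sketch short).
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]
```

Called on a tokenized document, the function returns the `top_k` highest-ranked words, which could then be used as features for a document classifier. Words with no co-occurring neighbor are left out of the graph and are never returned.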
[Degree-granting institution]: Xinjiang University
[Degree level]: Master's
[Year degree conferred]: 2014
[Classification number]: TP391.1
Article ID: 1983427
Link to this article: http://sikaile.net/jingjilunwen/fangdichanjingjilunwen/1983427.html