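Uyghur Keyword Extraction and Text Classification Based on the TextRank Algorithm and Mutual Information Similarity
Published: 2018-04-25 16:46
Topics: Uyghur + text classification; Source: Computer Science (《計算機科學》), 2016, Issue 12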
[Abstract]: To address the problem of classifying Uyghur text, a method for Uyghur keyword extraction and text classification based on the TextRank algorithm and mutual information similarity is proposed. First, the input text is preprocessed to remove non-Uyghur characters and stop words. Then, a TextRank algorithm weighted by word semantic similarity, word position, and word frequency importance is used to extract the set of keywords from the text. Finally, a mutual information similarity measure is used to compute the similarity between the keyword set of the input text and the keyword set of each class, which yields the classification. Experimental results show that the scheme extracts highly discriminative keywords; with a keyword set size of 1250, the average classification rate reaches 91.2%.
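The three stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact weighting or similarity formulas, so several parts are assumptions. Co-occurrence counts stand in for the semantic-similarity edge weights, the node prior combining frequency and first position is hypothetical, and the classifier sums document-level pointwise mutual information (PMI) between keywords and classes as one plausible reading of "mutual information similarity". All function names are illustrative.

```python
import math
import re
from collections import Counter, defaultdict

# Assumption: Uyghur tokens are runs of characters in the Arabic Unicode block.
UYGHUR_TOKEN = re.compile(r"[\u0600-\u06FF]+")

def preprocess(text, stop_words):
    """Keep only Arabic-script tokens and drop stop words."""
    return [t for t in UYGHUR_TOKEN.findall(text) if t not in stop_words]

def textrank_keywords(tokens, top_k=5, window=3, damping=0.85, iters=50):
    """Rank words with a co-occurrence TextRank biased by position and frequency."""
    if not tokens:
        return []
    freq = Counter(tokens)
    first_pos = {}
    for i, t in enumerate(tokens):
        first_pos.setdefault(t, i)
    # Undirected graph; edge weight = co-occurrence count within a span of
    # `window` tokens (a stand-in for the paper's semantic similarity).
    edges = defaultdict(lambda: defaultdict(float))
    for i, t in enumerate(tokens):
        for u in tokens[i + 1 : i + window]:
            if u != t:
                edges[t][u] += 1.0
                edges[u][t] += 1.0
    # Hypothetical node prior: frequent words and early words get more weight.
    n = len(tokens)
    prior = {w: freq[w] / n + (1.0 - first_pos[w] / n) for w in freq}
    total = sum(prior.values())
    prior = {w: p / total for w, p in prior.items()}
    score = dict(prior)
    for _ in range(iters):
        new = {}
        for w in freq:
            # Each neighbour v passes on its score in proportion to edge weight.
            incoming = sum(
                score[v] * edges[v][w] / sum(edges[v].values())
                for v in edges[w]
            )
            new[w] = (1 - damping) * prior[w] + damping * incoming
        score = new
    return sorted(score, key=lambda w: (-score[w], w))[:top_k]

def pmi_table(labelled_docs):
    """labelled_docs: list of (label, keywords). Returns {(word, label): PMI}."""
    n = len(labelled_docs)
    class_count = Counter(label for label, _ in labelled_docs)
    word_count = Counter()
    joint = Counter()
    for label, keywords in labelled_docs:
        for w in set(keywords):
            word_count[w] += 1
            joint[(w, label)] += 1
    return {
        (w, c): math.log(joint[(w, c)] * n / (word_count[w] * class_count[c]))
        for (w, c) in joint
    }

def classify(keywords, pmi, labels):
    """Pick the class whose summed keyword PMI is largest (unseen pairs add 0)."""
    scores = {c: sum(pmi.get((w, c), 0.0) for w in keywords) for c in labels}
    return max(sorted(scores), key=scores.get)
```

In use, `preprocess` and `textrank_keywords` would be run on every training document to build the labelled keyword lists passed to `pmi_table`, and then on each new document before calling `classify`. The keyword set size (`top_k`) is the parameter the abstract reports tuning.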
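[Author affiliations]: Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Xinjiang Key Laboratory of Multilingual Information Technology
[Funding]: Supported by the Open Project of the Xinjiang Key Laboratory of Multilingual Information Technology (XJDX0905-2013-06)
[Classification number]: TP391.1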
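[Related literature]
China Journal Full-text Database: 1 entry
1 Xia Tian; Study on keyword extraction using word-position-weighted TextRank [J]; New Technology of Library and Information Service (現代圖書情報技術); 2013, Issue 09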
Article ID: 1802151
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1802151.html