A TextRank Keyword Extraction Method Fusing Multiple Features
Published: 2018-05-20 23:19
Topic: TextRank algorithm + keyword extraction; Source: Journal of Intelligence (《情報雜志》), 2017, No. 08
[Abstract]: [Purpose/Significance] Keyword extraction is widely used in natural language processing, and extracting keywords quickly and accurately has become a key problem in text processing. Many keyword extraction methods exist, but their accuracy still leaves room for improvement. This paper therefore proposes a keyword extraction method that combines the internal structure information of a single document with the importance of each word to both that document and the document set as a whole. [Method/Process] First, the importance of a word to the whole document set is computed from the word's average information entropy, and its importance within a single document is computed from its part-of-speech and position features. Next, the weights assigned to these three features are optimized by neural network training, which fuses the features. Finally, the combined weight of each word, computed from the three features, is used to modify both the initial weights of the word nodes and the probability transition matrix in the TextRank model, and keywords are then extracted by iteration. [Result/Conclusion] The method combines information from the whole document set with information from the single document itself, and its keyword extraction accuracy is clearly higher than that of both the traditional TextRank method and the TFIDF-TextRank method.
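The abstract does not give the paper's formulas, so the following is only a minimal sketch of the general idea, under stated assumptions: the three feature values are fused by a plain linear combination (standing in for the neural-network-trained weights), the initial node scores are set proportional to the fused weight, and the transition probability from a word to each neighbor is made proportional to that neighbor's fused weight. The names `fuse`, `cooccurrence_graph`, and `weighted_textrank`, the window size, and the damping factor 0.85 are all illustrative choices, not taken from the paper.

```python
from collections import defaultdict


def fuse(entropy, pos, position, coeffs=(1 / 3, 1 / 3, 1 / 3)):
    """Linear fusion of three feature values (assumption: the paper learns
    these coefficients with a neural network; equal values are placeholders)."""
    a, b, c = coeffs
    return a * entropy + b * pos + c * position


def cooccurrence_graph(words, window=3):
    """Undirected graph linking distinct words that occur within `window`
    consecutive positions of each other."""
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            if words[j] != w:
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)
    return neighbors


def weighted_textrank(words, weight, window=3, d=0.85, tol=1e-10, max_iter=500):
    """TextRank variant in which the fused word weights set both the initial
    scores and the transition probabilities (a sketch, not the paper's model)."""
    graph = cooccurrence_graph(words, window)
    nodes = sorted(graph)
    if not nodes:
        return {}
    n = len(nodes)
    total = sum(weight[w] for w in nodes)
    # Initial scores proportional to the fused weight.
    score = {w: weight[w] / total for w in nodes}
    # Normalizer for transitions leaving v: total weight of v's neighbors.
    out_sum = {v: sum(weight[u] for u in graph[v]) for v in nodes}
    for _ in range(max_iter):
        new = {}
        for u in nodes:
            # v passes its score to u in proportion to weight[u].
            incoming = sum(score[v] * weight[u] / out_sum[v] for v in graph[u])
            new[u] = (1 - d) / n + d * incoming
        delta = sum(abs(new[w] - score[w]) for w in nodes)
        score = new
        if delta < tol:
            break
    return score


# Usage: rank the words of a toy token sequence with made-up fused weights.
tokens = ["x", "y", "z", "x", "y", "z"]
weights = {"x": 3.0, "y": 1.0, "z": 1.0}
ranking = sorted(weighted_textrank(tokens, weights, window=2).items(),
                 key=lambda kv: kv[1], reverse=True)
print(ranking)
```

Because each row of the transition matrix is normalized by the neighbors' total weight, the scores remain a probability distribution across iterations, which makes results comparable between documents of different lengths.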
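[Author affiliations]: School of Computers, Guangdong University of Technology; School of Art and Design, Guangdong University of Technology
[Funding]: Guangdong Province and Ministry Industry-University-Research Special Fund, Enterprise Innovation Platform project "Research on a User Data Mining System for the Home Appliance Industry and Experiential Design Innovation Services" (No. 2013B090800042)
[Classification number]: TP391.1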
[Similar literature]
Related journal papers (top 1)
1. Xia Tian. Keyword extraction using word-position-weighted TextRank [J]. New Technology of Library and Information Service (《現代圖書情報技術》), 2013, No. 09.
Article ID: 1916683
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1916683.html