A Topic-Word Extraction Algorithm Based on an Improved TF-IDF Algorithm and Co-occurring Words

Published: 2018-03-27 20:33

Topic: co-occurring words. Entry point: mutual information. Source: Journal of Nanjing University (Natural Science), 2017, Issue 06

[Abstract]: Extracting the topic of a piece of information is a basic task for quickly locating user needs. Topic-word extraction faces three main problems: computing word weights, measuring the relationships between words, and the curse of dimensionality. To compute word weights, the method first uses mutual information to determine co-occurring word pairs and combines this nonlinearly with word frequency, part of speech, and word position information. It then builds a document-by-co-occurring-word matrix from the word weights and fits a Latent Semantic Analysis (LSA) model. Using the singular value decomposition (SVD) of the LSA model, the method maps this matrix into the latent semantic space, which both reduces the data dimensionality and yields a low-dimensional document similarity matrix. Finally, the document similarity matrix is clustered with k-means, and within each cluster the few co-occurring word pairs with the largest weights are selected as the topic words for that class of documents. Compared with topic-word extraction based on TF-IDF (Term Frequency-Inverse Document Frequency) and with extraction based on co-occurring words, the algorithm improves accuracy by 19% and 10%, respectively.
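The first two steps of the abstract (selecting co-occurring word pairs by mutual information, then building a document-by-pair matrix) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names `pmi_pairs` and `doc_pair_matrix`, the document-level probability estimates, the threshold of 0, and the toy corpus are all assumptions. In particular, the cell weight used here (PMI times the smaller of the two term frequencies) is only a placeholder for the paper's nonlinear combination with word frequency, part of speech, and position, whose formula the abstract does not give.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_pairs(docs, threshold=0.0):
    """Return {(w1, w2): pmi} for word pairs whose PMI exceeds `threshold`.

    Assumption: probabilities are estimated at document level, i.e.
    p(w) is the fraction of documents containing w and p(w1, w2) is the
    fraction of documents containing both words.
    """
    n_docs = len(docs)
    word_df = Counter()  # number of documents containing each word
    pair_df = Counter()  # number of documents containing each word pair
    for tokens in docs:
        vocab = sorted(set(tokens))
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))
    scores = {}
    for (w1, w2), n12 in pair_df.items():
        # PMI = log2( p(w1, w2) / (p(w1) * p(w2)) )
        pmi = math.log2(n12 * n_docs / (word_df[w1] * word_df[w2]))
        if pmi > threshold:
            scores[(w1, w2)] = pmi
    return scores


def doc_pair_matrix(docs, pairs):
    """Build a document x co-occurring-pair weight matrix.

    Placeholder weight (an assumption, not the paper's formula):
    PMI of the pair times the smaller of the two term frequencies.
    """
    columns = sorted(pairs)
    matrix = []
    for tokens in docs:
        tf = Counter(tokens)
        matrix.append([pairs[(a, b)] * min(tf[a], tf[b]) for (a, b) in columns])
    return columns, matrix


# Toy, pre-tokenized corpus (hypothetical data for illustration only).
docs = [
    ["topic", "word", "extraction", "topic", "word"],
    ["topic", "word", "weight"],
    ["matrix", "decomposition", "weight"],
    ["matrix", "decomposition", "cluster"],
]
pairs = pmi_pairs(docs)
columns, matrix = doc_pair_matrix(docs, pairs)
```

The resulting `matrix` plays the role of the document-by-co-occurring-word matrix that the abstract passes to LSA. A truncated SVD of it followed by k-means on the reduced document vectors would complete the pipeline, and the highest-weighted columns within each cluster would be the candidate topic words.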
[Author affiliations]: School of Computer Science and Technology, Shandong University of Finance and Economics; School of Software, Qufu Normal University; School of Computer Science, Shandong University
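[Funding]: Humanities and Social Sciences Research Project of the Ministry of Education (15YJAZH042); Key Project of Teaching Reform Research for Undergraduate Universities in Shandong Province (2015Z058)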
[Classification number]: TP391.1


Article ID: 1673142


Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1673142.html
