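A Topic Word Extraction Algorithm Based on an Improved TF-IDF Algorithm and Co-occurrence Words
Published: 2018-03-27 20:33
Topic: co-occurrence words. Focus: mutual information. Source: Journal of Nanjing University (Natural Sciences), 2017, No. 06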
[Abstract]: Extracting the topic of a piece of information is a basic task for quickly locating user needs. Topic word extraction faces three main problems: computing word weights, measuring the relationships between words, and the curse of dimensionality. To compute word weights, the method first uses mutual information to identify co-occurring word pairs and combines this nonlinearly with word frequency, part of speech, and word position information. It then builds a document-by-co-occurrence-word matrix from the word weights and establishes a Latent Semantic Analysis (LSA) model. Using the Singular Value Decomposition (SVD) of the LSA model, the method maps the document-by-co-occurrence-word matrix into the latent semantic space, which both reduces the dimensionality of the data and yields a low-dimensional document similarity matrix. Finally, the document similarity matrix is clustered with k-means, and within each cluster the few co-occurring word pairs with the largest weights are selected as the topic words for that class of documents. Compared with experiments that extract topic words based on TF-IDF (Term Frequency-Inverse Document Frequency) and on co-occurrence words, the accuracy of the algorithm improves by 19% and 10%, respectively.
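The pipeline outlined in the abstract can be illustrated with a small sketch. The abstract does not give the paper's weighting formula, so this is only an approximation under stated assumptions: the toy corpus, the function names, the document-level pointwise mutual information, and the pair weight (PMI scaled by the rarer word's frequency) are all hypothetical stand-ins, and part of speech and word position are ignored. The k-means step is left out to keep the example short.

```python
import math
from collections import Counter
from itertools import combinations

import numpy as np

# Toy corpus (hypothetical): each document is a list of already-segmented words.
docs = [
    ["stock", "market", "price", "stock", "market"],
    ["market", "price", "stock", "trade"],
    ["football", "match", "goal", "football", "match"],
    ["match", "goal", "team", "football"],
]


def pmi_pairs(docs, threshold=0.0):
    """Return {pair: PMI} for word pairs co-occurring in a document with PMI > threshold."""
    n = len(docs)
    word_df = Counter()  # number of documents containing each word
    pair_df = Counter()  # number of documents containing both words of a pair
    for doc in docs:
        vocab = sorted(set(doc))
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))
    scores = {}
    for (a, b), c in pair_df.items():
        pmi = math.log2((c / n) / ((word_df[a] / n) * (word_df[b] / n)))
        if pmi > threshold:
            scores[(a, b)] = pmi
    return scores


def doc_pair_matrix(docs, scores):
    """Build a matrix with one row per document and one column per co-occurring pair."""
    pairs = sorted(scores)
    m = np.zeros((len(docs), len(pairs)))
    for i, doc in enumerate(docs):
        tf = Counter(doc)
        for j, (a, b) in enumerate(pairs):
            if a in tf and b in tf:
                # Assumed weight, not the paper's: PMI times the rarer word's frequency.
                m[i, j] = scores[(a, b)] * min(tf[a], tf[b])
    return m, pairs


def lsa_similarity(m, k=2):
    """Project documents into a k-dimensional latent space via SVD; return cosine similarities."""
    u, s, _ = np.linalg.svd(m, full_matrices=False)
    z = u[:, :k] * s[:k]  # document coordinates in the latent space
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    z = z / np.where(norms == 0, 1.0, norms)
    return z @ z.T


scores = pmi_pairs(docs)
matrix, pairs = doc_pair_matrix(docs, scores)
sim = lsa_similarity(matrix, k=2)
print(np.round(sim, 2))
```

On this toy corpus the two finance documents end up close to each other in the latent space and far from the two sports documents. Clustering the rows of `sim` (for example with scikit-learn's `KMeans`) and then picking the highest-weighted pairs inside each cluster would complete the procedure the abstract describes.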
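[Affiliations]: School of Computer Science and Technology, Shandong University of Finance and Economics; School of Software, Qufu Normal University; School of Computer Science, Shandong University
[Funding]: Humanities and Social Sciences Research Project of the Ministry of Education (15YJAZH042); Key Project of Teaching Reform Research for Undergraduate Universities in Shandong Province (2015Z058)
[Classification number]: TP391.1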
Article ID: 1673142
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1673142.html