A Short-Text Topic Model Algorithm Based on Word Triangles

Published: 2018-07-21 10:18

[Abstract]: With the rise of social networks and question-and-answer sites, short text has become the main form of information transmission on the web. Titles of traditional web pages, news items, and blog posts are all short texts, and short text is also the main data form on sites such as Weibo, Zhihu, Twitter, and Facebook. Mining topic information from short text therefore has a wide range of applications, such as detecting bursty topics on Weibo and using text topic information for personalized recommendation. Topic models are an effective way to mine latent topic information from text, but because the "document-word" data in short texts is too sparse, traditional topic models perform poorly on short-text topic mining. To address this limitation, this thesis proposes a new short-text topic model, the network Word Triangle Topic Model (WTTM), which overcomes the data sparsity problem and achieves good results in experiments. The main contributions are as follows:
(1) An ordinary word network cannot indicate where the subnetworks of different documents intersect. The word-network construction strategy is therefore improved: the set of indices of the documents in which a word pair occurs is used as the label of the corresponding edge, so that comparing the labels of two edges shows whether the two word pairs come from the same document, and hence whether they lie at an intersection of documents.
(2) Ordinary "word-word" co-occurrence carries only weak semantic association. A strategy is therefore proposed for finding specific word-triangle structures in the word network, structures that represent stronger topical association among words. The words in a word triangle have stronger semantic association and stronger topic concentration.
(3) Taking the word triangle as the basic unit of text topics, the network Word Triangle Topic Model (WTTM) is proposed and compared experimentally with LDA and BTM. The results show that WTTM has certain advantages over both LDA and BTM in short-text topic mining.
(4) The word-triangle structure is extended to word-clique structures, and the effect of the number of nodes in a clique on topic mining is analyzed. As the number of nodes in a clique increases, the topic aggregation degree measured for the word-clique topic model also improves to some extent.
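The network construction and triangle search described in the abstract can be illustrated with a minimal Python sketch. It assumes each document is already tokenized into a list of words. The function names `build_word_network` and `find_word_triangles` are hypothetical and not taken from the thesis. The abstract does not say which triangles the thesis selects, so the sketch simply enumerates every triangle and reports the documents shared by all three of its edges; an empty set marks a triangle formed only across different documents.

```python
from collections import defaultdict
from itertools import combinations


def build_word_network(docs):
    """Map each unordered word pair to the set of document indices
    in which the pair co-occurs (the edge label)."""
    edge_docs = defaultdict(set)
    for doc_id, tokens in enumerate(docs):
        # Sorting makes every edge key (u, v) satisfy u < v.
        for u, v in combinations(sorted(set(tokens)), 2):
            edge_docs[(u, v)].add(doc_id)
    return dict(edge_docs)


def find_word_triangles(edge_docs):
    """Return {(a, b, c): documents shared by all three edges}
    for every triangle in the network, with a < b < c."""
    neighbors = defaultdict(set)
    for u, v in edge_docs:
        neighbors[u].add(v)
        neighbors[v].add(u)

    triangles = {}
    for a, b in edge_docs:  # a < b by construction
        for c in neighbors[a] & neighbors[b]:
            if c > b:  # visit each triangle exactly once
                shared = (edge_docs[(a, b)]
                          & edge_docs[(a, c)]
                          & edge_docs[(b, c)])
                triangles[(a, b, c)] = shared
    return triangles


# Illustrative toy corpus (assumed data, not from the thesis).
docs = [
    ["topic", "model", "text"],  # document 0
    ["topic", "model"],          # document 1
    ["model", "short"],          # document 2
    ["short", "topic"],          # document 3
]
edges = build_word_network(docs)
for triangle, shared in sorted(find_word_triangles(edges).items()):
    print(triangle, shared)
```

In this toy corpus the triangle over "model", "text", and "topic" lies entirely inside document 0, while the triangle over "model", "short", and "topic" exists only because edges from different documents meet, which is the kind of distinction the document-index labels make visible.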
[Degree-granting institution]: Nanjing University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1

Cai Yang; A Short-Text Topic Model Algorithm Based on Word Triangles [D]; Nanjing University; 2017

Article ID: 2135203

Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2135203.html
