Research on Topic-Enhanced Word and Sentence Embedding Models for Text Classification

Published: 2018-12-17 09:02

[Abstract]: In recent years, deep learning has received growing attention in natural language processing. Neural language models and word and sentence embedding models based on deep learning have been proposed in succession, and their high accuracy and low complexity have led to wide study and application in academia and industry. However, these embedding models rest on the distributional assumption of language models, and applying them directly to tasks such as text classification is inappropriate: text classification needs strongly polarized topic features, whereas the original embedding models only capture linguistic regularities and pay little attention to mining topic information. To make deep-learning-based word and sentence embedding models better suited to text classification, this thesis strengthens the original models with topic information and proposes topic-enhanced word and sentence embedding models, with the aim of achieving higher text classification performance. Words with opposite semantic polarity may share similar local contexts, and the original models train a word's distributed embedding from its local context alone, so they cannot capture semantics of opposite polarity. This thesis therefore proposes using high-order pure dependence to model the long-range context in the embedding models, strengthening the sentiment or topic information carried by the distributed embeddings of words and sentences and thereby improving performance on sentiment analysis and topic mining tasks. The high-order pure dependence method has a rigorous theoretical basis guaranteeing that the dependencies among words in the long-range context are "pure": the word dependency forms a complete semantic entity, and the joint probability distribution of the words cannot be factorized conditionally (nor, of course, unconditionally). This ensures that a high-order word dependency cannot be decomposed into the random co-occurrence of several lower-order dependencies, so high-order pure dependence can effectively model semantically rich, unambiguous topic information. The topic-enhanced word and sentence embedding models are applied to sentiment analysis and topic mining tasks on standard datasets, outperforming all existing models in each case. In a classification project on a Chinese news corpus, they are compared with bag-of-words and LDA topic model features using both linear and nonlinear classifiers, and the classification results are examined from multiple angles, showing that the topic-enhanced word and sentence embedding models are fully competitive with existing mainstream text feature extraction methods.
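The abstract's observation that words of opposite polarity can share similar local contexts can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the toy corpus, the helper names `context_vector` and `cosine`, and the use of simple count-based context vectors as a stand-in for trained embeddings.

```python
from collections import Counter
from math import sqrt


def context_vector(sentences, target, window=2):
    """Count the words that occur within `window` tokens of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            # Add every neighbour in the window except the target itself.
            counts.update(
                t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i
            )
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical toy corpus: "good" and "bad" appear in identical local contexts.
corpus = [
    "the plot was good and the acting was good",
    "the plot was bad and the acting was bad",
    "a good film overall",
    "a bad film overall",
]

good = context_vector(corpus, "good")
bad = context_vector(corpus, "bad")
similarity = cosine(good, bad)
print(f"cosine(good, bad) = {similarity:.3f}")
```

In this toy corpus the two context vectors are identical, so the cosine similarity is essentially 1 even though the words have opposite sentiment. A representation trained only on such local windows has no signal to separate them, which is the gap the abstract says long-range, high-order pure dependence is meant to close.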
[Degree-granting institution]: Tianjin University
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP391.1

Xing Ning (邢寧); Research on Topic-Enhanced Word and Sentence Embedding Models for Text Classification [D]; Tianjin University; 2016


Article ID: 2383992


Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2383992.html
