Research on Lexical Chain Construction for News Texts Based on the Wikipedia Corpus
Topics: natural language processing + Wikipedia. Source: master's thesis, Kunming University of Science and Technology, 2017
[Abstract]: An efficient text processing method can process news texts quickly and yield the text categories, keywords, and deeper semantic content and semantic relations that readers need. Lexical chain construction is important for this kind of rapid processing. Compared with traditional keyword extraction methods based on frequency or machine learning, lexical chains draw on online corpora and so incorporate human knowledge. Because online corpus resources are updated frequently and have a well-organized classification structure, analyzing news texts with lexical chains gives better results than other methods. Existing methods for building Chinese lexical chains do not resolve word-sense ambiguity well, and the chains they produce often fail to express the semantic clustering of the text correctly, which in turn lowers the quality of the extracted keywords. To help readers grasp the gist of a news text and identify its discourse structure more quickly, this thesis studies the following:
(1) Two features of Wikipedia are used, its category structure graph and its document link graph. The depth-weighted path length (DPL) algorithm uses the path information of candidate words to compute relations based on node depth, and the explicit semantic analysis (ESA) algorithm uses document category information and interpretation-based text vectors to compute word-to-word relatedness. Together these give an initial lexical chain. Candidate-word weights are also taken into account to improve the keyword extraction algorithm, and the initial chain is optimized using five feature items of news text. The construction algorithm is tested on a corpus of more than 1,500 news texts crawled from portal websites, and the extracted keywords are compared with those from other keyword extraction methods. The results show that keywords extracted with the proposed lexical chain method are of higher quality.
(2) The subordination (hypernym-hyponym) relations of the Wikipedia corpus, the structural properties of the resource itself, and its link recurrence properties are combined with the classic MGKM2003 method to build the MGKM-WIKI disambiguation algorithm, which further disambiguates the initial lexical chain. With Senseval-3 as the candidate-word dataset for the word sense disambiguation system, MGKM-WIKI is compared with other supervised and unsupervised disambiguation algorithms and achieves good results.
(3) Building on the lexical chain construction above, alignment techniques are used to construct lexical chains for Vietnamese news texts, and a large number of crawled Vietnamese news texts are used to test the construction method.
(4) A prototype system is designed from the work above. It builds lexical chains for Chinese and Vietnamese news texts so that readers can quickly grasp the gist of a news item and identify its discourse structure.
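The initial chain-building step in item (1) can be illustrated with a small sketch. It is not the thesis's implementation: it assumes that each word already has an ESA-style vector of Wikipedia concept weights, scores relatedness by cosine similarity, and groups words greedily. The concept vectors, the `threshold` value, and the function names `esa_relatedness` and `build_chains` are hypothetical, and the DPL score, candidate-word weights, and the five news-text features are left out.

```python
import math

# Toy ESA-style concept vectors: word -> {Wikipedia concept: weight}.
# Every word, concept, and weight here is invented for illustration only.
CONCEPT_VECTORS = {
    "bank": {"Finance": 0.9, "River": 0.3},
    "loan": {"Finance": 0.8},
    "interest": {"Finance": 0.7, "Psychology": 0.2},
    "river": {"River": 0.9, "Geography": 0.5},
    "flood": {"River": 0.6, "Geography": 0.4, "Weather": 0.5},
}


def esa_relatedness(a, b, vectors=CONCEPT_VECTORS):
    """Cosine similarity between the sparse concept vectors of two words."""
    va, vb = vectors.get(a, {}), vectors.get(b, {})
    dot = sum(w * vb.get(concept, 0.0) for concept, w in va.items())
    norm_a = math.sqrt(sum(w * w for w in va.values()))
    norm_b = math.sqrt(sum(w * w for w in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # unknown word: treat as unrelated
    return dot / (norm_a * norm_b)


def build_chains(words, threshold=0.5, relatedness=esa_relatedness):
    """Greedy chaining (an assumed strategy, not taken from the thesis).

    Each word joins the existing chain with the highest average relatedness
    to its members, provided that average reaches `threshold`; otherwise the
    word starts a new chain.
    """
    chains = []
    for word in words:
        best_chain, best_score = None, threshold
        for chain in chains:
            score = sum(relatedness(word, m) for m in chain) / len(chain)
            if score >= best_score:
                best_chain, best_score = chain, score
        if best_chain is None:
            chains.append([word])
        else:
            best_chain.append(word)
    return chains


if __name__ == "__main__":
    candidates = ["bank", "loan", "river", "flood", "interest"]
    for chain in build_chains(candidates):
        print(chain)
```

With these toy vectors the finance-related words end up in one chain and the river-related words in another. A real system would derive the vectors from Wikipedia article text and would still need the disambiguation step described in item (2), since an ambiguous word such as "bank" is assigned to a chain here purely by its mixed vector.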
[Degree-granting institution]: Kunming University of Science and Technology
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1