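Wikipedia-Based Short-Text Relatedness Computation
Keywords: Wikipedia-based short-text relatedness computation. Source: Taiyuan University of Technology, 2017 master's thesis. Document type: degree thesis.
Related topics: Wikipedia; relatedness; short text; semantic relatedness; association rules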
[Abstract]: With the development of mobile communication technology and social media, information in the form of Chinese short texts has spread into every area of society and daily life. This growth in the volume of information carries considerable practical value, and how to mine the deeper value of these texts has become a widely discussed question, which in turn has made natural language processing an active research area. Semantic relatedness computation is a basic task in natural language processing and is widely applied in query expansion, word sense disambiguation, machine translation, knowledge extraction, automatic error correction, and other areas. Short text, however, is an emerging source of textual information with few words, weak conceptual signals, and ambiguous features, so effective features are hard to extract from it. Because a short text expresses only limited information, a large amount of background knowledge is needed to expand its features. Wikipedia, currently the world's largest multilingual open online encyclopedia, is favored by many researchers, so this thesis uses Chinese Wikipedia as its external corpus. Wikipedia's structural and semantic information also provides a basis for the semantic analysis of short texts.
This thesis treats short text at two levels: words and sentences. It first proposes a Wikipedia-based method for computing the relatedness between words. The method combines Wikipedia's structural and semantic information. Wikipedia's main structures include the category hierarchy, the links in article summaries, the links in article bodies, and redirect and disambiguation pages. On this basis the thesis proposes a word relatedness measure that combines category relatedness with link relatedness. To capture deeper semantic information about words, it also proposes a method that computes word relatedness using association rules.
Building on this, the thesis proposes a method for computing the relatedness between sentences from three angles: relatedness of sentence structure, relatedness based on word pairs, and a clustering-based relatedness in which clustering is used to weight topic words. Sentence structure in turn has two aspects: word form and word order. Word-form relatedness is computed mainly from the frequency of co-occurring words, and word-order relatedness is computed from the number of inversions. Word-pair relatedness mainly considers the deep semantic information of the words in a sentence and agrees better with human judgment. Clustering groups semantically related words or texts into a class or cluster; this thesis applies it to sentence relatedness to improve accuracy.
With the methods in place, the experiments were designed as follows. First, the Chinese Wikipedia corpus was downloaded and processed. Second, word relatedness and sentence relatedness were computed. Finally, the results were compared with human-annotated sets. The word relatedness test sets were a manually translated Word Similarity-353 test set and the Words-240 set compiled by the National University of Defense Technology. The sentence relatedness test set was the evaluation data set of the short-text semantic relatedness task provided by the Chinese database and World Wide Web knowledge extraction competition. Measured by the Spearman coefficient and accuracy, the proposed method improves the Spearman coefficient for word relatedness by 2.8% over the traditional algorithm and reaches 73.3% accuracy for sentence relatedness. These results support the soundness and practicality of the proposed method.
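The sentence-structure component described in the abstract (word form plus word order, with word order measured by inversions) can be illustrated with a small sketch. The abstract does not give the thesis's formulas, so everything below is an assumption made for illustration: the Dice-style overlap used for word form, the normalisation of the inversion count, the function names, and the `form_weight` parameter are all hypothetical and are not taken from the thesis.

```python
from itertools import combinations


def word_form_similarity(a, b):
    """Dice-style overlap of the distinct tokens in two tokenised sentences.

    Assumption: a stand-in for the thesis's co-occurrence-based measure.
    """
    if not a or not b:
        return 0.0
    common = set(a) & set(b)
    return 2 * len(common) / (len(set(a)) + len(set(b)))


def word_order_similarity(a, b):
    """1 minus the normalised inversion count of the shared tokens."""
    # Keep only tokens that occur exactly once in both sentences,
    # so each shared token has a single, unambiguous position.
    shared = [w for w in a if a.count(w) == 1 and b.count(w) == 1]
    n = len(shared)
    if n < 2:
        # One shared token: order is trivially preserved. None: no evidence.
        return 1.0 if n == 1 else 0.0
    # Positions in b of the shared tokens, listed in a's order.
    positions = [b.index(w) for w in shared]
    # An inversion is a pair of shared tokens whose order differs in b.
    inversions = sum(
        1 for i, j in combinations(range(n), 2) if positions[i] > positions[j]
    )
    max_inversions = n * (n - 1) / 2
    return 1 - inversions / max_inversions


def structure_similarity(a, b, form_weight=0.7):
    """Weighted sum of the two components; the weight is a hypothetical choice."""
    return (form_weight * word_form_similarity(a, b)
            + (1 - form_weight) * word_order_similarity(a, b))
```

For example, `word_order_similarity(["a", "b", "c"], ["c", "b", "a"])` is 0.0 because every pair of shared tokens is inverted, while two identical sentences score 1.0 on both components. For Chinese input the sentences would first need to be segmented into tokens, which this sketch leaves to the caller.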
[Degree-granting institution]: Taiyuan University of Technology
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TP391.1