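Research on a Multi-Strategy Term Extraction Method for Academic Papers
Topics: multi-strategy + term extraction; Source: Huazhong University of Science and Technology, master's thesis, 2016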
[Abstract]: How to extract terms quickly and accurately is an important problem in natural language processing. Research on term extraction for academic papers can effectively promote scientific progress and the dissemination of results. In academic papers, terms have different distribution characteristics in different text blocks, such as the title, keywords, and abstract. Traditional term extraction methods ignore the positional information of term distribution, so a method that takes term position into account is urgently needed to make up for the shortcomings of existing methods.
This thesis proposes TEM, a multi-strategy term extraction method for academic papers. The method first applies a separate candidate term extraction strategy to the title, abstract, and keywords according to their different characteristics: one based on a boundary marker set, one based on Chinese term formation rules, and one based on keywords, respectively. It then analyzes the candidate extraction results and their error types, and introduces a dictionary of term counterexample rules to improve the results. Next, the candidate terms are filtered with the K-near-frequency substring merging algorithm. Finally, using the positional information of terms, a comprehensive scoring model is built in which the analytic hierarchy process (AHP) determines the weights of the three dimensions (title, abstract, and keywords), and the correct terms are obtained from the final score ranking. In addition, for single-word terms, the category frequency CF is introduced on top of the TF-IDF algorithm, which improves filtering.
The experiments tested the effect of varying K on substring merging and compared the term extraction results after introducing CF and position information. The results show that, compared with traditional methods, the TF-IDF-CF method improves precision and recall by 5.73% and 8.43%, respectively; the TEM-SW method improves precision and recall by 7.85% and 11.54%, respectively; and the TEM-MW method improves precision and recall by 11.62% and 9.71%, respectively. The proposed method thus extracts terms from academic papers more effectively.
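The abstract describes the final step only at a high level: AHP assigns a weight to each of the three text blocks, and candidate terms are ranked by a combined score. The sketch below is one plausible reading of that step, not the thesis's actual model. The pairwise comparison matrix, the per-block scores, the weighted-sum form of the combined score, and the function names `ahp_weights`, `combined_score`, and `rank_terms` are all assumptions introduced for illustration. The weights are approximated with the row geometric-mean method, a common shortcut for the AHP principal eigenvector.

```python
import math

BLOCKS = ("title", "abstract", "keywords")


def ahp_weights(pairwise):
    """Approximate AHP priority weights with the row geometric-mean method.

    pairwise[i][j] states how much more important block i is than block j.
    """
    n = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def combined_score(block_scores, weights):
    """Weighted sum of a term's per-block scores (assumed scoring form)."""
    return sum(weights[b] * block_scores.get(b, 0.0) for b in weights)


def rank_terms(candidates, weights):
    """Return (term, score) pairs sorted from highest to lowest score."""
    scored = [(term, combined_score(s, weights)) for term, s in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical judgments, in BLOCKS order: keywords are twice as
    # important as the title, and the title twice as important as the abstract.
    pairwise = [
        [1.0, 2.0, 0.5],   # title    vs (title, abstract, keywords)
        [0.5, 1.0, 0.25],  # abstract vs (title, abstract, keywords)
        [2.0, 4.0, 1.0],   # keywords vs (title, abstract, keywords)
    ]
    weights = dict(zip(BLOCKS, ahp_weights(pairwise)))

    # Hypothetical per-block scores in [0, 1] for three candidate terms.
    candidates = {
        "term extraction": {"title": 1.0, "abstract": 0.8, "keywords": 1.0},
        "substring merging": {"abstract": 0.6},
        "scoring model": {"abstract": 0.5, "keywords": 1.0},
    }
    for term, score in rank_terms(candidates, weights):
        print(f"{term}: {score:.3f}")
```

With these illustrative judgments, a term that appears in the keywords outranks one that appears only in the abstract. In practice the per-block scores would come from the candidate extraction and filtering stages, and the pairwise matrix would be checked for consistency before its weights are used.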
[Degree-granting institution]: Huazhong University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP391.1
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2107420.html