Research on Question Classification and Answer Extraction for Question Answering Systems

Published: 2018-06-02 13:07

Topic: question answering systems + question classification; Source: Northeastern University, 2013 master's thesis


[Abstract]: With advances in artificial intelligence, information retrieval, and natural language processing, research on question answering (QA) systems has made considerable progress. The QA evaluation tasks organized by conferences such as TREC have driven this progress further. Unlike English, Chinese has no widely adopted QA evaluation, and relevant datasets are scarce, so research on Chinese QA systems lags behind. This thesis uses an online search engine for answer retrieval, and its main contributions are question analysis and answer extraction for Chinese QA systems.

For question analysis, the thesis first proposes a stop-word selection method based on word combinations and question categories. Stop words are first extracted from phrases composed of n words, the question category is taken into account during extraction, and the procedure iterates by progressively decreasing n. This method achieved good results on the thesis's dataset.

Next, building on the idea behind TF-IDF, the thesis proposes TFC-ICF, a feature selection method for question classification. TFC-ICF jointly considers how well a word identifies a given category and how the word is distributed across categories, so that higher-quality classification features can be selected. Using an SVM-based classifier for automatic classification, the feature words selected by TFC-ICF reach a question classification accuracy of 80.45%. To improve classification further, the thesis takes TFC-ICF as the baseline and proposes three additional methods: manual feature selection, feature selection based on keyword expansion, and feature selection using syntactic information. For the latter two, several different ways of using the features are tested. Combined with TFC-ICF, these three methods reach maximum question classification accuracies of 86.01%, 85.14%, and 82.13%, respectively.

For answer extraction, the thesis first discusses how to select candidate answer sentences using sentence similarity computed with a vector space model. It then uses entity recognition to extract, from the candidate answer sentences, the entities that match the question category. Finally, it proposes an answer extraction method based on sentence similarity and entity information, which achieved good experimental results on the NTCIR5 CLQA test set.

The thesis concentrates on question classification and answer extraction and obtains some results, but problems remain: the question dataset is of poor quality, entity recognition is not yet fully satisfactory, and the final answer extraction performance falls short of ideal.
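The candidate-sentence selection step mentioned in the abstract can be illustrated with a minimal sketch. It assumes raw term-frequency weights and pre-tokenized sentences (Chinese text would need word segmentation first); the function names `term_vector`, `cosine_similarity`, and `rank_candidates` are hypothetical, and the thesis may use a different weighting scheme.

```python
import math
from collections import Counter


def term_vector(tokens):
    """Map a tokenized sentence to a bag-of-words term-frequency vector."""
    return Counter(tokens)


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(count * vec_b[term] for term, count in vec_a.items() if term in vec_b)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty sentence has no direction, so treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(question_tokens, candidate_sentences, top_k=3):
    """Return (score, tokens) pairs for the top_k sentences closest to the question."""
    q_vec = term_vector(question_tokens)
    scored = [
        (cosine_similarity(q_vec, term_vector(tokens)), tokens)
        for tokens in candidate_sentences
    ]
    # Assumption: plain descending sort by similarity, with no extra tie-breaking.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Calling `rank_candidates(question_tokens, sentences, top_k=3)` returns the highest-scoring sentences, which a later step could scan for entities matching the question category.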
[Degree-granting institution]: Northeastern University
[Degree level]: Master's
[Year degree granted]: 2013
[Classification number]: TP391.3
