Research on a Chinese FAQ Question Answering System Based on Phrase-Syntax Chunks
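Keywords: Chinese question answering system; restricted domain; question classification; chunk; edit distance; question similarity. Source: Kunming University of Science and Technology, 2013 master's thesis. Document type: degree thesis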
[Abstract]: Question answering (QA) is an important direction in natural language processing that aims to let users ask questions directly in natural language and receive answers. Compared with traditional keyword-based search engines, automatic QA systems have significant advantages. In a restricted domain, an FAQ-based (frequently asked questions) QA system organizes the questions users ask most often together with their answers, which makes locating the answer to a question more accurate, faster, and more efficient. Such systems have important application prospects in many areas of daily life and are a current research focus. This thesis applies natural language processing techniques to study key QA technologies for restricted domains, namely Chinese question classification, question chunk analysis, and question similarity computation, and on that basis implements a prototype FAQ QA system for the Yunnan tourism domain. The main contributions are as follows.

(1) Traditional probabilistic and statistical methods for question classification train the classifier only on the frequency of feature words in the question and ignore the semantic relations between words. To address this, the thesis proposes a question classification method that combines semantic similarity with a hidden Markov sequence analysis model. The method first extracts the feature word sets of all question categories as the observation sequences of the different hidden Markov model (HMM) classifiers. It then takes the formation and evolution process of each category's feature word set as the state transition sequence. Finally, it uses a word semantic similarity measure to compute the observation probability distribution of the feature words under the states of each category, and builds a separate HMM classification model for each question type. Classification experiments on tourism-domain questions show that the proposed method achieves some improvement in accuracy over existing methods.

(2) Existing chunk analysis methods rely mainly on the surface form of words and on statistical features, and do not consider the syntactic structure characteristics of different question types. To address this, the thesis proposes a Chinese question chunk analysis method based on phrase syntax trees. Starting from the question category already obtained, the method first combines the way the question is asked with its lexical features to analyze sentence patterns and summarize the structural forms of the different question types. It then uses a phrase-structure parser to generate the phrase syntax tree of the question. Finally, drawing on the characteristics of domain questions, it applies custom chunk rules to identify and label the chunks in each domain question. Experimental results show that the method performs well.

(3) Existing Chinese sentence similarity methods do not make full use of lexical semantic information and sentence structure information. To address this, the thesis proposes a domain question similarity method based on an improved edit distance. The method uses chunks instead of characters as the basic edit unit, assigns different weights to different words according to the characteristics of domain questions, measures the substitution cost between two chunks by computing the similarity of the words inside them with HowNet, and assigns different insertion and deletion costs to different chunk types. Experimental results show that the method performs well.

(4) Using the results above and taking the Yunnan tourism domain as an example, domain questions are classified, chunked, and labeled, and a prototype Yunnan tourism FAQ QA system is designed and implemented.
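The chunk-level edit distance in contribution (3) can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the `Chunk` class, the chunk labels, the cost values in `INDEL_COST`, the averaging scheme in `substitution_cost`, and the normalization in `question_similarity` are all assumptions made for the example. `word_similarity` is a placeholder that only recognizes exact matches; the thesis uses a HowNet-based measure there, and its per-word weights are omitted here.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Chunk:
    label: str               # chunk type, e.g. "NP" (hypothetical tag set)
    words: Tuple[str, ...]   # segmented words inside the chunk


# Hypothetical insertion/deletion costs per chunk type (assumed values).
INDEL_COST = {"NP": 1.0, "VP": 0.8, "QP": 0.6}
DEFAULT_INDEL = 0.5


def word_similarity(a: str, b: str) -> float:
    """Placeholder for a HowNet-based similarity: exact match only."""
    return 1.0 if a == b else 0.0


def indel_cost(chunk: Chunk) -> float:
    """Cost of inserting or deleting a chunk, depending on its type."""
    return INDEL_COST.get(chunk.label, DEFAULT_INDEL)


def substitution_cost(c1: Chunk, c2: Chunk) -> float:
    """1 minus the symmetric average best-match word similarity."""
    if not c1.words or not c2.words:
        return 1.0

    def directed(xs: Sequence[str], ys: Sequence[str]) -> float:
        return sum(max(word_similarity(x, y) for y in ys) for x in xs) / len(xs)

    sim = (directed(c1.words, c2.words) + directed(c2.words, c1.words)) / 2
    return 1.0 - sim


def chunk_edit_distance(q1: Sequence[Chunk], q2: Sequence[Chunk]) -> float:
    """Weighted Levenshtein distance with chunks as the edit units."""
    m, n = len(q1), len(q2)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + indel_cost(q1[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + indel_cost(q2[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + indel_cost(q1[i - 1]),                     # delete
                d[i][j - 1] + indel_cost(q2[j - 1]),                     # insert
                d[i - 1][j - 1] + substitution_cost(q1[i - 1], q2[j - 1]),  # substitute
            )
    return d[m][n]


def question_similarity(q1: Sequence[Chunk], q2: Sequence[Chunk]) -> float:
    """Map the distance to [0, 1] by dividing by the delete-all/insert-all cost."""
    worst = sum(indel_cost(c) for c in q1) + sum(indel_cost(c) for c in q2)
    if worst == 0:
        return 1.0
    return 1.0 - chunk_edit_distance(q1, q2) / worst
```

Under these assumptions, two questions with identical chunk sequences have distance 0 and similarity 1. Dropping a chunk costs its type-specific deletion weight, so losing a content-bearing chunk can be penalized more than losing a function-like one.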
[Degree-granting institution]: Kunming University of Science and Technology
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.1