Answer Selection for Non-Factoid Questions

Published: 2018-08-07 13:50

[Abstract]: With the rise of community question answering websites, more and more user-generated data has accumulated. This data is not only massive and diverse but also of high quality and high reuse value. To manage and exploit it efficiently, researchers have carried out a great deal of research and practice on it in recent years, and community question answering (CQA) is one widely studied topic.
CQA research builds on data from question answering communities and differs markedly from traditional question answering systems. Traditional systems mainly handle factoid questions whose answers are phrases or named entities, and their main modules are question understanding and answer extraction. CQA has no such restriction and is particularly well suited to non-factoid questions that ask for advice or opinions. CQA research covers question retrieval and recommendation, question interestingness, question and answer quality, answer ranking, and user authority, among other directions. Of these, question retrieval and answer selection are the core modules of CQA and have drawn wide attention from both academia and industry.
The main work of this thesis is to build a CQA system on large-scale question answering community data, and to study in depth the question analysis, question retrieval, and answer selection techniques involved.
To build the system, this work collected a large-scale dataset of more than 130 million questions and 1 billion answers from Yahoo! Answers and other community websites. This sets it apart from earlier CQA research, which was based on data at the scale of millions, and gives it high practical value. On top of this data, automatic query classification is used to improve the efficiency and effectiveness of each query.
For question retrieval, this work proposes using the structural and semantic information of the query question and the candidate questions, combined with a learning-to-rank algorithm that fuses several different categories of features. A ranking model is trained from training data to improve retrieval relevance and to mitigate the word mismatch problem. Experiments show that the ranking model trained with Ranking SVM achieves significant improvements over previous methods in accuracy and other evaluation metrics on different datasets.
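The pairwise idea behind Ranking SVM can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the function name `train_ranking_svm`, the hyperparameters `C`, `epochs`, and `lr`, and the plain subgradient-descent solver are all assumptions, and the feature matrix stands in for whatever structural and semantic features the thesis actually uses. Each training row describes one (query, candidate question) pair; the model learns a weight vector that scores more relevant candidates above less relevant ones for the same query.

```python
import numpy as np


def train_ranking_svm(X, relevance, query_ids, C=1.0, epochs=200, lr=0.01):
    """Fit a linear scoring vector w with a pairwise hinge loss.

    X: (n, d) feature matrix, one row per (query, candidate question) pair.
    relevance: (n,) graded relevance labels; higher means more relevant.
    query_ids: (n,) identifier of the query each row belongs to.
    All hyperparameters are illustrative assumptions.
    """
    X = np.asarray(X, dtype=float)
    relevance = np.asarray(relevance)
    query_ids = np.asarray(query_ids)

    # Build difference vectors x_i - x_j for every within-query pair in
    # which candidate i is more relevant than candidate j.
    diffs = []
    for q in np.unique(query_ids):
        idx = np.flatnonzero(query_ids == q)
        for i in idx:
            for j in idx:
                if relevance[i] > relevance[j]:
                    diffs.append(X[i] - X[j])
    if not diffs:
        raise ValueError("no preference pairs found")
    D = np.vstack(diffs)

    # Minimise 0.5 * ||w||^2 + C * mean(max(0, 1 - w . d)) over the pairs
    # by batch subgradient descent.
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        violated = (D @ w) < 1.0
        grad = w - C * D[violated].sum(axis=0) / len(D)
        w -= lr * grad
    return w
```

At query time, candidates would be sorted by `X_candidates @ w` in descending order. A production system would more likely use an off-the-shelf solver; the loop above is only meant to make the pairwise objective concrete.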
Once question retrieval has found candidate questions that are semantically similar to the query, this work further proposes a new unsupervised learning method, based on the content of question-answer pairs, that judges answer quality in order to filter out low-quality answers. It makes three assumptions about question answering community data: (1) most answers to a question are normal, and only a few are low quality and need to be filtered out; (2) low-quality answers can be detected by comparing them with the other answers to the same question; (3) different answers should have different criteria for judging answer quality. Under these assumptions, the method uses content-based features and detects low-quality answers by minimizing the variance of the answer feature vectors while retaining as many answers as possible. Experiments show a clear improvement in ROC value over the baseline methods.
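The abstract does not say how the variance objective is optimized, so the following is only one possible greedy reading of "minimize variance while keeping as many answers as possible". The function name `filter_low_quality` and the parameters `penalty` and `min_keep` are hypothetical. For the answers to a single question, it repeatedly considers dropping the answer farthest from the centroid of the kept set, and drops it only when the resulting variance reduction outweighs a fixed cost for discarding an answer.

```python
import numpy as np


def filter_low_quality(features, penalty=0.5, min_keep=0.5):
    """Return a boolean mask marking answers kept as normal quality.

    features: (n, d) content-feature matrix for the answers to ONE question.
    penalty: hypothetical cost of discarding one answer, expressed as a
        fraction of the initial total variance.
    min_keep: hypothetical floor on the fraction of answers retained.
    """
    F = np.asarray(features, dtype=float)
    n = len(F)
    keep = np.ones(n, dtype=bool)
    if n < 3:
        return keep  # too few answers to compare against each other

    def total_variance(mask):
        # Sum of per-feature variances over the kept answers.
        return F[mask].var(axis=0).sum()

    base = total_variance(keep)
    if base == 0.0:
        return keep  # all answers are identical in feature space

    floor = max(2, int(np.ceil(min_keep * n)))
    while keep.sum() - 1 >= floor:
        kept_idx = np.flatnonzero(keep)
        centroid = F[kept_idx].mean(axis=0)
        # Candidate for removal: the kept answer farthest from the centroid.
        dists = np.linalg.norm(F[kept_idx] - centroid, axis=1)
        worst = kept_idx[np.argmax(dists)]
        trial = keep.copy()
        trial[worst] = False
        gain = (total_variance(keep) - total_variance(trial)) / base
        # Stop once the variance reduction no longer outweighs the cost
        # of keeping one fewer answer.
        if gain <= penalty:
            break
        keep = trial
    return keep
```

Because the comparison is made only among answers to the same question, the sketch reflects assumptions (1) and (2) above: most answers are treated as normal, and an answer is flagged only relative to its siblings.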
After low-quality answers are filtered out, this work uses the text of the question-answer pairs together with the authority of the answerers on the community website to train an answer ranking model, using Ranking SVM and the best-answer data selected by community users; the model re-ranks the answers to select the best one. Through these steps, this work builds an efficient and practical community question answering system. In a test on 300 high-frequency questions taken from the query logs of a commercial search engine, the system gave correct answers to 78.0% of the questions, and it returns a result for any question within 2 seconds, showing good effectiveness and practicality.
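One way user-selected best answers could be turned into training data for a pairwise ranker such as Ranking SVM is sketched below. The `Answer` record, its field names, and the choice to append a single authority score to the text features are all illustrative assumptions; the abstract does not specify the feature set. For every question that has a best answer, the best answer is paired as "preferred" against each of the other answers to that question.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Answer:
    question_id: str
    text_features: List[float]  # hypothetical text features of the QA pair
    authority: float            # hypothetical answerer authority score
    is_best: bool               # True if users chose it as the best answer


def feature_vector(a: Answer) -> List[float]:
    # Concatenate text features with the authority score.
    return list(a.text_features) + [a.authority]


def preference_pairs(
    answers: List[Answer],
) -> List[Tuple[List[float], List[float]]]:
    """Return (preferred, other) feature-vector pairs.

    Questions without a user-selected best answer contribute no pairs.
    """
    by_question: Dict[str, List[Answer]] = {}
    for a in answers:
        by_question.setdefault(a.question_id, []).append(a)

    pairs = []
    for group in by_question.values():
        best = [a for a in group if a.is_best]
        rest = [a for a in group if not a.is_best]
        for b in best:
            for r in rest:
                pairs.append((feature_vector(b), feature_vector(r)))
    return pairs
```

A pairwise learner would then be trained so that the preferred vector in each pair scores higher than the other; at answer time, the remaining answers are sorted by that score and the top one is returned.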
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.1

Article ID: 2170221

Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2170221.html
