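Research on the Application of Wikipedia in IR4QA Systems
Published: 2018-06-29 00:07
Topic: question answering system + IR4QA; Reference: Wuhan University of Science and Technology, 2012 master's thesis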
[Abstract]: A question answering (QA) system is a new generation of intelligent search engine: it lets users ask questions in natural language and returns accurate answers. Compared with a traditional search engine, a QA system can therefore better satisfy users' query needs and retrieve the answers they want more accurately. Building on the work done for NTCIR8, this thesis studies question understanding and information retrieval, the two main parts of a Chinese QA system that together make up the IR4QA stage, and implements a complete IR4QA system.

Question understanding is a research topic for every system with a natural language interface, and it is the first stage a QA system executes; its analysis results strongly influence the stages that follow. Information retrieval sits in the middle of the QA pipeline, and its output largely determines the quality of the results produced by the subsequent modules. This thesis compares and analyzes the problems that currently exist in these two stages of typical QA systems and identifies more effective methods to apply in our system.

Building on previous research, the thesis makes the following contributions:
(1) It surveys and analyzes research, in China and abroad, on automatic QA systems and search engine technology. Combining the strengths of both kinds of system, and targeting the problems users face with search engines (cluttered results, long search times, and low accuracy), it proposes applying Wikipedia to automatic question answering, that is, a Wikipedia-based IR4QA system, and designs and implements that system.
(2) Based on an analysis of the results the system should ultimately achieve, a set of practical methods was established early in the design. On that basis, and following a layered, modular design approach, the design principles were fixed and the system was divided into five modules: index generation, question analysis, query expansion, document retrieval, and document re-ranking.
(3) It studies the key technologies involved in the system, builds up the theoretical and technical groundwork for the difficulties encountered during implementation, and proposes practical solutions.
(4) For question classification, given the characteristics of the questions in the question set and the heavy workload of Chinese syntactic and semantic analysis, and in order to improve system quality, the system does not use the machine-learning question classification methods common in English QA systems. Instead it uses heuristic rules that work by recognizing the interrogative words in a question. This achieves good recognition for the syntactically simple questions in the question set.
(5) To handle vocabulary mismatch between the question and the documents being searched, it adopts a Wikipedia-based query expansion method consisting of Wikipedia page lookup, relevant paragraph location, and expansion term selection. Comparative experiments show that this method effectively improves the quality of the retrieval results.
(6) To further improve the accuracy of the retrieval results, the document re-ranking module uses the BM25 algorithm to re-rank the retrieved documents, and the re-ranked list is the final retrieval result.
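Item (4) of the abstract describes classifying questions with heuristic rules that recognize interrogative words. The following is a minimal sketch of that idea, not the thesis's actual rule set: the rule table `RULES`, the category labels, and the function name `classify_question` are all hypothetical and chosen only for illustration.

```python
# Hypothetical rule table: (interrogative pattern, question type).
# The patterns and labels are illustrative assumptions, not the thesis's rules.
RULES = [
    ("什么时候", "DATE"),
    ("什么地方", "LOCATION"),
    ("什么关系", "RELATIONSHIP"),
    ("哪一年", "DATE"),
    ("是什么", "DEFINITION"),
    ("哪里", "LOCATION"),
    ("谁", "PERSON"),
]

# Try longer patterns first so that a specific pattern such as "什么关系"
# is matched before a shorter, more general one such as "是什么".
_SORTED_RULES = sorted(RULES, key=lambda rule: len(rule[0]), reverse=True)


def classify_question(question: str) -> str:
    """Return the type of the first (longest) interrogative pattern found."""
    for pattern, question_type in _SORTED_RULES:
        if pattern in question:
            return question_type
    return "OTHER"  # no known interrogative word was recognized


if __name__ == "__main__":
    for q in ["谁发明了电话?", "奥运会什么时候举行?", "A和B是什么关系?"]:
        print(q, "->", classify_question(q))
```

Plain substring matching like this needs no word segmentation or parsing, which fits the abstract's point that rules are cheap for syntactically simple questions; it would likely misfire on more complex sentences.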
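Item (5) of the abstract names three steps for Wikipedia-based query expansion: page lookup, relevant paragraph location, and expansion term selection. The sketch below walks through those three steps under strong simplifying assumptions: `WIKI` is a tiny in-memory stand-in for Wikipedia, tokens are split on whitespace (Chinese text would need word segmentation first), and the scoring choices (term overlap for paragraphs, raw frequency for terms) are illustrative rather than the thesis's method.

```python
from collections import Counter

# Toy stand-in for Wikipedia: page title -> list of paragraphs (made-up data).
WIKI = {
    "bm25": [
        "BM25 is a ranking function used by search engines to estimate relevance.",
        "The function is based on the probabilistic retrieval framework.",
    ],
}

STOPWORDS = {"a", "an", "the", "is", "of", "to", "by", "on", "in", "and"}


def tokenize(text):
    """Lowercase whitespace tokenizer that strips simple punctuation."""
    tokens = (w.strip(".,?!").lower() for w in text.split())
    return [t for t in tokens if t]


def expand_query(query, wiki=WIKI, max_terms=3):
    """Return the query terms followed by up to max_terms expansion terms."""
    q_terms = tokenize(query)
    # Step 1: page lookup. Keep pages whose title equals a query term.
    pages = [wiki[t] for t in q_terms if t in wiki]
    if not pages:
        return q_terms  # nothing to expand with
    # Step 2: paragraph location. Pick the paragraph sharing most query terms.
    paragraphs = [p for page in pages for p in page]
    best = max(paragraphs, key=lambda p: len(set(tokenize(p)) & set(q_terms)))
    # Step 3: term selection. Most frequent non-stopword terms not in the query.
    counts = Counter(
        t for t in tokenize(best) if t not in STOPWORDS and t not in q_terms
    )
    return q_terms + [term for term, _ in counts.most_common(max_terms)]


if __name__ == "__main__":
    print(expand_query("what is bm25"))
```

The expanded term list would then be sent to the document retrieval module in place of the original query terms.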
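Item (6) of the abstract says the retrieved documents are re-ranked with BM25. As a rough illustration, here is a compact Okapi BM25 scorer. The abstract does not give the parameter values or the IDF variant, so `k1=1.5`, `b=0.75`, and the non-negative IDF form `ln(1 + (N - n + 0.5) / (n + 0.5))` are assumptions, as is the function name `bm25_rerank`.

```python
import math
from collections import Counter


def bm25_rerank(query_terms, docs, k1=1.5, b=0.75):
    """Re-rank tokenized documents by BM25 score.

    docs is a list of token lists. Returns document indices, best first.
    k1 and b are commonly used defaults, assumed here for illustration.
    """
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue  # term absent from this document
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    docs = [
        ["wikipedia", "is", "an", "encyclopedia"],
        ["bm25", "ranking", "function"],
        ["bm25", "bm25", "ranking", "of", "documents", "in", "search"],
    ]
    print(bm25_rerank(["bm25", "ranking"], docs))
```

In this toy run the short document that matches both query terms ranks above the longer one, because BM25's length normalization discounts term matches in longer-than-average documents.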
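[Degree-granting institution]: Wuhan University of Science and Technology
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP391.3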
Article ID: 2079950
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2079950.html