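Research on Key Technologies of Question Retrieval in Community Question Answering
Topics: community question answering + question retrieval; Source: Harbin Institute of Technology, 2014 doctoral dissertation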
[Abstract]: With the advent of the Web 2.0 era, community question answering (CQA) has gradually become an essential way for people to acquire knowledge and information online. Compared with Internet search engines, CQA directly returns answers to questions that users pose in natural language, rather than a list of retrieval results that users must sift through themselves. Compared with traditional open-domain question answering systems, the answers in CQA are written by real users, and their quality is higher than that of answers automatically extracted and generated from candidate documents. Moreover, because CQA services have accumulated a large number of question-answer pairs, the core problem and key technology of CQA lies in retrieving similar, already-answered questions and returning their answers, a task we call question retrieval.
However, question retrieval in CQA faces three main challenges: user intent is hard to understand because user questions are verbose; terms fail to match across questions because users express the same question in diverse ways; and ranking relies solely on textual relevance because the community attributes of questions are not considered. In this thesis, we address these three problems from the following four aspects, thereby improving the overall performance of question retrieval in CQA.
Chapter 2 proposes a term-importance weighting method based on a dependency-relation graph, which addresses the verbosity of user question queries in CQA. Specifically, a major problem of existing question retrieval models based on term weighting is that they ignore the relationships between terms when computing term weights. To solve this problem, we propose a new term weighting mechanism that uses the syntactic dependency relations between terms as clues. For a given question, we first construct a dependency graph to compute the association strength of each term pair, and then update the conventional term weights according to the dependency association. We verify that the updated term weights can be effectively integrated into existing question retrieval models, and the experimental results show a significant improvement over state-of-the-art question retrieval models.
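The abstract does not give the weighting formula, so the following is only a minimal sketch of the idea under stated assumptions: dependency edges are supplied as (head, dependent) pairs from any parser, the base weights are IDF-like scores, and each term's weight is blended with the mean base weight of its dependency neighbours. The function name `reweight_terms` and the `damping` parameter are hypothetical and are not the thesis's notation.

```python
from collections import defaultdict


def reweight_terms(base_weights, dependency_edges, damping=0.5):
    """Blend each term's base weight with the mean base weight of the
    terms it is linked to in the dependency graph.

    ASSUMPTION: this one-step neighbour averaging is an illustrative
    stand-in for the thesis's update rule, which the abstract does not state.
    """
    # Build an undirected adjacency list over terms that have a base weight.
    neighbours = defaultdict(set)
    for head, dependent in dependency_edges:
        if head in base_weights and dependent in base_weights:
            neighbours[head].add(dependent)
            neighbours[dependent].add(head)

    updated = {}
    for term, weight in base_weights.items():
        linked = neighbours.get(term, set())
        if linked:
            neighbour_mean = sum(base_weights[n] for n in linked) / len(linked)
        else:
            # Isolated terms keep their base weight.
            neighbour_mean = weight
        updated[term] = (1 - damping) * weight + damping * neighbour_mean
    return updated


# Hypothetical verbose query: "please fix laptop"
example_weights = {"please": 0.5, "fix": 1.0, "laptop": 3.0}
example_edges = [("fix", "laptop")]
example_updated = reweight_terms(example_weights, example_edges)
```

In this toy query the filler word "please" has no dependency link to a content word, so its weight stays low, while "fix" is pulled up by its link to "laptop".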
Chapter 3 proposes a question reformulation model based on phrase-level paraphrasing, which improves the overall effect of question query expansion. Specifically, term mismatch in question retrieval, caused by the diversity of language expression, has become an urgent problem in CQA. To solve it, we propose a question reformulation mechanism based on phrase-level paraphrasing. Given a question query, we first identify its key phrases by combining corpus statistics with features drawn from clues inside the question. Next, we extract paraphrases of the key phrases by fusing the translation results of multiple online translation engines. Finally, we propose a decoding-based question reformulation method that generates reformulated questions on the basis of the fused key phrases. Improved question retrieval results on a CQA data set verify the effectiveness of the proposed reformulation algorithm, which significantly outperforms state-of-the-art question retrieval models.
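The final decoding step can be pictured with a small beam search. This is a sketch under assumptions, not the thesis's decoder: the paraphrase table (key phrase mapped to scored paraphrases) is taken as given, scores are multiplied as if they were independent probabilities, and phrases are matched by plain substring search. The names `reformulate`, `paraphrase_table`, and `beam_size` are hypothetical.

```python
def reformulate(question, paraphrase_table, beam_size=3):
    """Return up to beam_size (reformulated question, score) pairs, best first.

    ASSUMPTION: paraphrase_table maps a key phrase to a list of
    (paraphrase, probability) pairs that were extracted beforehand.
    """
    beam = [(question, 1.0)]
    for phrase, paraphrases in paraphrase_table.items():
        # Leaving the phrase unchanged is always a candidate.
        expanded = list(beam)
        for text, score in beam:
            if phrase not in text:  # naive substring match, for illustration only
                continue
            for paraphrase, prob in paraphrases:
                expanded.append((text.replace(phrase, paraphrase), score * prob))
        # Keep only the highest-scoring partial reformulations.
        expanded.sort(key=lambda item: item[1], reverse=True)
        beam = expanded[:beam_size]
    return beam


# Hypothetical paraphrase table for one query.
candidates = reformulate(
    "how to fix laptop screen",
    {"fix": [("repair", 0.8)], "laptop": [("notebook", 0.6)]},
)
```

Each candidate can then be issued as an additional query, which is one way the reformulations could reduce term mismatch against archived questions.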
Chapter 4 proposes a topic translation and clustering model that expands the terms in a question query. Specifically, the relevance ranking of question retrieval models based on statistical machine translation relies mainly on translation probabilities between terms, but existing translation models do not control the translation noise between terms well, which leaves current question retrieval models imperfect. We propose a question retrieval model based on topic translation and clustering and show theoretically that, by using topic inference and the similarity between topics, it controls the noise of the translation model and thus improves question retrieval results. Experiments show that the proposed model significantly outperforms state-of-the-art question retrieval models on MAP, MRR, and P@1.
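For readers unfamiliar with the three reported metrics, the sketch below computes them from ranked lists of relevance judgments. It assumes binary relevance and that every relevant question for a query appears somewhere in its ranked list, so average precision is normalised by the number of relevant items retrieved. The function names are illustrative and do not come from the thesis.

```python
def precision_at_1(ranked_relevance):
    """1.0 if the top-ranked result is relevant, else 0.0."""
    return 1.0 if ranked_relevance and ranked_relevance[0] else 0.0


def reciprocal_rank(ranked_relevance):
    """1 / rank of the first relevant result, or 0.0 if there is none."""
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked_relevance):
    """Mean of precision@k over the ranks k that hold a relevant result.

    ASSUMPTION: all relevant items are present in the ranked list.
    """
    hits, total = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def evaluate(runs):
    """Average the per-query metrics over a list of ranked relevance lists."""
    if not runs:
        raise ValueError("at least one query is required")
    n = len(runs)
    return {
        "MAP": sum(average_precision(r) for r in runs) / n,
        "MRR": sum(reciprocal_rank(r) for r in runs) / n,
        "P@1": sum(precision_at_1(r) for r in runs) / n,
    }


# Two hypothetical queries; True marks a relevant retrieved question.
scores = evaluate([[True, False, True], [False, True, False]])
```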
Chapter 5 introduces the problem of question popularity prediction and uses it to improve question retrieval results. Specifically, as CQA services have developed, they have accumulated a large number of high-quality question-answer pairs. These resources not only let users retrieve questions but, more importantly, allow users to interact with one another. Most research on CQA studies question retrieval based on the textual content of questions, and few studies examine how users' personal information and interaction behavior affect retrieval results. In CQA, the popularity of a question reflects users' attention, interests, and interaction behavior, so we predict question popularity to improve the user's experience of question retrieval. We first analyze and model the factors that affect question popularity in order to predict the popularity of new questions, and then re-rank the question retrieval results using the predicted popularity. Experimental results show that retrieval results re-ranked by popularity are better than those ranked by retrieval relevance alone.
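The re-ranking step might look like the sketch below. The abstract does not say how predicted popularity is combined with relevance, so a simple linear interpolation is assumed here, with both scores taken to be normalised to [0, 1]. The names `rerank_by_popularity` and `alpha` are hypothetical.

```python
def rerank_by_popularity(results, popularity, alpha=0.3):
    """Re-order (question_id, relevance) pairs by a blended score.

    ASSUMPTION: score = (1 - alpha) * relevance + alpha * predicted popularity,
    with both inputs in [0, 1]; questions without a prediction get 0.0.
    """
    def blended(item):
        question_id, relevance = item
        return (1 - alpha) * relevance + alpha * popularity.get(question_id, 0.0)

    return [question_id for question_id, _ in sorted(results, key=blended, reverse=True)]


# Hypothetical retrieval output and popularity predictions.
retrieved = [("q1", 0.9), ("q2", 0.8)]
predicted = {"q1": 0.1, "q2": 0.9}
reranked = rerank_by_popularity(retrieved, predicted)
```

Setting `alpha` to 0 reproduces the relevance-only ranking, which gives a direct baseline for the comparison described above.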
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Doctoral
[Year degree granted]: 2014
[Classification number]: TP391.3
Article ID: 2040980
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2040980.html