Research on Key Technologies of Web Community Question Answering Retrieval
Topics: community question answering service + answer summarization; Source: Fudan University, 2014 doctoral dissertation
[Abstract]: A community question answering (CQA) service lets people ask questions and obtain answers by interacting with one another in a web community. Because CQA systems hold a large body of knowledge and experience contributed by real human users, they have become a popular way of seeking information alongside traditional search engines. In a CQA system, a user can submit a question in natural language and seek answers directly from other community members, or can automatically retrieve questions similar to the one asked and reuse their existing answers. For most non-factoid questions, especially open-ended questions that carry personal context or ask for advice, question retrieval is often more effective than the traditional approach of using natural language processing and information retrieval to pull passages from web documents and extract answers from them. For this reason, retrieval of general questions in web communities has become an important component of next-generation intelligent information retrieval.

Sparse learning is a statistical learning approach that has emerged in recent years. Using sparse regularization as its main tool, this thesis studies a series of key technologies in community question answering. Specifically, it studies answer summarization for complex multi-sentence questions in web communities, automatic hierarchical topic classification of questions, and improvements to the question retrieval model. The main contributions are as follows.

1. Automatic answer summarization: Complex multi-sentence questions in a community, that is, questions that tend to contain many sub-questions together with their context, often suffer from the so-called "incomplete answer" problem: the "best answer" is not comprehensive and omits useful information that appears in the other answers. This thesis proposes a novel automatic answer summarization method that consolidates the valuable information from all answers to a question. The method uses a conditional random field model to capture local and non-local contextual relations between answer sentences, and penalizes the parameters with group L1 regularization so as to make full use of each feature.

2. Hierarchical question classification: When users submit a question to a CQA system, they are asked to manually choose a node in a hierarchical directory to indicate the question's topic category. This helps the system route the question to domain experts on that topic, and it makes later browsing and retrieval easier for other users. However, labelling a question by hand requires a thorough understanding of the whole category hierarchy, so it is time-consuming and harms the user experience. To remove this burden, this thesis proposes an automatic kernelized hierarchical topic classification algorithm for questions. It combines multiple kernel learning over the question's features with a sparse orthogonality constraint on the parameters, which improves the model's ability to discriminate between similar topic categories while reducing the number of model parameters.

3. Question retrieval model: To make existing questions in CQA archives more useful, this thesis studies how automatic classification results can improve question retrieval. Existing retrieval models usually measure the importance of a query term to a query by the term's frequency within the query, which does not help when every query term occurs only once. In contrast to existing retrieval methods, we use a sparse question classification method to simulate how real users assign hierarchical categories, and use that process to automatically select the important query terms and obtain their local weights with respect to the query. In addition, we re-rank the initial retrieval results based on the similarity between results, which further improves question retrieval performance.

Most of the methods in this thesis constrain the model parameters with sparsity-inducing regularization terms. This has several benefits. First, it reduces the number of model parameters. With fewer features, the model needs correspondingly less training data, is less prone to overfitting caused by too many parameters, and generalizes better to new data. Second, it makes the model more efficient. With fewer parameters, both the storage space for the model and the computation time decrease. Third, it helps reveal dependency relations. Once sparsification has removed distracting, irrelevant terms, the model can concentrate on the features that genuinely help inference. The sparse methods proposed here are therefore useful for CQA retrieval and may also inform other web applications such as verbose keyword queries, web document classification, and summarization.

A series of experiments on a real CQA data set, Yahoo! Answers, shows that the proposed methods achieve clearly higher accuracy than both current state-of-the-art methods and several strong baselines.
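To make the group L1 regularization mentioned in contribution 1 more concrete, the following is a minimal sketch of group soft-thresholding, the proximal operator of a group L1 (group lasso) penalty. It is a generic illustration, not the thesis's training procedure: the abstract does not say how features are grouped or which optimizer is used, so the function name `group_soft_threshold`, the grouping in `groups`, and the threshold `lam` are all assumptions made for this example.

```python
import math


def group_soft_threshold(weights, groups, lam):
    """Apply the proximal operator of lam * sum_g ||w_g||_2.

    weights: list of floats (model parameters)
    groups:  list of index lists partitioning the weights (hypothetical grouping)
    lam:     non-negative regularization threshold
    """
    result = list(weights)
    for idx in groups:
        # Euclidean norm of this group's parameters.
        norm = math.sqrt(sum(weights[i] ** 2 for i in idx))
        # Shrink the whole group; if its norm is at most lam, zero it out entirely.
        scale = max(0.0, 1.0 - lam / norm) if norm > 0.0 else 0.0
        for i in idx:
            result[i] = scale * weights[i]
    return result


# Illustrative values only: one strong feature group and one weak group.
w = [3.0, 4.0, 0.1, -0.2]
groups = [[0, 1], [2, 3]]
shrunk = group_soft_threshold(w, groups, lam=1.0)
print(shrunk)
```

In this example the first group (norm 5) is shrunk but kept, while the second group (norm below the threshold) is removed as a whole. This group-level sparsity is the general reason such penalties reduce the number of active parameters.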
[Degree-granting institution]: Fudan University
[Degree level]: Doctorate
[Year degree conferred]: 2014
[Classification number]: TP391.3
[Author]: Chan Wen (產文)
Article ID: 2029909
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2029909.html