A Domain-Specific Automatic Question Answering System Based on the LDA Model
Topics: word segmentation + LDA model; Source: Anhui University, master's thesis, 2013
[Abstract]: With the growth of the Internet, the amount of information it holds keeps increasing, and people generally want to find the information they need quickly. At the same time, the effective utilization rate of current search engines is not high, and their many shortcomings limit how efficiently people can obtain information. Automatic question answering systems can retrieve what a user is looking for more intelligently, quickly, and accurately, and in recent years they have become a widely studied topic among researchers in China and abroad.

This thesis aims to build an automatic question answering system for one domain, solutions to common computer faults, and examines the whole process from question processing to the final answer. The research finds that domain word segmentation and semantic similarity computation are the core of such a system, and that both leave considerable room for improvement relative to current system requirements and the state of research. The thesis concentrates on improving these two aspects. Each improvement is validated experimentally in its own section, and the experiments show that it does improve retrieval results. Finally, a prototype system is designed and implemented that automatically returns solutions to computer-fault questions posed by users.

First, the thesis discusses the methods commonly used for Chinese word segmentation. It analyzes the two classical approaches, dictionary-based and statistics-based segmentation, in depth, briefly introduces other methods, and compares their characteristics and effectiveness. It then proposes a segmentation method based on a domain dictionary and the mutual information of word strings. The method incorporates semantic information, takes the characteristics of domain-specific terminology into account, and uses the mutual information of word strings to resolve segmentation ambiguity. Experiments show that these improvements raise segmentation performance on domain text.

Second, the thesis briefly discusses the concept of semantic similarity and the principles for computing it, and studies similarity measures based on edit distance, on dependency relations, and on semantic distance and ontologies. It then proposes a new method that improves on these classical measures. The new method uses an LDA (Latent Dirichlet Allocation) model trained on a domain corpus to obtain a domain-specific word-topic distribution. Because it takes into account the semantic relatedness of words that fall under the same topic, the resulting similarity scores are more reliable.

Finally, the thesis presents the system design of the question answering system for solutions to common computer faults. A sound design gives the system framework high cohesion and low coupling, which greatly reduces the cost of upgrades and later maintenance. A demonstration version of the system was developed on Windows XP using the .NET Framework, and practical testing shows that it runs well.
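The abstract does not give the details of the segmentation method, so the following is only a minimal sketch of the general idea of combining a domain dictionary with mutual information to resolve ambiguity. It assumes that ambiguity is detected when forward and backward maximum matching disagree, and that the tie is broken by preferring cuts between weakly associated characters. The dictionary, the toy corpus, the add-one smoothing, and all function names are illustrative assumptions, not the thesis's implementation.

```python
import math

def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right longest-match segmentation."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:  # a single character always matches
                words.append(piece)
                i += size
                break
    return words

def backward_max_match(text, dictionary, max_len=4):
    """Greedy right-to-left longest-match segmentation."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    return words[::-1]

def smoothed_pmi(left, right, corpus):
    """Pointwise mutual information of two adjacent strings (add-one smoothed)."""
    total = sum(len(doc) for doc in corpus)
    count = lambda s: sum(doc.count(s) for doc in corpus)
    joint = count(left + right) + 1
    return math.log(joint * total / ((count(left) + 1) * (count(right) + 1)))

def segment(text, dictionary, corpus):
    """Use the dictionary first; fall back to mutual information on disagreement."""
    fwd = forward_max_match(text, dictionary)
    bwd = backward_max_match(text, dictionary)
    if fwd == bwd:
        return fwd
    # Assumption: a good cut separates characters with LOW mutual information.
    def cut_score(words):
        return sum(smoothed_pmi(a[-1], b[0], corpus) for a, b in zip(words, words[1:]))
    return min((fwd, bwd), key=cut_score)

# Hypothetical domain dictionary and toy corpus about computer faults.
DOMAIN_DICT = {"内存", "存在", "插槽", "松动"}
CORPUS = [
    "内存松动导致无法开机",
    "更换内存后故障消失",
    "内存在插槽中接触不良",
    "系统存在病毒",
    "风扇在运行时噪音很大",
]

print(segment("内存在插槽中松动", DOMAIN_DICT, CORPUS))
# ['内存', '在', '插槽', '中', '松动']
```

In this toy example the two matchers disagree on "内存在" ("内存/在" versus "内/存在"). Because "内" and "存" co-occur more strongly in the corpus than "存" and "在", the cut after "内存" is preferred.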
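The claim that a word-topic distribution makes similarity more reliable can be illustrated with a small sketch. It assumes an LDA model has already been trained and is available as a table of p(word | topic). The numbers in `TOPIC_WORD`, the uniform topic prior, and the use of cosine similarity are all assumptions made for illustration; the abstract does not state the exact formula used in the thesis.

```python
import math

# Hypothetical p(word | topic) values standing in for a trained LDA model.
TOPIC_WORD = {
    "display": {"screen": 0.30, "monitor": 0.30, "flicker": 0.20,
                "black": 0.15, "boot": 0.05},
    "network": {"wifi": 0.35, "router": 0.30, "connect": 0.25,
                "boot": 0.05, "screen": 0.05},
}

def topic_vector(words, topic_word, floor=1e-6):
    """Map a segmented question to a normalized distribution over topics."""
    topics = sorted(topic_word)
    vec = [0.0] * len(topics)
    for word in words:
        # p(topic | word) is proportional to p(word | topic) under a uniform prior.
        weights = [topic_word[t].get(word, floor) for t in topics]
        norm = sum(weights)
        for k, weight in enumerate(weights):
            vec[k] += weight / norm
    total = sum(vec)
    return [v / total for v in vec] if total else vec

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def topic_similarity(q1, q2, topic_word):
    """Compare two questions in topic space instead of by shared words."""
    return cosine(topic_vector(q1, topic_word), topic_vector(q2, topic_word))

related = topic_similarity(["monitor", "flicker"], ["screen", "black"], TOPIC_WORD)
unrelated = topic_similarity(["monitor", "flicker"], ["wifi", "router"], TOPIC_WORD)
print(round(related, 3), round(unrelated, 3))
```

The first pair of questions shares no words, so a purely word-overlap measure would score it as zero. In topic space the pair scores high because all four words belong mainly to the same topic, while the second pair scores close to zero.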
[Degree-granting institution]: Anhui University
[Degree level]: Master's
[Year degree conferred]: 2013
[Classification number]: TP391.1
Article ID: 2031272
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2031272.html