
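Research and Implementation of Answer Source Acquisition Methods for Open-Domain Question Answering Systems

Published: 2018-07-02 20:47

  Topics: automatic question answering system + answer source acquisition; Source: Taiyuan University of Technology, master's thesis, 2012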


[Abstract]: The Internet holds a rich and varied body of knowledge resources, which makes it much easier to seek help and obtain information when problems arise in everyday study and work. Search engines such as Google and Baidu are currently the main way people obtain information from the Web. As users demand more accurate information in less time, however, these traditional search engines have shown some weaknesses. They analyze a query as a combination of keywords, which can distort the user's search intent, and they return a large collection of web pages that the user must sift through, rather than the accurate, concise answer the user wants. Building on traditional search engines, a new generation of automatic question answering systems has become a research hotspot and trend in information retrieval because such systems are efficient and practical. They let users ask questions in natural language and they return a final answer, so they have high theoretical research value and broad application prospects.

An automatic question answering system generally consists of three modules: question analysis, information retrieval, and answer extraction. Answer extraction is the final, key step, and how well it is done determines whether the answer submitted to the user is accurate and delivered efficiently. This thesis studies methods for acquiring the answer sources that this final step relies on. Drawing on previous research, it examines web page crawling, web page deduplication, and web page information extraction. The main work is as follows:

(1) For a question posed by a user, the system searches the Web for the corresponding answer pages on top of a traditional search engine platform and saves the relevant pages locally. In the experimental design, we use the Baidu Zhidao (百度知道) knowledge base: a crawler program follows URL links in depth and in breadth according to the corresponding crawling algorithm and fetches a certain number of pages, which form the answer source library for the subsequent information extraction.

(2) During crawling, the many web pages with identical or similar content on the Web increase system overhead and reduce efficiency. Drawing on previous work on web page deduplication, the thesis introduces text-block-based deduplication methods that use shingles and set statistics, and gives evaluation criteria for them.

(3) During information extraction from web documents, page markup, irrelevant advertisements, images, and similar content can be filtered out. The node structure of the DOM tree is used to represent the page content in structured form, and the text of the web document is extracted from the nodes in preparation for subsequent answer extraction. An experimental scheme is designed and explained.
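Steps (2) and (3) above can be illustrated with a minimal sketch. It is not the thesis's implementation: it assumes word-level shingles, Jaccard similarity as the set statistic, and a similarity threshold of 0.8, none of which the abstract specifies. The names `TextExtractor`, `extract_text`, `shingles`, `jaccard`, and `deduplicate` are hypothetical. Text is pulled from the markup with Python's standard `html.parser` rather than a full DOM tree, and Chinese text would need a word segmenter or character-level shingles instead of `split()`.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def shingles(text, k=3):
    """Return the set of k-word shingles (assumed word-level) of a text."""
    tokens = text.lower().split()
    if len(tokens) < k:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(pages, threshold=0.8, k=3):
    """Keep the first page of each near-duplicate group.

    pages maps a URL to its HTML; threshold is an assumed cut-off.
    """
    kept = []  # (url, shingle set) of pages retained so far
    for url, html in pages.items():
        current = shingles(extract_text(html), k)
        if all(jaccard(current, other) < threshold for _, other in kept):
            kept.append((url, current))
    return [url for url, _ in kept]
```

Calling `deduplicate` on a dictionary of crawled pages returns the URLs to keep. The pairwise comparison is quadratic in the number of pages, so a larger crawl would typically replace it with a sketching technique such as MinHash.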
[Degree-granting institution]: Taiyuan University of Technology
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP393.092




Article ID: 2090875


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2090875.html

