Research on Graph-Based Document Retrieval Techniques
Published: 2018-08-23 14:29
[Abstract]: With the development of computer technology and the Internet, information retrieval has become an indispensable part of daily work and life, and it has drawn close attention from the academic community. The use of graph data has also grown rapidly in recent years: as the Internet expands and data volumes increase, more and more applications produce graph data, and research on graph data has become a very active area. The main task of document retrieval is to compute the similarity between the user's query and each document and to return the documents ranked by that similarity. The vector space model is a basic model in information retrieval and the most commonly used model in document retrieval; many popular document retrieval systems are still built around it. However, the vector space model treats terms as mutually independent and so discards the relationships between them, whereas terms in real text are usually related. As a result, a retrieval system built on the vector space model can assign a high similarity score to a document whose meaning is only weakly related to the query, or even opposite to it. Graphs have been widely adopted in recent years largely because a graph expresses the relationships between its nodes intuitively through its edges. To address the problem above, this thesis proposes a graph-based document retrieval method. Both the query and the documents are represented as graphs, and the similarity between a query and a document is quantified by computing the similarity between the query graph and the document graph.
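The limitation of the vector space model described above can be illustrated with a minimal sketch. It assumes raw term-frequency vectors, whitespace tokenisation, and cosine similarity; the function name and the example sentences are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between raw term-frequency vectors of two texts."""
    # Bag-of-words vectors: word order and word relations are discarded here.
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    # Dot product over the terms that occur in both texts.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query = "dog bites man"
opposite_meaning = "man bites dog"   # same terms, reversed roles
unrelated = "stock prices fall"      # no shared terms

print(cosine_similarity(query, opposite_meaning))
print(cosine_similarity(query, unrelated))
```

Because both sentences contain the same three terms, the first similarity is (up to floating-point rounding) the maximum value of 1 even though the roles of "dog" and "man" are reversed, which is the kind of mismatch the abstract attributes to term independence.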
First, building on results from dependency parsing and part-of-speech tagging in natural language processing, the thesis proposes a dependency-based graph model for text representation, which represents the query and the document text as graphs. To limit the cost of graph computation, it introduces the concept of a document semantic unit and builds graphs at that granularity. Unlike earlier work in information retrieval, which treats the query and the document as entities of the same kind, the proposed method places the query and the document at different levels. Second, drawing on graph theory, the thesis proposes a graph similarity algorithm based on the generalized maximum common subgraph, which yields the similarity between the query graph and the text graph. Third, using the similarities between the query and each semantic unit obtained in the previous step, and considering that semantic units at different positions in a document may differ in importance, the thesis proposes document scoring methods that compute the query-document similarity used to rank and return results. Finally, experiments on one Chinese and one English document collection analyze result quality under the different document scoring methods and compare it with existing methods and techniques. The results show that the proposed method returns higher-quality document retrieval results.
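The three steps above can be sketched roughly as follows. This is not the thesis's algorithm: the abstract gives no details of the generalized maximum common subgraph or of the scoring formulas, so the sketch swaps in a much simpler stand-in. It assumes each graph is a set of labelled dependency edges written by hand (a real system would obtain them from a dependency parser), measures similarity as the share of identical edges, and combines per-unit similarities with hypothetical position weights.

```python
from typing import List, Set, Tuple

# One dependency edge: (head word, relation label, dependent word).
Edge = Tuple[str, str, str]


def graph_similarity(query: Set[Edge], unit: Set[Edge]) -> float:
    """Share of identical labelled edges, normalised by the larger graph.

    A crude stand-in for a common-subgraph measure, not the thesis's method.
    """
    if not query or not unit:
        return 0.0
    return len(query & unit) / max(len(query), len(unit))


def score_document(query: Set[Edge], units: List[Set[Edge]],
                   weights: List[float]) -> float:
    """Position-weighted average of per-unit similarities (weights assumed)."""
    if len(units) != len(weights):
        raise ValueError("need exactly one weight per semantic unit")
    total = sum(weights)
    if total == 0:
        return 0.0
    weighted = sum(w * graph_similarity(query, u)
                   for w, u in zip(weights, units))
    return weighted / total


# Query graph for "dog bites man".
query = {("bites", "nsubj", "dog"), ("bites", "obj", "man")}

# Two documents, each split into two semantic units (hypothetical data).
doc_a = [{("bites", "nsubj", "dog"), ("bites", "obj", "man")},
         {("ran", "nsubj", "man")}]
doc_b = [{("bites", "nsubj", "man"), ("bites", "obj", "dog")},
         {("ran", "nsubj", "dog")}]

# Assumption: the first unit of a document counts twice as much as the second.
weights = [2.0, 1.0]

ranked = sorted([("doc_a", doc_a), ("doc_b", doc_b)],
                key=lambda item: score_document(query, item[1], weights),
                reverse=True)
print([name for name, _ in ranked])
```

In this toy setting the document that reverses the roles of "dog" and "man" shares no labelled edge with the query and scores zero, whereas a bag-of-words comparison would treat the two sentences as identical. The edge-overlap measure and the weights are placeholders that a fuller implementation would replace with the thesis's own definitions.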
[Degree-granting institution]: Harbin Engineering University
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP391.3
Article ID: 2199350
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2199350.html