Research and Implementation of a Vertical Search Engine Based on Lucene
Topic: vertical search engine + Lucene; Source: Beijing University of Technology, master's thesis, 2016
[Abstract]: A vertical search engine is a web information retrieval tool aimed at a single topic or industry. Its indexed data tends to be structured and its retrieval scope tends to be industry-specific, so it can quickly and precisely locate documents relevant to a query. This thesis studies vertical search engines built on the information retrieval library Lucene. Drawing on a close study of Lucene's basic ranking algorithm and of currently popular retrieval models, it proposes an improvement to Lucene's ranking algorithm that combines position-aware weighting with probabilistic ranking. Drawing on an analysis of the working principles and architecture of vertical search engines, it builds a small vertical search engine for the automotive domain, in which the improved ranking algorithm provides ranking support for the retrieval module. The main contributions are as follows.
First, to reflect how the position of a feature term within a document affects that term's importance, a position-aware query weighting algorithm is proposed. It uses the positions and frequencies at which query terms occur in a document to modify the TF-IDF term weight calculation, yielding position-aware query term weights.
Second, starting from Lucene's basic ranking algorithm, an improved method that combines position-aware weighting and probabilistic ranking is proposed. The position-aware query weight is first incorporated into the scoring formula, to account for the effect of query term positions on the document relevance score. Then, following the probability ranking principle, a document probability ranking value obtained from a naive Bayes classifier is also incorporated into the scoring formula.
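The abstract does not give the actual scoring formula, so the following is only a minimal sketch of how position-aware TF-IDF weighting and a probabilistic boost could be combined. The field weights in `FIELD_WEIGHTS`, the mixing parameter `beta`, the multiplicative fusion, and every function name are assumptions made for illustration. The naive Bayes classifier is not implemented here; `topic_prob` stands in for its output.

```python
import math
from collections import Counter

# Hypothetical field weights: a hit in the title counts more than one in the body.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}


def position_weighted_tf(term, doc):
    """Sum the term's occurrences per field, scaled by that field's weight."""
    return sum(
        FIELD_WEIGHTS[field] * Counter(tokens)[term]
        for field, tokens in doc.items()
    )


def idf(term, docs):
    """Smoothed inverse document frequency over the whole collection."""
    df = sum(1 for doc in docs if any(term in tokens for tokens in doc.values()))
    return 1.0 + math.log(len(docs) / (df + 1))


def position_score(query_terms, doc, docs):
    """Position-aware TF-IDF: damped weighted tf times idf, summed over query terms."""
    return sum(
        math.sqrt(position_weighted_tf(t, doc)) * idf(t, docs) for t in query_terms
    )


def fused_score(query_terms, doc, docs, topic_prob, beta=0.5):
    """Boost the position-aware score by a topic probability in [0, 1].

    topic_prob is a placeholder for a naive Bayes classifier's output;
    beta is a hypothetical mixing parameter.
    """
    return position_score(query_terms, doc, docs) * (1.0 + beta * topic_prob)


# Toy collection: each document is a mapping from field name to tokens.
docs = [
    {"title": ["sedan", "review"], "body": ["engine", "price", "sedan"]},
    {"title": ["engine", "guide"], "body": ["sedan", "engine", "oil"]},
    {"title": ["tire", "care"], "body": ["tire", "pressure"]},
]
scores = [fused_score(["sedan"], d, docs, topic_prob=0.9) for d in docs]
```

In this toy collection the first document ranks highest for the query "sedan" because the term appears in its title as well as its body, the second ranks lower because the term appears only in its body, and the third scores zero.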
Third, a small automotive vertical search engine is built, covering the collection of automotive product information, parsing of web documents, extraction of structured information, index construction, and retrieval of relevant documents. Retrieval results are ranked with the Lucene ranking algorithm that combines position-aware weighting and probabilistic ranking.
Fourth, experiments are designed to compare the search quality of the improved algorithm with that of Lucene's basic ranking algorithm. The results show that, relative to the basic algorithm, the improved algorithm raises retrieval precision substantially, while recall and F-measure remain fairly stable and both improve to varying degrees. The improved ranking algorithm effectively addresses two weaknesses of the original: its disregard for query term position and its lack of theoretical grounding. The algorithm is highly independent and reusable, so it can provide ranking support for vertical search engines on other topics. The automotive vertical search engine has a simple architecture and simple function interfaces, which makes it easier to update and extend the system's modules later.
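The three evaluation measures named above can be computed per query as in the sketch below. It assumes set-based relevance judgments and the balanced F-measure (F1); the abstract does not say which F variant the experiments used, and the function name and document IDs are hypothetical.

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F1) for one query.

    retrieved: document IDs returned by the engine.
    relevant:  document IDs judged relevant for the query.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # F1 is the harmonic mean of precision and recall; define it as 0 with no hits.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical query: 4 documents returned, 3 relevant overall, 2 in common.
p, r, f = precision_recall_f1(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
```

Comparing two ranking algorithms then amounts to running the same queries through both and comparing these values, typically averaged over all queries.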
[Degree-granting institution]: Beijing University of Technology
[Degree level]: Master's
[Year degree granted]: 2016
[Classification number]: TP391.3
Article ID: 1836056
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1836056.html