Research and Implementation of a Vertical Search Engine Based on Lucene
Topic: vertical search engine + Lucene; Source: Beijing University of Technology, master's thesis, 2016
[Abstract]: A vertical search engine is a web information retrieval tool that targets a specific topic or industry. Its index data tends to be structured and its retrieval scope tends to be industry-specific, so it can locate documents relevant to a query quickly and precisely. This thesis studies vertical search engines built on the information retrieval library Lucene. Based on an in-depth study of Lucene's basic ranking algorithm and of currently popular retrieval models, it proposes an improved Lucene ranking algorithm that combines position relevance with probabilistic ranking. Based on an analysis of the basic working principles and architecture of vertical search engines, it builds a small vertical search engine for the automotive domain, in which the improved Lucene ranking algorithm provides ranking support for the retrieval module. The main work of the thesis is as follows. First, to reflect how the position of a feature term within a document affects the term's importance, a position-dependent query weighting algorithm is proposed. It uses the positions at which query terms occur in a document, together with their frequencies, to modify the TF-IDF computation of term weights and obtain position-dependent query term weights. Second, building on Lucene's basic ranking algorithm, an improved method that combines position relevance with probabilistic ranking is proposed. The position-dependent query weight is first incorporated into the scoring formula, to account for the effect of query term positions on a document's relevance score. Then, following the probability ranking principle, a document probability ranking value derived from a naive Bayes classifier is also incorporated into the scoring formula.
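The abstract does not give the scoring formula itself, so the following is only a minimal sketch of the general idea: term frequencies are scaled by where the term occurs, multiplied by IDF, and then blended with a document probability. The field weights in `FIELD_WEIGHTS`, the mixing weight `alpha`, the linear blend, and the `topic_prob` input (standing in for the naive Bayes output) are all assumptions made for illustration, not values or formulas taken from the thesis or from Lucene.

```python
import math
from collections import Counter

# Hypothetical position weights per field; the thesis does not state its values.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}


def position_weighted_tf(term, doc):
    """Count a term in each field, scaling each count by the field's weight."""
    return sum(
        FIELD_WEIGHTS.get(field, 1.0) * Counter(tokens)[term]
        for field, tokens in doc.items()
    )


def idf(term, docs):
    """Smoothed inverse document frequency over the whole collection."""
    df = sum(1 for d in docs if any(term in tokens for tokens in d.values()))
    return math.log((1 + len(docs)) / (1 + df)) + 1.0


def fused_score(query, doc, docs, topic_prob, alpha=0.8):
    """Blend a position-weighted TF-IDF score with a document probability.

    topic_prob is an externally supplied stand-in for P(topic | doc) from a
    naive Bayes classifier; alpha is an assumed mixing weight.
    """
    text_score = sum(position_weighted_tf(t, doc) * idf(t, docs) for t in query)
    return alpha * text_score + (1 - alpha) * topic_prob


# Toy collection: the query term appears in the title of one document
# and only in the body of the other.
docs = [
    {"title": ["suv", "review"], "body": ["engine", "price"]},
    {"title": ["engine", "guide"], "body": ["suv", "price"]},
]
ranked = sorted(
    docs, key=lambda d: fused_score(["suv"], d, docs, topic_prob=0.5), reverse=True
)
```

With equal probabilities for both documents, the one that mentions the query term in its title ranks first, which is the behaviour a position-dependent weight is meant to produce. In a real Lucene deployment this logic would more plausibly live in a custom similarity or field boosts rather than in application code.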
Third, a small automotive vertical search engine is built, covering the collection of automotive product information, parsing of web documents, extraction of structured information, construction of index files, and retrieval of relevant documents. Retrieval results are ranked with the Lucene ranking algorithm that combines position relevance and probabilistic ranking. Fourth, experiments are designed to compare the search quality of the improved algorithm with that of Lucene's basic ranking algorithm. The results show that, compared with the basic algorithm, the improved algorithm raises retrieval precision considerably, while recall and F-measure remain stable and improve to varying degrees. The improved ranking algorithm effectively addresses the original algorithm's problems of query position relevance and theoretical support, and improves retrieval precision. The algorithm is highly self-contained and reusable, and can provide ranking support for vertical search engines on other topics. The automotive vertical search engine has a simple architecture and function interfaces, which makes it easier to update and extend each module later.
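The evaluation above is reported in terms of precision, recall and F-measure. As a reminder of how these are usually computed for a single query, here is a small sketch. It assumes set-based metrics and the balanced F1 variant; the abstract does not say which F-measure or cutoff the thesis used, and the document identifiers are hypothetical.

```python
def precision_recall_f1(retrieved, relevant):
    """Compute set-based precision, recall and F1 for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    # F1 is the harmonic mean of precision and recall; define it as 0 with no hits.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical result list and relevance judgments for one query.
p, r, f = precision_recall_f1(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5"],
)
```

Here two of the four retrieved documents are relevant and two of the three relevant documents are retrieved, so precision is 1/2 and recall is 2/3. Comparing two ranking algorithms would typically mean averaging these values over a set of test queries.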
[Degree-granting institution]: Beijing University of Technology
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP391.3
Article ID: 1836056
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1836056.html