A Lucene-Based Personalized Search Engine System

Published: 2018-07-13 15:35

[Abstract]: The rapid growth of the Internet has brought a knowledge explosion, and users facing massive amounts of data must rely on retrieval tools to find the information they need. Demand shapes the market: search applications, with Google and Baidu as the best-known examples, emerged in response and changed the Internet. Traditional retrieval technology is now quite mature in both theory and practice, and the open-source community has produced third-party API libraries such as Xapian and Lucene, as well as complete search solutions built on them. This thesis analyzes in detail the principles, components, and workflow of a search engine, introduces the theoretical basis of each module, and focuses on studying and improving the well-known API library Lucene, examining its module structure, file formats, indexing process, and result ranking. At present, mainstream search solutions do not support JavaScript, trading this off against the number of pages crawled and crawl speed. The fast JavaScript interpreters that have appeared in recent years make it possible to address this problem. This thesis adds a script interpretation engine to the crawler module to improve the crawler's understanding of JavaScript. Following the principle of operator overloading in C++, the JavaScript operators involved in URL assignment are overloaded as set operations, which allows URLs to be extracted from scripts. Comparative experiments were run on an intranet and on the external network, and the reasons for success on the intranet and failure on the external network are summarized. Link analysis is an important measure of web page quality. This thesis introduces the PageRank algorithm into Lucene's original scoring formula to make page scoring more accurate and thereby improve search result quality, and it proposes a simplified calculation method, built on the original power-iteration computation, that speeds up the calculation. Lucene is well designed and exposes many interfaces in each functional module to support user customization. Using these interfaces, the thesis implements the modified scoring and compares it experimentally with the original scoring formula. Finally, the thesis explores search engine personalization in practice. Traditional search engines are generally based on keyword matching, make little use of information about the individual user, and lack personalization features. The thesis describes how user information is collected and how a user model is built and used; guided by this theory and taking engineering difficulty into account, it designs and implements a simple personalized search module. Experimental results show that the personalization module designed and implemented in the thesis is effective.
[Degree-granting institution]: China Ship Research and Development Academy (中国舰船研究院)
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP391.3
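
The abstract mentions computing PageRank by power iteration and folding it into Lucene's relevance score, but it gives neither the thesis's simplified calculation nor its exact scoring formula. The sketch below therefore shows only the standard power iteration, followed by one hypothetical way to blend a link score with a text score. The function names, the `damping` and `weight` parameters, the log-based boost, and the toy link graph are all assumptions made for illustration; none of it uses Lucene's API.

```python
import math


def pagerank(links, damping=0.85, tol=1e-10, max_iter=100):
    """Standard power iteration over a dict {page: [pages it links to]}."""
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Rank held by pages without outlinks is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            for q in outs:
                new_rank[q] += damping * rank[p] / len(outs)
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:
            break
    return rank


def combined_score(text_score, page_rank, n_pages, weight=0.5):
    """Hypothetical blend: scale the text score by a damped link-based boost."""
    # page_rank * n_pages is 1.0 for a page of average rank.
    return text_score * (1.0 + weight * math.log1p(page_rank * n_pages))


# Toy link graph (assumed): "c" is linked from three pages, "d" from none.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4), round(combined_score(1.0, ranks[page], len(ranks)), 4))
```

Taking the logarithm of the rank is a common way to keep a few heavily linked pages from swamping text relevance. In a Lucene deployment a value like this would typically be stored per document and applied through a custom scoring hook, though the abstract does not say which mechanism the thesis uses.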

