
A Personalized Search Engine System Based on Lucene

Published: 2018-07-13 15:35
Abstract: The rapid growth of the Internet has produced an explosion of knowledge, and users facing massive amounts of data must rely on retrieval tools to find the information they need. Demand shapes the market: search applications, with Google and Baidu as the leading examples, emerged in response and changed the Internet.

Traditional retrieval technology is already quite mature in both theory and practice. The open-source community has produced third-party API libraries such as Xapian and Lucene, as well as complete search solutions built on those libraries. This thesis analyzes the principles, components, and workflow of a search engine in detail and introduces the theoretical basis of each module. It focuses on studying and improving the well-known library Lucene, analyzing its module structure, file formats, indexing process, and result ranking.

At present, mainstream search solutions do not support JavaScript, a compromise made with respect to the number of pages crawled and crawl speed. The fast JavaScript interpreters that have appeared in recent years make it possible to address this problem. The thesis introduces a script interpretation engine into the crawler module to improve the crawler's understanding of JavaScript. Modeled on the principle of operator overloading in C++, the JavaScript operators involved in URL assignment are overloaded as set operations, which allows URLs to be extracted from scripts. Comparative experiments were run on an intranet and on the external network, and the reasons for success on the intranet and failure on the external network are summarized.

Link analysis is an important parameter for measuring web page quality. The thesis introduces the PageRank algorithm into Lucene's original scoring formula to make page scores more accurate and thereby improve the quality of search results. It also proposes a simplified calculation on top of the original power-iteration computation, which speeds up the calculation. Lucene is well designed and exposes a large number of interfaces in each functional module for user customization. Using these interfaces, the thesis implements the approach and compares it experimentally with the original scoring formula.

Finally, the thesis explores search engine personalization in practice. Traditional search engines are generally based on keyword matching; they do not make full use of user-specific information and lack personalization features. The thesis describes how user information is collected and how a user model is built and used. Guided by this theory and taking engineering difficulty into account, it designs and implements a simple personalized search module. Experimental results show that the personalization module designed and implemented in the thesis is effective.
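
To make the link-analysis idea concrete, the following is a minimal sketch of standard power-iteration PageRank followed by one possible way to blend the result with a text relevance score. It is not the simplified calculation proposed in the thesis, and it is not Lucene's scoring formula; the abstract gives neither. The function names `pagerank` and `combined_score`, the damping factor of 0.85, and the multiplicative blend with a `weight` parameter are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, tol=1e-8, max_iter=200):
    """Standard power-iteration PageRank over a dict {page: [outlinked pages]}.

    Illustrative only: the damping factor and stopping rule are assumptions.
    """
    # Collect every page that appears as a source or as a link target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(max_iter):
        # Rank held by pages without outlinks is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}

        # Each page passes an equal share of its rank along each outlink.
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    new_rank[t] += share

        # Stop once the total change between iterations is small enough.
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:
            break
    return rank


def combined_score(text_score, page_rank, weight=0.5):
    """Hypothetical blend: boost a text relevance score by a page's PageRank."""
    return text_score * (1.0 + weight * page_rank)


if __name__ == "__main__":
    # A tiny hypothetical link graph: page "d" has no inbound links.
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    ranks = pagerank(graph)
    for page in sorted(ranks, key=ranks.get, reverse=True):
        print(page, round(ranks[page], 4), round(combined_score(1.0, ranks[page]), 4))
```

A multiplicative boost keeps a page with zero text relevance at zero, so link popularity alone cannot surface an irrelevant page. An additive blend would behave differently, and which form the thesis uses is not stated in the abstract.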
Degree-granting institution: China Ship Research Institute
Degree level: Master's
Year degree awarded: 2013
Classification number: TP391.3



