
Research and Implementation of a Vertical Search Engine for Tourism Information Based on Nutch and Solr

Published: 2018-07-12 11:12

  Topics: vertical search engine + tourism information. Source: Hainan University, master's thesis, 2016


[Abstract]: With the rapid growth of the Internet, the World Wide Web has become the carrier of a vast amount of information, and search engines, as an important tool for obtaining and using that information, have become users' entry point and guide to the Web. Traditional general-purpose search engines collect data from the entire Web without discrimination. Their coverage is broad, but they return too many results, which raises the filtering cost for users with specific needs. A vertical search engine collects only pages related to one particular domain, so users can reach information in the domain they care about more precisely and more quickly. A vertical search engine for the tourism domain lets travelers, tourism professionals, and other interested parties find tourism information quickly. Nutch is an open-source Java web crawler under Apache, used mainly to collect web pages and then analyze the crawled pages. Combined with the open-source full-text indexing framework Solr, it can serve as the basis of a prototype search engine system. Building on a study of these tools, this work modifies the relevant functional modules and improves the relevant algorithms to implement a vertical search engine for the tourism domain. The main research content is as follows:
(1) First, the research background and significance are established, and the working principles of search engines, their history, and their two classification schemes are reviewed. The shortcomings of general-purpose search engines and the advantages of vertical search engines are described. Then, after analyzing the key points of vertical search engines, a topic crawler model for tourism information is proposed.
(2) The most significant difference between a vertical search engine and a general-purpose one is the topical focus of the collected content. A tourism topic lexicon is built from a selected set of sample documents using document frequency (DF) combined with manual screening. During crawling, a topic relevance algorithm uses this lexicon to judge each page's relevance to the topic, and pages with low relevance to tourism are filtered out.
(3) IK-Analyzer is introduced in the indexing stage to improve support for Chinese word segmentation. Its dictionary is extended with the contents of the topic lexicon, and its stop-word list is expanded. Because the quality of the ranking algorithm is closely tied to the user's query experience, page scoring is improved by combining the PageRank algorithm with topic relevance, so that ranking takes into account both the authority and the topicality of a page.
(4) A user-friendly search interface is designed and implemented, drawing on the UI design of the major search engines, to improve the user experience.
(5) After an in-depth study of the working principles and source code of Nutch and Solr, the author proposes original ideas and solutions for topic-focused collection in the tourism domain and carries out secondary development on both tools to implement a vertical search engine system for tourism information based on Nutch and Solr. A Hadoop distributed platform is set up on servers, and the system is deployed, run, and tested.
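The abstract names three techniques without giving formulas: building a topic lexicon from document frequency (DF), filtering pages by topic relevance, and ranking by PageRank combined with topic relevance. The following is a minimal sketch of how these pieces could fit together, not the thesis's actual algorithms. Every function name, the `min_df` and `threshold` cut-offs, the hit-ratio relevance measure, and the linear weight `alpha` are assumptions made for illustration. Pages are assumed to be already segmented into tokens (for Chinese text, by a tokenizer such as IK-Analyzer).

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, how many sample documents contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # a term counts once per document
    return df


def candidate_lexicon(docs, min_df):
    """Return terms whose DF reaches min_df (hypothetical cut-off).

    These are only candidates: the abstract says DF is combined with
    manual screening, which this sketch does not model.
    """
    df = document_frequency(docs)
    return {term for term, count in df.items() if count >= min_df}


def topic_relevance(tokens, lexicon):
    """Assumed relevance measure: share of page tokens found in the lexicon."""
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in lexicon)
    return hits / len(tokens)


def is_on_topic(tokens, lexicon, threshold=0.3):
    """Keep a page only if its relevance reaches the (assumed) threshold."""
    return topic_relevance(tokens, lexicon) >= threshold


def combined_score(pagerank, relevance, alpha=0.5):
    """Assumed linear blend of page authority and topicality."""
    return alpha * pagerank + (1 - alpha) * relevance


if __name__ == "__main__":
    samples = [
        ["beach", "hotel", "ticket"],
        ["hotel", "flight"],
        ["hotel", "beach", "stock"],
    ]
    lexicon = candidate_lexicon(samples, min_df=2)
    page = ["hotel", "beach", "price", "map"]
    relevance = topic_relevance(page, lexicon)
    print(sorted(lexicon), relevance, is_on_topic(page, lexicon))
    print(combined_score(pagerank=0.2, relevance=relevance))
```

In a Nutch and Solr deployment, logic of this kind would have to be wired into the crawler's filtering stage and the index-time or query-time scoring. The abstract does not say which extension points the thesis uses, so that integration is left out here.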
[Degree-granting institution]: Hainan University
[Degree level]: Master's
[Year degree conferred]: 2016
[Classification number]: TP391.3

Article ID: 2116971


Link to this article: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2116971.html

