Design and Implementation of a Library and Information Science Blog Search Engine Based on Nutch

Published: 2018-05-12 15:59

Topic: Lucene + Nutch; Reference: Zhengzhou University, master's thesis, 2011

[Abstract]: With the rise of Web 2.0 concepts and technologies, Internet users worldwide have come to enjoy a rich variety of interactive information services, of which blogs are a typical example. Against this background, students, researchers, and others in library and information science (LIS) have started blogs to exchange information. However, LIS blogs are scattered and their posts vary widely in quality, which makes them inconvenient for people in the field to use. Although topical search engines such as Google Blog Search and Baidu Blog Search have solved some of these problems, they still cannot meet the needs of LIS users. This thesis addresses that problem by attempting to build an LIS blog search engine. It first analyzes search engine technologies and LIS blogs; second, it introduces the open-source search engine Nutch and draws up a design for an LIS blog search engine based on Nutch; third, it develops the corresponding topical search engine according to that design; finally, it evaluates the performance of that search engine experimentally. The main contents of each chapter are as follows:
1 Introduction. This chapter presents the background and significance of the topic, the state of research in China and abroad, the research methods adopted, and the innovations of the thesis.
2 Search engine technologies and analysis of LIS blogs. This chapter first analyzes how general search engines and topical search engines work, and points out that they differ mainly in two parts: the information collection module and the web page content parsing module. A topical search engine improves the web crawler module and adds a topic thesaurus to the content parsing module to filter information. The chapter then analyzes LIS blogs from three angles (blog site structure, blog page content, and the link structure between blogs) to give a fuller picture of them.
3 Introduction to Nutch, and configuration and operation of the Nutch runtime environment. This chapter first introduces the basics and the framework of the open-source search engine Nutch, giving a preliminary understanding of it. It then configures the Nutch runtime environment and explains its workflow in detail, giving a deeper understanding of how Nutch operates and how it is structured.
4 Design of the Nutch-based LIS blog search engine. Following software engineering practice, this chapter first analyzes the goals the search engine system must achieve, the problems it must solve, and its feasibility. It then sets out the user requirements through use case diagrams and sequence diagrams, and finally gives the overall design and the detailed design of the system.
5 Implementation of the core modules of the Nutch-based LIS blog search engine. This chapter implements the three core modules of the detailed design. First, it implements the topic resource discovery module, drawing on the information retrieval theory and practice of library and information science. Second, it implements the collection strategy of the crawler module through software analysis. Finally, it improves the retrieval module according to user requirements.
6 Experimental testing, analysis, and conclusions. This chapter first sets a series of parameters and carries out six rounds of experimental tests based on them, then analyzes the test results. Finally, the author summarizes the features and shortcomings of the LIS blog search engine and outlines future improvements.
[Degree-granting institution]: Zhengzhou University
[Degree level]: Master's
[Year degree granted]: 2011
[Classification number]: G250.73
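
The abstract describes a topic thesaurus added to the page content parsing module to filter information. The thesis's actual implementation is not given on this page, so the following is only a minimal sketch of the general idea, assuming a weighted term list and a density threshold. The names `TOPIC_THESAURUS`, `topic_score`, and `is_on_topic`, the example terms, and the threshold value are all hypothetical, and nothing here uses the Nutch API.

```python
import re

# Hypothetical topic thesaurus: term -> weight (illustrative entries only).
TOPIC_THESAURUS = {
    "library": 1.0,
    "librarian": 1.0,
    "cataloging": 1.5,
    "information science": 2.0,
    "information retrieval": 2.0,
}


def topic_score(text, thesaurus=TOPIC_THESAURUS):
    """Return the weighted count of thesaurus terms per word of text."""
    lowered = text.lower()
    words = re.findall(r"\w+", lowered)
    if not words:
        return 0.0
    total = 0.0
    for term, weight in thesaurus.items():
        # Match whole terms only, so "library" does not match "librarian".
        pattern = r"\b" + re.escape(term) + r"\b"
        total += weight * len(re.findall(pattern, lowered))
    return total / len(words)


def is_on_topic(text, threshold=0.05):
    """Keep a page only if its thesaurus term density reaches the threshold."""
    return topic_score(text) >= threshold
```

A crawler could call `is_on_topic` on the parsed text of each fetched post and skip indexing the pages that fail. Chinese blog text would first need word segmentation, which this sketch omits.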


Article ID: 1879236


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1879236.html
