Design and Implementation of a Crawler-Based Sohu News Search Engine

Published: 2018-01-23 12:29

  Keywords: search engine; ranking algorithm; Lucene; PageRank; Hadoop. Source: Sun Yat-sen University, master's thesis, 2012. Document type: degree thesis


[Abstract]: Information on the Internet is growing at a remarkable rate, and search engine technology has become a focus of attention for Internet users who need to find useful information quickly in massive amounts of data. The news search engine described in this thesis was developed against that background.

For ordinary users, commercial search engines largely meet their needs. For specific users, however, such as small and medium-sized enterprises or research institutions, commercial Internet search engines fall short: their results are not well targeted to a particular domain, and they cannot be configured on demand. Open-source software such as Lucene addresses this gap. Because it is fully open source, developers can build search engines tailored to a specific domain. The system in this thesis is designed and implemented on top of such open-source software.

The thesis first introduces the history, trends, and classification of search engines. It then presents the requirements analysis, specifying the functional and non-functional requirements of the system, goes on to design the system framework and architecture, and finally designs and implements each functional module in detail.

The system is a crawler-based Sohu news search engine. Built by customizing and extending existing open-source components, it comprises a Heritrix data crawling module, an HTMLParser data preprocessing module, a module that generates the Lucene index and the Oracle database data, and a Lucene-based core search module. To improve the user experience, the thesis combines Lucene's text matching algorithm with the PageRank algorithm, takes into account the effect of time on news search, and proposes an improved page ranking algorithm. On this basis, it designs and implements a scheme for running the algorithm using Lucene together with Hadoop distributed storage and distributed computing, so that the search results presented to the user are more accurate and more reasonable.
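The abstract does not give the formula of the improved ranking algorithm, so the following is only an illustrative sketch of how a text-match score, a PageRank value, and a time factor could be combined. Everything specific in it is an assumption made for illustration and is not taken from the thesis: the function names `pagerank`, `time_decay`, and `combined_score`, the damping factor, the 24-hour half-life, the weighted-sum form, and the weights. `text_score` stands in for a relevance score such as the one Lucene returns, assumed here to be normalized to [0, 1].

```python
def pagerank(links, damping=0.85, iterations=50):
    """Standard power-iteration PageRank over a dict {page: [outlinks]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without outlinks is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


def time_decay(age_hours, half_life_hours=24.0):
    """Hypothetical freshness factor: halves every `half_life_hours`."""
    return 0.5 ** (age_hours / half_life_hours)


def combined_score(text_score, rank_score, age_hours, weights=(0.6, 0.2, 0.2)):
    """Hypothetical weighted sum of text relevance, link rank, and freshness.

    `text_score` and `rank_score` are assumed to be normalized to [0, 1].
    """
    w_text, w_rank, w_time = weights
    return w_text * text_score + w_rank * rank_score + w_time * time_decay(age_hours)


if __name__ == "__main__":
    # Toy link graph between three news pages (made-up identifiers).
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    ranks = pagerank(graph)
    top = max(ranks.values())
    for page, age in [("a", 2.0), ("b", 48.0), ("c", 12.0)]:
        # Divide by the largest rank so the link score also lies in [0, 1].
        print(page, round(combined_score(0.5, ranks[page] / top, age), 4))
```

With a weighted sum like this, two articles with the same text match and link rank are ordered by recency, which is the kind of behavior a news search engine would want. A multiplicative combination would be an equally plausible choice.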

[Degree-granting institution]: Sun Yat-sen University
[Degree level]: Master's
[Year degree granted]: 2012
[Classification number]: TP391.3; TP311.52



