Research and Design of a Hadoop-Based Distributed Vertical Search Engine
Published: 2018-08-07 14:59
[Abstract]: As the Internet has grown and network technology has matured, the number of websites and the volume of information online have become enormous. With network resources and the user base both expanding rapidly, traditional general-purpose search engines suffer from limited coverage, large and cluttered result sets, long update cycles, and query ambiguity.
At the same time, information is becoming more diverse and the retrieval needs of different users vary widely, so general-purpose search engines can no longer serve those needs in a targeted way. Moreover, most commercially successful search engines currently use a centralized architecture, which places heavy demands on the performance of a single server, is prone to failure, and scales poorly. Distributed vertical search emerged in response to these shortcomings, offering good performance, fault tolerance, easy scaling, fine-grained and precise classification, comprehensive and in-depth data, and timely updates.
"Distributed" means that multiple servers form a cluster and coordinate with one another. "Vertical search" means specialized search for a particular industry; it is characterized by being "specialized, precise, and deep", has a distinct industry focus, and is a subdivision and extension of general-purpose search engines. In this work, a distributed cluster was built with Hadoop, and the source code of the open-source search components Nutch and Solr was analyzed. Search engine theory and the key technologies of search engines were then studied in depth. Building on existing academic results, improvements were made to topic relevance determination and to the ranking of retrieved web pages. Domain ontology knowledge was used to build an ontology library for the steel domain and to expand user queries, making information location and lookup more precise. Finally, the source code of the open-source search components was modified to design and implement, on Hadoop, a prototype distributed vertical search engine. Its search results were compared with those of the commercial search engine Baidu. Analysis and evaluation of the experimental results show that the system has a clear topical focus and achieves higher precision than the general-purpose search engine.
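The ontology-based query expansion mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the toy ontology, its terms and relations, and the helper names `expand_query` and `to_or_query` are all hypothetical, and the OR-joined output only imitates the general style of Lucene/Solr query strings.

```python
# Hypothetical toy ontology: every term and relation here is invented for
# illustration and is not taken from the thesis's steel-domain ontology.
STEEL_ONTOLOGY = {
    "steel": {
        "synonyms": ["steel products"],
        "narrower": ["stainless steel", "carbon steel"],
    },
    "stainless steel": {"synonyms": ["inox"], "narrower": []},
}


def expand_query(query, ontology=STEEL_ONTOLOGY):
    """Return the normalized query term followed by related ontology terms."""
    key = query.strip().lower()
    terms = [key]
    entry = ontology.get(key)
    if entry is not None:
        # Append synonyms first, then narrower concepts, skipping duplicates.
        for related in entry["synonyms"] + entry["narrower"]:
            if related not in terms:
                terms.append(related)
    return terms


def to_or_query(terms):
    """Join terms into a quoted OR expression (assumed Lucene/Solr-like style)."""
    return " OR ".join(f'"{t}"' for t in terms)


if __name__ == "__main__":
    print(to_or_query(expand_query("steel")))
```

A term missing from the ontology is passed through unchanged, so expansion can only widen a query, never drop the user's original term.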
[Degree-granting institution]: Hebei University of Technology
[Degree level]: Master's
[Year degree awarded]: 2012
[Classification number]: TP391.3
Article ID: 2170395
Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2170395.html