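Design and Implementation of a Distributed Online Travel Search Crawler System
Topics: search engine + online travel; Source: Beijing University of Posts and Telecommunications, master's thesis, 2013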
[Abstract]: With the rapid development of Internet technology and the tourism industry, and in particular with rising living standards and the growth of online travel in recent years, more and more users book travel routes online. As the number of online travel route web pages has grown sharply, online travel search has become an important research direction in search engine development.

This thesis first introduces the research background and significance of a distributed online travel search crawler system and the current state of web crawler research. Drawing on the working principles of search engines and on the techniques and strategies used in distributed web crawlers, it analyzes in detail the key technologies the system requires, focusing on the distributed task allocation strategy and its granularity, URL deduplication, and the update strategy for online travel route web pages. Based on the characteristics of travel route web pages, it also proposes a discrimination algorithm designed specifically for identifying online travel route pages.

Building on these techniques and strategies, the thesis designs and implements a distributed online travel search crawler system that is motivated by users' need to search online travel route web pages and that collects travel route information from travel and vacation platform websites and ordinary travel agency websites. In the system design, the system is divided by function into four main modules: the control server, the crawler server, the index and retrieval server, and the database module. The structure of each module is described in detail, and class diagrams are given.
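The abstract names distributed task allocation (including its granularity) and URL deduplication as key techniques but does not say how the thesis implements them. The following is only a minimal sketch under stated assumptions: tasks are assigned at host granularity by hashing the host name over a fixed number of crawler servers, and deduplication uses an in-memory set of normalized URLs. The class `CrawlScheduler` and its methods are hypothetical names introduced for illustration, not the author's design.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch: host-granularity task assignment plus URL deduplication. */
class CrawlScheduler {
    private final int crawlerCount;                    // number of crawler servers (assumed fixed)
    private final Set<String> seen = new HashSet<>();  // normalized URLs already scheduled

    CrawlScheduler(int crawlerCount) {
        if (crawlerCount <= 0) {
            throw new IllegalArgumentException("crawlerCount must be positive");
        }
        this.crawlerCount = crawlerCount;
    }

    /** Lower-cases scheme and host, resolves "." and ".." segments, drops the fragment. */
    static String normalize(String url) {
        URI u = URI.create(url.trim()).normalize();
        if (u.getScheme() == null || u.getHost() == null) {
            throw new IllegalArgumentException("absolute URL required: " + url);
        }
        String scheme = u.getScheme().toLowerCase(Locale.ROOT);
        String host = u.getHost().toLowerCase(Locale.ROOT);
        String port = u.getPort() == -1 ? "" : ":" + u.getPort();
        String path = (u.getRawPath() == null || u.getRawPath().isEmpty()) ? "/" : u.getRawPath();
        String query = u.getRawQuery() == null ? "" : "?" + u.getRawQuery();
        return scheme + "://" + host + port + path + query;
    }

    /** Maps a host to a crawler index, so all pages of one site go to the same server. */
    int crawlerFor(String host) {
        return Math.floorMod(host.toLowerCase(Locale.ROOT).hashCode(), crawlerCount);
    }

    /**
     * Offers a URL for crawling.
     * Returns the index of the crawler server that should fetch it,
     * or -1 if an equivalent URL has already been scheduled.
     */
    int offer(String url) {
        String normalized = normalize(url);
        if (!seen.add(normalized)) {
            return -1; // duplicate: already scheduled
        }
        return crawlerFor(URI.create(normalized).getHost());
    }
}
```

Host-level granularity keeps all requests to one site on one server, which makes per-site politeness limits easy to enforce but can leave load unbalanced when a few sites dominate. An exact set also grows with the crawl; a probabilistic structure such as a Bloom filter trades a small false-positive rate for much lower memory use.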
Finally, the implementation of the control server and the crawler server is described in detail. The whole system is implemented in Java, with Tomcat + Apache + MySQL as the development environment.

To verify the feasibility of the distributed crawler system, a test environment of five servers was set up and the system's functionality and performance were tested. Tests of the online travel route page discrimination algorithm show that it can effectively determine whether a web page is an online travel route page, with an accuracy of about 90%. The run-time tests show that the system, whether running on a single server or as a whole, collects online travel route page information stably and efficiently and builds an inverted index on route titles, so that users can easily retrieve the travel route information they need through a Web-based graphical interface. The system thus meets its design goals and has practical value for the informatization of the tourism industry.
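The abstract states that the system builds an inverted index on route titles but gives no implementation details. As a rough illustration only, the sketch below maps each title term to the sorted set of route IDs whose titles contain it and answers a query by intersecting those sets. The class `TitleIndex`, the integer route IDs, and the tokenizer (lower-casing and splitting on non-alphanumeric characters) are assumptions made for this example; Chinese titles would need a word segmenter instead.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical sketch of an inverted index over travel route titles. */
class TitleIndex {
    // term -> IDs of routes whose title contains the term
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    /** Assumed tokenizer: lower-case, split on anything that is not a letter or digit. */
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    /** Adds one route title to the index under the given route ID. */
    void add(int routeId, String title) {
        for (String term : tokenize(title)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(routeId);
        }
    }

    /** Returns IDs of routes whose titles contain every query term (AND semantics). */
    SortedSet<Integer> search(String query) {
        SortedSet<Integer> result = null;
        for (String term : tokenize(query)) {
            SortedSet<Integer> ids =
                postings.getOrDefault(term, Collections.<Integer>emptySortedSet());
            if (result == null) {
                result = new TreeSet<>(ids);   // copy so the index itself is not modified
            } else {
                result.retainAll(ids);         // intersect with the next term's postings
            }
        }
        return result == null ? new TreeSet<>() : result;
    }
}
```

For example, after indexing "Beijing 5-Day Great Wall Tour" as route 1 and "Shanghai Weekend Tour" as route 2, a query for "beijing tour" returns only route 1, because route 2 lacks the term "beijing".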
[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP391.3