Design and Implementation of a Distributed Online Travel Search Crawler System

Published: 2018-05-04 12:21

Topic: search engine + online travel; Source: Beijing University of Posts and Telecommunications, 2013 master's thesis

[Abstract]: With the rapid development of Internet technology and the tourism industry, and in particular with rising living standards and the growth of online travel in recent years, more and more users book travel routes online. As the number of online travel route web pages has grown sharply, online travel search has become an important research direction for search engines. This thesis first introduces the research background and significance of a distributed online travel search crawler system and the state of web crawler research. Drawing on the working principles of search engines and on the techniques and strategies of distributed web crawlers, it then analyzes in detail the key technologies the system requires, with emphasis on the distributed task allocation strategy and its granularity, URL deduplication, and the update strategy for online travel route pages. Based on the characteristics of travel route pages, it also proposes an algorithm for deciding whether a page is an online travel route page. Building on these technologies and strategies, the thesis designs and implements a distributed online travel search crawler system that is motivated by users' need to search online travel route pages and that collects route information from travel and vacation platform sites and from ordinary travel agency sites. In the system design, the system is divided by function into four main modules: the control server, the crawler server, the index and retrieval server, and the database module. The structure of each module is described in detail, and class diagrams are given.
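Two of the techniques named in the abstract, task allocation across crawler nodes and URL deduplication, can be illustrated with a short Java sketch. It is not the thesis's implementation: host-level hashing as the allocation granularity, the Bloom filter as the deduplication structure, and all class and method names (CrawlFrontierSketch, assignNode, markIfNew) are assumptions chosen for illustration.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.BitSet;
import java.util.Locale;

// Illustrative sketch only; every name here is hypothetical, not from the thesis.
final class CrawlFrontierSketch {
    private final BitSet bits;   // Bloom filter bit array
    private final int size;      // number of bits in the filter
    private final int hashCount; // number of probes per URL

    CrawlFrontierSketch(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // Assumed host-level granularity: all URLs of one site go to the same node.
    static int assignNode(String url, int nodeCount) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        return Math.floorMod(host.toLowerCase(Locale.ROOT).hashCode(), nodeCount);
    }

    // Records the URL and returns true if it was definitely not seen before.
    // A false result means "probably seen": Bloom filters allow false positives.
    boolean markIfNew(String url) {
        byte[] data = url.getBytes(StandardCharsets.UTF_8);
        int h1 = fnv1a(data, 0x811C9DC5);
        int h2 = fnv1a(data, 0x01000193) | 1; // odd step for double hashing
        boolean isNew = false;
        for (int i = 0; i < hashCount; i++) {
            int idx = Math.floorMod(h1 + i * h2, size);
            if (!bits.get(idx)) {
                isNew = true;
                bits.set(idx);
            }
        }
        return isNew;
    }

    // FNV-1a style hash with a caller-supplied starting value.
    private static int fnv1a(byte[] data, int seed) {
        int h = seed;
        for (byte b : data) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        return h;
    }

    public static void main(String[] args) {
        CrawlFrontierSketch seen = new CrawlFrontierSketch(1 << 16, 3);
        String url = "http://example.com/route/123"; // placeholder URL
        System.out.println("node = " + assignNode(url, 5));
        System.out.println("first visit new?  " + seen.markIfNew(url));
        System.out.println("second visit new? " + seen.markIfNew(url));
    }
}
```

Hashing on the host rather than the full URL keeps all pages of one site on the same node, which simplifies per-site rate limiting at the cost of uneven load when sites differ greatly in size.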
Finally, the implementation of the control server and the crawler server is described in detail. The whole system is implemented in Java, with Tomcat + Apache + MySQL as the development environment. To verify the feasibility of the distributed crawler system, a test environment of five servers was set up, and functional and performance tests were carried out. Tests of the page discrimination algorithm show that it can effectively decide whether a web page is an online travel route page, with an accuracy of about 90%. The runtime tests show that the system, whether run on a single server or as a whole, collects online travel route page information stably and efficiently and builds an inverted index on route titles, so that users can conveniently retrieve the route information they need through a web-based graphical interface. The system thus meets its design goals and has important practical value for the informatization of the tourism industry.
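The inverted index on route titles mentioned in the abstract can be pictured with the following Java sketch. It is only an illustration under stated assumptions: the class name TitleIndexSketch, the in-memory map, and the simple split on non-letter, non-digit characters are hypothetical, and Chinese titles would need a proper word segmenter in place of this tokenizer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Illustrative sketch only; names and tokenization are assumptions, not the thesis's design.
final class TitleIndexSketch {
    // term -> sorted ids of the route pages whose title contains the term
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();

    // Indexes one route page by the terms of its title.
    void add(int docId, String title) {
        for (String term : tokenize(title)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // AND query: returns ids of titles that contain every query term.
    SortedSet<Integer> search(String query) {
        SortedSet<Integer> result = null;
        for (String term : tokenize(query)) {
            SortedSet<Integer> ids = postings.get(term);
            if (ids == null) {
                return new TreeSet<>(); // one term is missing, so nothing matches
            }
            if (result == null) {
                result = new TreeSet<>(ids);
            } else {
                result.retainAll(ids);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    // Lowercases and splits on anything that is not a letter or digit.
    private static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        TitleIndexSketch index = new TitleIndexSketch();
        index.add(1, "Beijing 5-day tour");       // made-up sample titles
        index.add(2, "Beijing to Shanghai tour");
        index.add(3, "Hainan beach holiday");
        System.out.println(index.search("beijing tour"));
    }
}
```

Keeping each posting list sorted makes the intersection for multi-term queries straightforward; a production system would more likely delegate this to an indexing library than hand-roll it.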
[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree awarded]: 2013
[Classification number]: TP391.3


Article number: 1843041


Article link: http://sikaile.net/kejilunwen/sousuoyinqinglunwen/1843041.html
