Design and Implementation of a Web Anti-Crawling System
Topic: anti-crawling. Focus: web crawlers. Source: Harbin Institute of Technology, master's thesis, 2015
[Abstract]: The company studied in this thesis is a leading online travel platform in China, and its flight ticket search and transaction platform is one of its core foundational platforms. The search covers more than 180,000 routes worldwide and can query more than 4,000 travel agency websites in real time; in 2014 the platform's ticket transactions exceeded 80 million tickets. As business volume keeps growing, however, the platform and its related business systems face pressure from information scraping by a variety of external sources. The large volume of crawling requests causes three serious problems: (1) data security: abnormal crawling traffic exposes key data to the risk of being obtained by competitors; (2) system performance: crawling requests exhaust server resources and seriously degrade users' search and transaction experience; (3) duplicated effort: different business systems each implement their own anti-crawling measures, of uneven quality, which wastes resources.
Based on an in-depth study of web crawlers and anti-crawling techniques, the thesis designs and implements a Web Anti-Crawling System (ACS). ACS provides a unified, high-quality anti-crawling service to the flight ticket search and transaction platform and the business projects under it. It implements anti-crawling strategies based on HTTP headers, a JS-encrypted string, an IP blacklist, and access-frequency control. Drawing on a detailed understanding of the platform's business, it also implements a behavior-pattern strategy tied to business logic, which further raises the cost of crawling. In addition, the design of the strategy interface and the anti-crawling service interface separates the API from its implementation, which makes the system extensible, reduces coupling with the business systems, and makes the service easy to integrate. ACS thus offers a solution to the crawling problems described above.
Functional and performance testing confirmed that the five anti-crawling strategies described in the thesis work as intended and meet the expected functional requirements, that ACS is loosely coupled with the other business systems and easy to integrate, and that the system serves requests stably and meets the expected performance requirements. ACS has now been put into production use.
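As a rough illustration of how strategies might plug into a single service interface, the Python sketch below implements three of the strategies named in the abstract (HTTP header check, IP blacklist, access-frequency control). The abstract does not give the actual interfaces, implementation language, or thresholds, so every name here (`Strategy`, `AntiCrawlingService`, `Request`, the required headers, the limits) is an assumption made for illustration. The JS-encrypted-string and behavior-pattern strategies are omitted because the abstract gives too little detail to sketch them.

```python
import time
from abc import ABC, abstractmethod
from collections import defaultdict, deque
from dataclasses import dataclass, field


@dataclass
class Request:
    """Minimal view of an incoming request (field names are assumptions)."""
    ip: str
    headers: dict
    timestamp: float = field(default_factory=time.time)


class Strategy(ABC):
    """Hypothetical strategy interface; callers depend only on allow()."""

    @abstractmethod
    def allow(self, request: Request) -> bool:
        ...


class HeaderStrategy(Strategy):
    """Reject requests missing headers that a normal browser would send."""

    def __init__(self, required=("User-Agent", "Referer")):
        self.required = required

    def allow(self, request):
        return all(request.headers.get(name) for name in self.required)


class IpBlacklistStrategy(Strategy):
    """Reject requests whose source IP is on a blacklist."""

    def __init__(self, blacklist):
        self.blacklist = set(blacklist)

    def allow(self, request):
        return request.ip not in self.blacklist


class RateLimitStrategy(Strategy):
    """Sliding window: at most max_requests per IP within window_seconds."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.history = defaultdict(deque)

    def allow(self, request):
        window = self.history[request.ip]
        # Drop timestamps that have fallen out of the window.
        while window and request.timestamp - window[0] >= self.window_seconds:
            window.popleft()
        if len(window) >= self.max_requests:
            return False
        window.append(request.timestamp)
        return True


class AntiCrawlingService:
    """Runs strategies in order; the first rejection blocks the request."""

    def __init__(self, strategies):
        self.strategies = list(strategies)

    def check(self, request):
        return all(s.allow(request) for s in self.strategies)


def build_service():
    # Example wiring; the blacklist entry and limits are made-up values.
    return AntiCrawlingService([
        HeaderStrategy(),
        IpBlacklistStrategy({"203.0.113.9"}),
        RateLimitStrategy(max_requests=2, window_seconds=1.0),
    ])


if __name__ == "__main__":
    browser = {"User-Agent": "Mozilla/5.0", "Referer": "https://example.com/"}
    print(build_service().check(Request("198.51.100.7", browser)))
```

Because a business system would call only `AntiCrawlingService.check`, a new strategy could be added by subclassing `Strategy` without touching the callers, which is one way to obtain the loose coupling the abstract describes.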
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree granted]: 2015
[Classification number]: TP393.092
Article ID: 1704044
Article link: http://sikaile.net/guanlilunwen/ydhl/1704044.html