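Research on Web Robot Detection Technology Based on Behavior Patterns
Topic: web crawler detection. Focus: behavior patterns. Source: Wuhan Research Institute of Posts and Telecommunications, master's thesis, 2017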
[Abstract]: A Web Robot (web crawler) is a program that automatically retrieves resources of all kinds from the Internet. Since crawlers were first formally deployed in 1993, they have brought convenience to ordinary users and Internet professionals alike; it was the emergence of Web Robots that made purposeful retrieval from the ever-growing body of Internet data possible. As Internet technology has developed and become integrated into every aspect of society, the volume of online data has grown rapidly, and crawler technology has kept evolving to meet users' differing needs. Crawlers are commonly divided into general-purpose Robots, focused Robots, incremental Robots, Deep Robots, Topic Robots, and distributed Robots. In practice, large crawler systems often combine several of these techniques, which makes their architecture and behavior increasingly complex.

However, the widespread use of crawlers to retrieve network information and resources has also created risks and negative side effects. First, Web Robots frequently attempt to fetch all kinds of resources from a website, which degrades server performance and creates a risk of information leakage. Second, crawler traffic is recorded in the site's logs, which makes data mining based on those logs harder and less accurate. Third, Robots designed for malicious purposes (such as probing a site for vulnerabilities or stealing its information) cause problems such as leakage of private data and abuse of resources. To address these problems, Internet practitioners have developed many Web Robot detection techniques that let site developers determine whether a client is an ordinary user or a Robot program.

To further improve detection of Web Robots and to compensate for the shortcomings of existing methods, this thesis uses session vectors to describe the behavior patterns of Web Robots and implements a detection algorithm based on Web Robot behavioral features. The main work is as follows: it analyzes the design principles and behavior patterns of Web Robots and reviews in detail the strengths and weaknesses of other detection algorithms; it introduces the principle of behavior vectors, the methods for analyzing them, and their applications in various fields; and it designs a Web Robot detection algorithm based on the support vector machine (SVM), analyzes its effectiveness, and tests it experimentally.

The contributions of the thesis are as follows. Based on the behavioral features of web crawlers, cluster analysis is performed on Web logs to extract feature vectors that characterize Web access sessions; this extraction is then improved, and a method for computing feature-vector weights is given together with an improved weight formula. Building on the SVM-based crawler detection algorithm, a behavior-pattern-based crawler detection system is designed and implemented, and its system architecture and module design are described in detail.
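
The abstract does not list the session features or the weight formula the thesis uses, so the following is only a minimal sketch of the general idea of turning a Web log into per-session feature vectors. Everything in it is an assumption made for illustration: the `Request` record, the 30-minute session timeout, the grouping by IP address and user agent, and the six features (request count, image ratio, HEAD ratio, 4xx ratio, a `robots.txt` flag, mean gap between requests).

```python
from collections import defaultdict
from dataclasses import dataclass

SESSION_TIMEOUT = 30 * 60  # seconds of inactivity that end a session (assumed value)
IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".ico")


@dataclass
class Request:
    """One parsed log entry (hypothetical record, not the thesis's format)."""
    ip: str
    agent: str
    timestamp: float  # seconds since some epoch
    method: str
    path: str
    status: int


def split_sessions(requests):
    """Group requests by (ip, agent) and start a new session after a long gap."""
    by_client = defaultdict(list)
    for req in sorted(requests, key=lambda r: r.timestamp):
        by_client[(req.ip, req.agent)].append(req)

    sessions = []
    for reqs in by_client.values():
        current = [reqs[0]]
        for prev, req in zip(reqs, reqs[1:]):
            if req.timestamp - prev.timestamp > SESSION_TIMEOUT:
                sessions.append(current)
                current = []
            current.append(req)
        sessions.append(current)
    return sessions


def session_vector(session):
    """Map one session to a fixed-length feature vector (illustrative features)."""
    n = len(session)
    gaps = [b.timestamp - a.timestamp for a, b in zip(session, session[1:])]
    return [
        float(n),                                                           # request count
        sum(r.path.lower().endswith(IMAGE_SUFFIXES) for r in session) / n,  # image ratio
        sum(r.method == "HEAD" for r in session) / n,                       # HEAD ratio
        sum(400 <= r.status < 500 for r in session) / n,                    # 4xx ratio
        1.0 if any(r.path == "/robots.txt" for r in session) else 0.0,      # robots.txt hit
        sum(gaps) / len(gaps) if gaps else 0.0,                             # mean gap (s)
    ]
```

Vectors of this kind, paired with human/Robot labels, are the sort of input an SVM classifier (for example `sklearn.svm.SVC`) could be trained on; the weighting step described in the abstract is not reproduced here.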
[Degree-granting institution]: Wuhan Research Institute of Posts and Telecommunications
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP393.092; TP391.3