Research on Malicious Website Detection Technology Based on Data Mining

Published: 2019-06-25 10:03

[Abstract]: As the Internet has developed, network security has drawn growing attention. Frequent attacks by malicious websites cause heavy financial losses for users and seriously threaten personal and even national security, so building a model that identifies and detects malicious websites is of great importance. Many researchers in China and abroad have improved feature selection methods, mostly by mining host features and lexical features in greater depth, but accuracy and efficiency remain unsatisfactory. To address these problems, this thesis first proposes, for feature extraction, the concept of a list of vulnerable websites, and on that basis a new feature extraction scheme based on weighted distance. For the data mining algorithm, it improves the KNN model using an improved fuzzy C-means (FCM) clustering algorithm, which raises the efficiency of the model. The research work consists mainly of the following four parts.

Data acquisition: Data from normal websites and malicious websites are crawled separately, then cleaned, standardized, and stored in a MySQL database.

Feature extraction: In contrast to the common notions of a website whitelist and a website blacklist, the thesis compiles the websites that are most likely to be attacked and proposes a list of vulnerable websites. Because a malicious website is usually a modified copy of a normal website, each type of modification is assigned its own weight, which yields the concept of weighted distance. For any input URL, the smallest weighted distance between that URL and the URLs in the vulnerable-website list is computed and used as a new feature.

Model improvement: Both the KNN algorithm and the FCM algorithm are improved. FCM's initial cluster centers are uncertain and the algorithm easily falls into a local optimum, so a coordinate density method is proposed to determine the initial cluster centers. The initial number of clusters in FCM is normally chosen at random, so a method is proposed that determines it from the K value and the number of samples in the dataset. This yields the cluster centers of the samples and the cluster to which each center belongs. A test sample is then classified by finding the cluster whose center is closest to it.

Model verification: The LR model, the J48 model, and the improved KNN model are used to classify the data in WEKA. Classification accuracy on data that includes the new feature is compared with accuracy on data that uses only the original features, and the classification results improve to some extent. A comparison with methods from other literature also shows that the proposed feature performs well.
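The weighted-distance feature described in the abstract can be illustrated with a minimal sketch. The abstract does not give the change types or their weights, so this example assumes a weighted Levenshtein distance with three change types (insertion, deletion, substitution) and hypothetical weights. The list entries and the names `weighted_distance`, `nearest_weighted_distance`, and `VULNERABLE_HOSTS` are also illustrative assumptions and are not taken from the thesis.

```python
from typing import Iterable

# Hypothetical weights per change type; the source does not state its values.
W_INSERT = 1.0
W_DELETE = 1.0
W_SUBSTITUTE = 0.5


def weighted_distance(a: str, b: str) -> float:
    """Weighted Levenshtein distance: cheapest cost of turning a into b."""
    # Row 0: turning the empty prefix of a into b[:j] takes j insertions.
    prev = [j * W_INSERT for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        # Turning a[:i] into the empty string takes i deletions.
        curr = [i * W_DELETE]
        for j, cb in enumerate(b, start=1):
            substitute = prev[j - 1] + (0.0 if ca == cb else W_SUBSTITUTE)
            delete = prev[j] + W_DELETE
            insert = curr[j - 1] + W_INSERT
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


def nearest_weighted_distance(host: str, vulnerable_hosts: Iterable[str]) -> float:
    """Smallest weighted distance from host to any entry in the list."""
    return min(weighted_distance(host, v) for v in vulnerable_hosts)


# Placeholder entries for the vulnerable-website list (assumed, not from the thesis).
VULNERABLE_HOSTS = ["paypal.com", "example-bank.com"]

# A look-alike host that differs from a listed host by one substituted character.
feature = nearest_weighted_distance("paypa1.com", VULNERABLE_HOSTS)
print(feature)
```

A small value signals that the input closely imitates a listed site, which is why the distance may be useful as a classification feature. In practice the weights would need to be tuned on labeled data rather than fixed by hand as they are here.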
[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP393.092;TP311.13

Article ID: 2505599

Article link: http://sikaile.net/guanlilunwen/ydhl/2505599.html
