
Research and Implementation of Semi-Supervised Network Traffic Classification for Imbalanced Data

Published: 2018-12-20 07:52

[Abstract]: Network traffic classification underpins network service management and control, network security, and the construction, upgrading, operation and management of networks, so research on it has significant practical value. With the rapid development of network technology, the number of users has surged, networks have grown quickly in scale, and new services keep emerging. The network environment is therefore increasingly complex, and accurate traffic classification is becoming harder. In particular, as techniques such as dynamic port numbers and service encryption have come into wide use, traditional identification methods based on port numbers and payload signature matching have become less effective and less reliable, and researchers have shifted their attention to traffic classification based on machine learning. These methods classify flows by their statistical features, which removes the dependence on port numbers and packet payloads and gives them broader prospects. This thesis studies two key problems in machine-learning-based traffic classification: the sample-labeling bottleneck and class imbalance. The main contributions are as follows:

1. To address the sample-labeling bottleneck and class imbalance in traffic classification, a semi-supervised traffic classification algorithm based on K-means and k-nearest neighbors is proposed (semi-supervised traffic identification method based on K-means and k-nearest neighbor, KMkNN). The method represents each flow as a high-dimensional vector of flow statistical features and builds a two-level classifier from K-means and k-nearest-neighbor classification. First, K-means clustering groups a data set containing a small number of labeled samples and a large number of unlabeled samples into several clusters. Then the labeled samples in each cluster are used to train a k-nearest-neighbor classifier that classifies the unlabeled samples in that cluster, with the number of neighbors k adjusted adaptively according to the distribution of the labeled samples. This overcomes the tendency of traditional semi-supervised traffic classification methods to favor large classes, which leaves small classes with low recognition rates or undetected altogether. Both theoretical analysis and experimental results indicate that, on imbalanced protocol flows, the method improves the recognition rate of small-class flows while maintaining a high recognition rate for large-class flows, and that it can discover new applications.

2. For the case where flow statistical features are redundant and can be divided into several relatively independent feature subsets, an ensemble traffic classification algorithm based on random feature subsets is proposed (ensemble classifier based on random subspace, RSEC). The algorithm first performs wrapper-style feature selection with forward selection to build the feature set, then generates feature subsets by staged random selection, trains a different base classifier on each feature subset, and finally merges the base classifiers' outputs with a vote that combines absolute-majority and relative-majority (plurality) rules. Experimental results show that, for both large and small traffic classes, the algorithm further improves recognition accuracy and recall over the single classifier KMkNN.

3. Taking a real network environment into account, an offline traffic classification system based on machine learning is designed and implemented in C#. The system uses Wireshark to capture traffic online and save it locally for offline analysis. The flow-feature generation module reassembles flows according to their five-tuple information and derives flow features from statistics of packet header fields. The sample-labeling module labels training samples by combining port-number matching, payload signature matching and manual labeling. The classification module offers five selectable algorithms: C4.5, NBK, semi-supervised K-means, and the KMkNN and RSEC algorithms proposed in this thesis. Finally, the system is tested on real data collected in the laboratory, which verifies its effectiveness.
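The two-level idea behind KMkNN (cluster everything with K-means, then run k-nearest neighbors inside each cluster using only that cluster's labeled flows) can be illustrated with a small Python sketch. This is not the thesis's implementation. The abstract does not state how k is adapted or how new applications are detected, so two rules here are assumptions made for illustration: k is capped at the size of the smallest labeled class in the cluster, and a cluster with no labeled flows is reported as "unknown". The function names, the toy 2-D features and the deterministic centroid initialization are also hypothetical.

```python
import math
from collections import Counter


def kmeans(points, k, iters=20):
    """Plain Lloyd's K-means. Initial centroids are the first k points
    (an assumption chosen only to keep the example deterministic)."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    # Final assignment against the last centroids.
    return [min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points]


def kmknn(points, labels, n_clusters, k_max=5):
    """labels[i] is a class name, or None for an unlabeled flow.
    Returns one predicted label per flow."""
    assign = kmeans(points, n_clusters)
    pred = list(labels)
    for c in range(n_clusters):
        idx = [i for i, a in enumerate(assign) if a == c]
        labeled = [i for i in idx if labels[i] is not None]
        if not labeled:
            # Assumed rule: a cluster without labeled flows may be a new application.
            for i in idx:
                pred[i] = "unknown"
            continue
        counts = Counter(labels[i] for i in labeled)
        # Assumed adaptive rule: never use more neighbors than the smallest
        # labeled class has members, so a minority class cannot be outvoted.
        k = max(1, min(k_max, min(counts.values())))
        for i in idx:
            if labels[i] is not None:
                continue
            nearest = sorted(labeled,
                             key=lambda j: math.dist(points[i], points[j]))[:k]
            pred[i] = Counter(labels[j] for j in nearest).most_common(1)[0][0]
    return pred


# Toy data: one cluster mixes a large class ("http") with a small one ("dns").
points = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0), (0.5, 0.0), (0.0, 0.5),
          (0.5, 0.5), (2.0, 2.0), (2.2, 2.1), (0.2, 0.3), (10.5, 10.0),
          (20.5, 0.5)]
labels = ["http", "ssh", None, "http", "http", "http", "dns",
          None, None, None, None]
print(kmknn(points, labels, n_clusters=3))
```

In this toy data the unlabeled flow at (2.2, 2.1) sits next to the single labeled "dns" flow. With a fixed k of 5 it would be outvoted four to one by "http"; with the capped k it is assigned to "dns", which is the kind of small-class behavior the abstract describes.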
[Degree-granting institution]: PLA Information Engineering University
[Degree level]: Master's
[Year degree awarded]: 2014
[Classification number]: TP393.06
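The RSEC pipeline summarized in the abstract (feature subsets, one base classifier per subset, then a combined absolute-majority and relative-majority vote) can be sketched in Python as follows. Several parts are simplifications or assumptions, not the thesis's method: the wrapper-based forward selection step is skipped, the staged random selection is replaced by plain uniform sampling of feature indices, the base classifier is a 1-nearest-neighbor stand-in rather than KMkNN, and the exact way the two voting rules are combined (accept an absolute majority, otherwise fall back to a unique plurality winner, otherwise abstain) is a guess. All function names are hypothetical.

```python
import math
import random
from collections import Counter


def random_subspaces(n_features, subset_size, n_subsets, seed=0):
    """Draw feature-index subsets uniformly at random (a stand-in for the
    staged random selection mentioned in the abstract)."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n_features), subset_size))
            for _ in range(n_subsets)]


def nn_predict(train_x, train_y, x, features):
    """1-nearest-neighbor base classifier restricted to the given features."""
    def dist(a, b):
        return math.dist([a[f] for f in features], [b[f] for f in features])
    best = min(range(len(train_x)), key=lambda i: dist(train_x[i], x))
    return train_y[best]


def combine_votes(votes):
    """Assumed combination rule: absolute majority first, then plurality.
    Returns (label, rule); label is None when the top vote is tied."""
    ranked = Counter(votes).most_common()
    label, top = ranked[0]
    if top * 2 > len(votes):
        return label, "absolute"
    if len(ranked) > 1 and ranked[1][1] == top:
        return None, "tie"
    return label, "relative"


def rsec_predict(train_x, train_y, x, subspaces):
    """Collect one vote per feature subset and merge the votes."""
    votes = [nn_predict(train_x, train_y, x, fs) for fs in subspaces]
    return combine_votes(votes)


# Toy usage: six flow features, five base classifiers on three features each.
train_x = [(0, 0, 0, 0, 0, 0), (1, 1, 1, 1, 1, 1),
           (10, 10, 10, 10, 10, 10), (11, 11, 11, 11, 11, 11)]
train_y = ["a", "a", "b", "b"]
subspaces = random_subspaces(n_features=6, subset_size=3, n_subsets=5)
print(rsec_predict(train_x, train_y, (0.5, 0.2, 0.1, 0.4, 0.3, 0.6), subspaces))
```

Returning the rule alongside the label makes it visible when a decision rests only on a plurality, which a caller could treat as lower confidence.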

