網(wǎng)絡(luò)流量測(cè)量中基于計(jì)數(shù)的頻繁項(xiàng)挖掘算法研究
發(fā)布時(shí)間:2018-03-12 20:51
本文選題:數(shù)據(jù)流 切入點(diǎn):頻繁項(xiàng) 出處:《燕山大學(xué)》2014年碩士論文 論文類型:學(xué)位論文
【摘要】:大流識(shí)別是在網(wǎng)絡(luò)流量測(cè)量數(shù)據(jù)流中查詢頻繁項(xiàng)的過程,近些年成為了數(shù)據(jù)流挖掘領(lǐng)域的一個(gè)研究熱點(diǎn)。網(wǎng)絡(luò)流量數(shù)據(jù)流具有高速、不確定、時(shí)變等特征,并且大流識(shí)別對(duì)查詢結(jié)果的精度要求較高,對(duì)算法的時(shí)空開銷也有一定限制。故目前的數(shù)據(jù)流頻繁項(xiàng)挖掘算法仍存在識(shí)別效率低、處理實(shí)時(shí)性差、性能亟待改進(jìn)等問題。 首先,針對(duì)PLC(Probabilistic Lossy Counting)算法的計(jì)算公式和清除策略,進(jìn)一步優(yōu)化時(shí)空開銷,提出了一種基于計(jì)數(shù)的空間優(yōu)化的大流識(shí)別算法DPLC(Delta Probabilistic Lossy Counting)。該算法引入變化度和穩(wěn)態(tài)值兩個(gè)概念,來(lái)衡量計(jì)算出的窗口誤差值是否達(dá)到穩(wěn)態(tài),達(dá)到穩(wěn)態(tài)后可以省去計(jì)算公式的時(shí)間,從而降低時(shí)間開銷。此外,邊界刪除時(shí)采用更嚴(yán)格的誤差值,區(qū)分對(duì)待是否使用計(jì)算出的窗口誤差值替換表?xiàng)l目的估計(jì)值,,可以有效降低存儲(chǔ)開銷。 其次,針對(duì)LC(Lossy Counting)算法存儲(chǔ)開銷大且識(shí)別精度低,而PLC算法識(shí)別精度較高但可能出現(xiàn)漏報(bào)的問題,進(jìn)一步優(yōu)化算法性能,提出了一種非線性有損計(jì)數(shù)的大流識(shí)別算法NLC(Nonlinear Lossy Counting)。分別采用非線性函數(shù)作為統(tǒng)計(jì)值和刪除值,通過有效控制算法的存儲(chǔ)開銷來(lái)降低輸出結(jié)果的誤報(bào)和避免漏報(bào)。值得注意的是,非線性函數(shù)的參數(shù)值選取是否合適對(duì)NLC算法的性能發(fā)揮有影響,實(shí)驗(yàn)分析總結(jié)了非線性函數(shù)的參數(shù)值與包分布、支持度之間的變化規(guī)律。設(shè)置不同參數(shù)值的非線性函數(shù)可以改變NLC算法的存儲(chǔ)開銷,進(jìn)而改變了識(shí)別結(jié)果的誤報(bào)率和漏報(bào)率。 最后,通過實(shí)驗(yàn)使用人工合成數(shù)據(jù)集和真實(shí)網(wǎng)絡(luò)流量數(shù)據(jù)集評(píng)估算法的性能。實(shí)驗(yàn)結(jié)果展示了本文提出的算法在識(shí)別精度、時(shí)間開銷、存儲(chǔ)開銷等方面的優(yōu)勢(shì)。本文實(shí)驗(yàn)均在MyEclipse平臺(tái)下開發(fā),用java語(yǔ)言實(shí)現(xiàn)。
[Abstract]:Large flow identification is the process of querying frequent items in the network traffic measurement data stream. In recent years, it has become a research hotspot in the field of data stream mining. The network traffic data stream has the characteristics of high speed, uncertainty, time-varying and so on. Large stream recognition requires high precision of query results and also has some limitations on the space-time overhead of the algorithm. Therefore, the current algorithms for frequent item mining of data streams still have some problems such as low recognition efficiency, poor real-time processing, and the performance needs to be improved. First of all, aiming at the calculation formula and clearing strategy of PLC(Probabilistic Lossy counting algorithm, and further optimizing space-time overhead, a large flow recognition algorithm DPLC(Delta Probabilistic Lossy counting algorithm based on counting is proposed, which introduces the concepts of degree of variation and steady-state value. To measure whether the calculated window error is steady-state, the time of the formula can be saved and the time cost can be reduced. In addition, a more strict error is used when the boundary is deleted. It can effectively reduce the storage overhead by distinguishing whether to replace the estimated values of table entries with the calculated window error values. Secondly, aiming at the problem that the LC(Lossy counting algorithm has high storage cost and low recognition accuracy, but the PLC algorithm has high recognition accuracy but may be underreported, the performance of the algorithm is further optimized. In this paper, a large flow recognition algorithm for nonlinear lossy counting, NLC(Nonlinear Lossy counting algorithm, is proposed. The nonlinear function is used as the statistical value and the deletion value respectively. By effectively controlling the storage cost of the algorithm, the false positives of the output results are reduced and the false positives are avoided. Whether the parameter value of the nonlinear function is suitable or not has an effect on the performance of the NLC algorithm. The parameter value and the packet distribution of the nonlinear function are analyzed and summarized in the experiment. The nonlinear function with different parameter values can change the storage cost of NLC algorithm and then change the false alarm rate and false alarm rate of the recognition results. Finally, the performance of the algorithm is evaluated by using artificial data sets and real network traffic datasets. The experimental results show that the proposed algorithm has high recognition accuracy and time overhead. The experiments in this paper are developed on MyEclipse platform and realized by java language.
【學(xué)位授予單位】:燕山大學(xué)
【學(xué)位級(jí)別】:碩士
【學(xué)位授予年份】:2014
【分類號(hào)】:TP393.06
【參考文獻(xiàn)】
相關(guān)期刊論文 前5條
1 ;Identifying heavy hitters in high-speed network monitoring[J];Science China(Information Sciences);2010年03期
2 李臻;楊雅輝;謝高崗;覃光成;;一種基于數(shù)據(jù)流計(jì)數(shù)的概率衰落大業(yè)務(wù)流識(shí)別方法[J];計(jì)算機(jī)研究與發(fā)展;2011年06期
3 祝然威;王鵬;劉馬金;;基于計(jì)數(shù)的數(shù)據(jù)流頻繁項(xiàng)挖掘算法[J];計(jì)算機(jī)研究與發(fā)展;2011年10期
4 王秀坤;王鐵存;周國(guó)能;馮維;;挖掘數(shù)據(jù)流近似頻繁項(xiàng)的改進(jìn)算法[J];計(jì)算機(jī)工程與應(yīng)用;2008年13期
5 王偉平;李建中;張冬冬;郭龍江;;一種有效的挖掘數(shù)據(jù)流近似頻繁項(xiàng)算法[J];軟件學(xué)報(bào);2007年04期
本文編號(hào):1603253
本文鏈接:http://sikaile.net/guanlilunwen/ydhl/1603253.html
最近更新
教材專著