基于密度網(wǎng)格的數(shù)據(jù)流聚類和概念漂移檢測算法研究
本文選題:數(shù)據(jù)挖掘 + 數(shù)據(jù)流 ; 參考:《北京交通大學》2017年碩士論文
【摘要】:數(shù)據(jù)流聚類算法是一項關(guān)鍵的數(shù)據(jù)挖掘技術(shù),在數(shù)據(jù)流聚類研究中,算法框架可以分為兩類:single-phase model 和 two-phase scheme。應用 two-phase scheme 的基于密度網(wǎng)格的數(shù)據(jù)流聚類框架,包含了在線處理階段和離線處理階段。在線處理階段中,將據(jù)流數(shù)據(jù)映射到網(wǎng)格中,在離線處理階段中,對網(wǎng)格數(shù)據(jù)聚類,此框架降低了數(shù)據(jù)流聚類的難度。但是在離線處理階段中,這種聚類框架也存在三點缺陷:(1)基于固定閾值的稀疏網(wǎng)格或稠密網(wǎng)格判定不能適用于不均勻分布的數(shù)據(jù)流和多密度的數(shù)據(jù)流;(2)基于密度把相鄰的網(wǎng)格連接為一類,而沒有考慮數(shù)據(jù)之間的相似度,數(shù)據(jù)間相似度考量的缺失會影響數(shù)據(jù)聚類的準確性;(3)邊界點的檢測考量不夠全面,有的邊界點是噪音,而有的邊界點可能屬于鄰近的簇。數(shù)據(jù)流的概念也會隨著時間的推移而改變,這種現(xiàn)象被稱為概念漂移。DCDA是一種基于粗糙集理論和滑動窗口技術(shù)的概念漂移檢測算法,其主要思想是:計算兩個滑動窗口之間的距離判斷概念漂移。這種算法存在如下缺陷:(1)只適用于分類型數(shù)據(jù);(2)沒有考慮一個窗口中包含多概念的情況;(3)無法確定合適的滑動窗口尺寸。針對以上問題,本文的主要貢獻如下:第一,針對DCDA概念漂移檢測存在的缺陷,提出了一種基于密度網(wǎng)格的數(shù)據(jù)流概念漂移檢測框架(簡稱DCDD)。該框架利用網(wǎng)格技術(shù),進而使得其適用于一般的數(shù)據(jù)。在解決滑動窗口中多概念問題上,在在線處理階段中創(chuàng)建一個臨時密度網(wǎng)格和一個歷史密度網(wǎng)格,根據(jù)數(shù)據(jù)集到達時間給網(wǎng)格賦予一個權(quán)值擴展了DCDA檢測模型,計算臨時密度網(wǎng)格和歷史密度網(wǎng)格的距離檢測概念漂移。在離線處理階段中訓練提取的概念漂移特征,提出一個預測模型,預測概念數(shù)據(jù)量,并根據(jù)預測量設計了可變尺寸的滑動窗口。實驗結(jié)果表明,我們檢測概念漂移的時間遠低于DCDA算法,且檢測的概念漂移更準確,更有效。第二,針對基于密度網(wǎng)格的數(shù)據(jù)流聚類框架的缺陷,提出了一種基于相對密度網(wǎng)格的數(shù)據(jù)流聚類算法和邊界檢測算法。其主要思想是:計算相鄰網(wǎng)格之間的相似性,并根據(jù)相似性作為權(quán)重去影響相鄰網(wǎng)格之間的連接,而連接相鄰網(wǎng)格是根據(jù)一個考慮了密度、質(zhì)心和相鄰網(wǎng)格之間的相似性權(quán)重的差異模型。最后,.我們提出了一個邊界檢測算法,使用隸屬函數(shù)給簇周圍稀疏網(wǎng)格中的數(shù)據(jù)點打上簇標簽。實驗結(jié)果表明,我們的算法適用于多密度分布的數(shù)據(jù)流,且具有更好的聚類質(zhì)量。
[Abstract]:Data stream clustering algorithm is a key data mining technology. In data stream clustering research, the algorithm framework can be divided into two categories: single-phase model and two-phase schema. The data flow clustering framework based on density grid of two-phase scheme is used, which includes on-line processing stage and off-line processing stage. In the on-line processing stage, the streaming data is mapped to the grid, and in the off-line processing stage, the grid data clustering is reduced by this framework. But in the off-line processing phase, The clustering framework also has three defects: 1) sparse or dense grid decision based on fixed threshold is not suitable for inhomogeneous distributed data flow and multi-density data stream / 2) the adjacent grids are connected as a class based on density. Without considering the similarity between data, the lack of similarity consideration will affect the accuracy of data clustering. The detection of boundary points is not comprehensive enough, some of the boundary points are noise, and some of the boundary points may belong to adjacent clusters. The concept of data flow also changes over time. This phenomenon is called concept drift. DCDA is a concept drift detection algorithm based on rough set theory and sliding window technology. The main idea is to calculate the distance between two sliding windows to judge the concept drift. This algorithm has the following defects: 1) it can only be applied to classified data / 2) and does not consider the case where a window contains multiple concepts) it is unable to determine the appropriate sliding window size. The main contributions of this paper are as follows: first, a DCDA conceptual drift detection framework based on density grid (DCDD) is proposed to overcome the shortcomings of DCDA concept drift detection. The framework makes use of grid technology to make it applicable to general data. In order to solve the problem of multi-concept in sliding window, a temporary density grid and a historical density grid are created in the online processing phase, and the DCDA detection model is extended to the grid according to the time of arrival of the data set. The distance detection concept drift of temporary density grid and historical density grid is calculated. In the off-line processing stage, the concept drift features are trained, a prediction model is proposed to predict the conceptual data volume, and a sliding window with variable size is designed according to the prediction quantity. The experimental results show that the detection time of the concept drift is much lower than that of the DCDA algorithm, and the detection of the concept drift is more accurate and effective. Secondly, a data stream clustering algorithm and a boundary detection algorithm based on relative density grid are proposed to overcome the shortcomings of the data stream clustering framework based on density grid. The main idea is to calculate the similarity between adjacent meshes and use the similarity as the weight to affect the connection between adjacent meshes. The difference model of similarity weight between centroid and adjacent grid. Finally. We propose a boundary detection algorithm which uses membership functions to label data points in sparse grids around clusters. Experimental results show that our algorithm is suitable for multi-density data flow and has better clustering quality.
【學位授予單位】:北京交通大學
【學位級別】:碩士
【學位授予年份】:2017
【分類號】:TP311.13
【參考文獻】
相關(guān)期刊論文 前10條
1 唐穎峰;陳世平;;一種基于網(wǎng)格塊的分布式數(shù)據(jù)流聚類算法[J];小型微型計算機系統(tǒng);2016年03期
2 王習特;申德榮;白梅;聶鐵錚;寇月;于戈;;BOD:一種高效的分布式離群點檢測算法[J];計算機學報;2016年01期
3 Amineh Amini;Teh Ying Wah;Hadi Saboohi;;On Density-Based Data Streams Clustering Algorithms: A Survey[J];Journal of Computer Science & Technology;2014年01期
4 朱林;雷景生;畢忠勤;楊杰;;一種基于數(shù)據(jù)流的軟子空間聚類算法[J];軟件學報;2013年11期
5 孫吉貴;劉杰;趙連宇;;聚類算法研究[J];軟件學報;2008年01期
6 常建龍;曹鋒;周傲英+;;基于滑動窗口的進化數(shù)據(jù)流聚類[J];軟件學報;2007年04期
7 周曉云;孫志揮;張柏禮;楊宜東;;高維數(shù)據(jù)流聚類及其演化分析研究[J];計算機研究與發(fā)展;2006年11期
8 秦首科;錢衛(wèi)寧;周傲英;;基于分形技術(shù)的數(shù)據(jù)流突變檢測算法[J];軟件學報;2006年09期
9 王U,
本文編號:2030832
本文鏈接:http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2030832.html