基于密度網(wǎng)格的數(shù)據(jù)流聚類(lèi)和概念漂移檢測(cè)算法研究

發(fā)布時(shí)間：2018-06-17 11:09

本文選題：數(shù)據(jù)挖掘 + 數(shù)據(jù)流��；參考：《北京交通大學(xué)》2017年碩士論文

【摘要】：數(shù)據(jù)流聚類(lèi)算法是一項(xiàng)關(guān)鍵的數(shù)據(jù)挖掘技術(shù),在數(shù)據(jù)流聚類(lèi)研究中,算法框架可以分為兩類(lèi):single-phase model 和 two-phase scheme。應(yīng)用 two-phase scheme 的基于密度網(wǎng)格的數(shù)據(jù)流聚類(lèi)框架,包含了在線處理階段和離線處理階段。在線處理階段中,將據(jù)流數(shù)據(jù)映射到網(wǎng)格中,在離線處理階段中,對(duì)網(wǎng)格數(shù)據(jù)聚類(lèi),此框架降低了數(shù)據(jù)流聚類(lèi)的難度。但是在離線處理階段中,這種聚類(lèi)框架也存在三點(diǎn)缺陷:(1)基于固定閾值的稀疏網(wǎng)格或稠密網(wǎng)格判定不能適用于不均勻分布的數(shù)據(jù)流和多密度的數(shù)據(jù)流;(2)基于密度把相鄰的網(wǎng)格連接為一類(lèi),而沒(méi)有考慮數(shù)據(jù)之間的相似度,數(shù)據(jù)間相似度考量的缺失會(huì)影響數(shù)據(jù)聚類(lèi)的準(zhǔn)確性;(3)邊界點(diǎn)的檢測(cè)考量不夠全面,有的邊界點(diǎn)是噪音,而有的邊界點(diǎn)可能屬于鄰近的簇。數(shù)據(jù)流的概念也會(huì)隨著時(shí)間的推移而改變,這種現(xiàn)象被稱(chēng)為概念漂移。DCDA是一種基于粗糙集理論和滑動(dòng)窗口技術(shù)的概念漂移檢測(cè)算法,其主要思想是:計(jì)算兩個(gè)滑動(dòng)窗口之間的距離判斷概念漂移。這種算法存在如下缺陷:(1)只適用于分類(lèi)型數(shù)據(jù);(2)沒(méi)有考慮一個(gè)窗口中包含多概念的情況;(3)無(wú)法確定合適的滑動(dòng)窗口尺寸。針對(duì)以上問(wèn)題,本文的主要貢獻(xiàn)如下:第一,針對(duì)DCDA概念漂移檢測(cè)存在的缺陷,提出了一種基于密度網(wǎng)格的數(shù)據(jù)流概念漂移檢測(cè)框架(簡(jiǎn)稱(chēng)DCDD)。該框架利用網(wǎng)格技術(shù),進(jìn)而使得其適用于一般的數(shù)據(jù)。在解決滑動(dòng)窗口中多概念問(wèn)題上,在在線處理階段中創(chuàng)建一個(gè)臨時(shí)密度網(wǎng)格和一個(gè)歷史密度網(wǎng)格,根據(jù)數(shù)據(jù)集到達(dá)時(shí)間給網(wǎng)格賦予一個(gè)權(quán)值擴(kuò)展了DCDA檢測(cè)模型,計(jì)算臨時(shí)密度網(wǎng)格和歷史密度網(wǎng)格的距離檢測(cè)概念漂移。在離線處理階段中訓(xùn)練提取的概念漂移特征,提出一個(gè)預(yù)測(cè)模型,預(yù)測(cè)概念數(shù)據(jù)量,并根據(jù)預(yù)測(cè)量設(shè)計(jì)了可變尺寸的滑動(dòng)窗口。實(shí)驗(yàn)結(jié)果表明,我們檢測(cè)概念漂移的時(shí)間遠(yuǎn)低于DCDA算法,且檢測(cè)的概念漂移更準(zhǔn)確,更有效。第二,針對(duì)基于密度網(wǎng)格的數(shù)據(jù)流聚類(lèi)框架的缺陷,提出了一種基于相對(duì)密度網(wǎng)格的數(shù)據(jù)流聚類(lèi)算法和邊界檢測(cè)算法。其主要思想是:計(jì)算相鄰網(wǎng)格之間的相似性,并根據(jù)相似性作為權(quán)重去影響相鄰網(wǎng)格之間的連接,而連接相鄰網(wǎng)格是根據(jù)一個(gè)考慮了密度、質(zhì)心和相鄰網(wǎng)格之間的相似性權(quán)重的差異模型。最后,.我們提出了一個(gè)邊界檢測(cè)算法,使用隸屬函數(shù)給簇周?chē)∈杈W(wǎng)格中的數(shù)據(jù)點(diǎn)打上簇標(biāo)簽。實(shí)驗(yàn)結(jié)果表明,我們的算法適用于多密度分布的數(shù)據(jù)流,且具有更好的聚類(lèi)質(zhì)量。
[Abstract]:Data stream clustering algorithm is a key data mining technology. In data stream clustering research, the algorithm framework can be divided into two categories: single-phase model and two-phase schema. The data flow clustering framework based on density grid of two-phase scheme is used, which includes on-line processing stage and off-line processing stage. In the on-line processing stage, the streaming data is mapped to the grid, and in the off-line processing stage, the grid data clustering is reduced by this framework. But in the off-line processing phase, The clustering framework also has three defects: 1) sparse or dense grid decision based on fixed threshold is not suitable for inhomogeneous distributed data flow and multi-density data stream / 2) the adjacent grids are connected as a class based on density. Without considering the similarity between data, the lack of similarity consideration will affect the accuracy of data clustering. The detection of boundary points is not comprehensive enough, some of the boundary points are noise, and some of the boundary points may belong to adjacent clusters. The concept of data flow also changes over time. This phenomenon is called concept drift. DCDA is a concept drift detection algorithm based on rough set theory and sliding window technology. The main idea is to calculate the distance between two sliding windows to judge the concept drift. This algorithm has the following defects: 1) it can only be applied to classified data / 2) and does not consider the case where a window contains multiple concepts) it is unable to determine the appropriate sliding window size. The main contributions of this paper are as follows: first, a DCDA conceptual drift detection framework based on density grid (DCDD) is proposed to overcome the shortcomings of DCDA concept drift detection. The framework makes use of grid technology to make it applicable to general data. In order to solve the problem of multi-concept in sliding window, a temporary density grid and a historical density grid are created in the online processing phase, and the DCDA detection model is extended to the grid according to the time of arrival of the data set. The distance detection concept drift of temporary density grid and historical density grid is calculated. In the off-line processing stage, the concept drift features are trained, a prediction model is proposed to predict the conceptual data volume, and a sliding window with variable size is designed according to the prediction quantity. The experimental results show that the detection time of the concept drift is much lower than that of the DCDA algorithm, and the detection of the concept drift is more accurate and effective. Secondly, a data stream clustering algorithm and a boundary detection algorithm based on relative density grid are proposed to overcome the shortcomings of the data stream clustering framework based on density grid. The main idea is to calculate the similarity between adjacent meshes and use the similarity as the weight to affect the connection between adjacent meshes. The difference model of similarity weight between centroid and adjacent grid. Finally. We propose a boundary detection algorithm which uses membership functions to label data points in sparse grids around clusters. Experimental results show that our algorithm is suitable for multi-density data flow and has better clustering quality.
【學(xué)位授予單位】：北京交通大學(xué)
【學(xué)位級(jí)別】：碩士
【學(xué)位授予年份】：2017
【分類(lèi)號(hào)】：TP311.13

【參考文獻(xiàn)】

相關(guān)期刊論文前10條

1 唐穎峰;陳世平;;一種基于網(wǎng)格塊的分布式數(shù)據(jù)流聚類(lèi)算法[J];小型微型計(jì)算機(jī)系統(tǒng);2016年03期

2 王習(xí)特;申德榮;白梅;聶鐵錚;寇月;于戈;;BOD:一種高效的分布式離群點(diǎn)檢測(cè)算法[J];計(jì)算機(jī)學(xué)報(bào);2016年01期

3 Amineh Amini;Teh Ying Wah;Hadi Saboohi;;On Density-Based Data Streams Clustering Algorithms: A Survey[J];Journal of Computer Science & Technology;2014年01期

4 朱林;雷景生;畢忠勤;楊杰;;一種基于數(shù)據(jù)流的軟子空間聚類(lèi)算法[J];軟件學(xué)報(bào);2013年11期

5 孫吉貴;劉杰;趙連宇;;聚類(lèi)算法研究[J];軟件學(xué)報(bào);2008年01期

6 常建龍;曹鋒;周傲英+;;基于滑動(dòng)窗口的進(jìn)化數(shù)據(jù)流聚類(lèi)[J];軟件學(xué)報(bào);2007年04期

7 周曉云;孫志揮;張柏禮;楊宜東;;高維數(shù)據(jù)流聚類(lèi)及其演化分析研究[J];計(jì)算機(jī)研究與發(fā)展;2006年11期

8 秦首科;錢(qián)衛(wèi)寧;周傲英;;基于分形技術(shù)的數(shù)據(jù)流突變檢測(cè)算法[J];軟件學(xué)報(bào);2006年09期

9 王U，

本文編號(hào)：2030832

資料下載

論文發(fā)表

支付寶下載

Download by Alipay
微信下載

Download by Wechat
會(huì)員下載

Download by Member

本文鏈接：http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2030832.html

上一篇：面向電信運(yùn)營(yíng)商的用戶自助服務(wù)系統(tǒng)設(shè)計(jì)
下一篇：基于圖模型的云平臺(tái)應(yīng)用部署技術(shù)研究

論文發(fā)表

·知網(wǎng)|萬(wàn)方|維普|龍?jiān)磡省級(jí)|國(guó)家級(jí)|科技核心|北大核心|南大核心CSSCI|EI|SCI|SSCI|

天堂国产午夜亚洲专区-少妇人妻综合久久蜜臀-国产成人户外露出视频在线-国产91传媒一区二区三区

基于密度網(wǎng)格的數(shù)據(jù)流聚類(lèi)和概念漂移檢測(cè)算法研究