Research on Concept Drift Detection and Ensemble Classification for Data Streams
Published: 2018-05-21 00:21
Topics: data stream + summary structure; Source: Sichuan Normal University, master's thesis, 2017
[Abstract]: Big data has driven major changes in the information age, affecting the economy, science and technology, and society. One of the forms it takes is massive real-time data streams. These streams hold great value, and how to mine and process them has become a research focus in data mining both in China and abroad. Data streams are ordered, real-time, high-speed, dynamic, and potentially unbounded, and handling them covers storage, processing, analysis, and application. A summary structure is a technique for coping with the potentially unbounded nature of a data stream, but existing summary-structure algorithms have two drawbacks: the relative reconstruction error between the reconstructed stream and the original stream is large, and their parameters are hard to tune. Concept drift detection addresses the dynamic nature of data streams, and ensemble classification is widely used in data stream classification because of its high accuracy and its ability to adapt to concept drift. However, concept drift detection and ensemble classification usually assume that data stream labels are available promptly, an assumption that rarely holds in practice. To address these problems, this thesis makes three contributions:
(1) It implements SH-HAS, a hierarchical forgetting summary structure for data streams based on SimHash. The structure uses the SimHash algorithm to obtain summary information and adjusts the SH-HAS structure dynamically, which addresses the large error between the reconstructed dataset and the original dataset. Experiments show that SH-HAS has a smaller relative reconstruction error.
(2) It improves the FKNNModel concept drift detection algorithm and proposes the MFKNNModel concept drift detection algorithm. MFKNNModel detects concept drift from changes in the spatial distribution of the data and uses the parallel computation of Spark Streaming to improve running efficiency, which addresses the manual intervention and computational efficiency problems of FKNNModel. Experiments show that, without manual intervention, MFKNNModel detects concept drift well and runs efficiently.
(3) It proposes an ensemble classification model for concept-drifting data streams (Ensemble Classifier Based on Concept-Drifting Data Stream, ECCDDS). Base classifiers are generated by horizontal ensemble, and their outputs are combined by weighted voting to produce the classification result of the ensemble. The ECCDDS algorithm first builds the summary structure of the data stream, then applies the MFKNNModel concept drift detection algorithm, updates the ensemble model when concept drift occurs, and finally classifies the data. ECCDDS removes the assumption that data stream labels are available promptly. It thereby avoids the problem that arises when an ensemble classifier relies on classification accuracy for drift detection and model updating, namely that the class labels of subsequently arriving data are not available in time. The Spark Streaming framework is used to address the computing resource and computational efficiency problems of the ensemble classifier. Experiments on real and synthetic datasets verify the effectiveness of the ECCDDS ensemble classification model.
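As a rough illustration of the weighted-voting step mentioned in the abstract, the Python sketch below combines the predictions of several base classifiers by summing a weight per predicted label and returning the label with the largest total. The abstract does not say how ECCDDS assigns weights, so the function name `weighted_vote`, the toy threshold classifiers, and the weight values are all hypothetical and chosen only for the example.

```python
from collections import defaultdict
from typing import Callable, Hashable, Sequence

# A base classifier is modeled here as any callable mapping a feature
# vector to a label (an assumption made for this sketch).
Classifier = Callable[[Sequence[float]], Hashable]


def weighted_vote(
    classifiers: Sequence[Classifier],
    weights: Sequence[float],
    x: Sequence[float],
) -> Hashable:
    """Return the label whose supporting classifiers have the largest total weight."""
    if not classifiers:
        raise ValueError("at least one base classifier is required")
    if len(classifiers) != len(weights):
        raise ValueError("one weight per base classifier is required")
    totals = defaultdict(float)
    for clf, weight in zip(classifiers, weights):
        totals[clf(x)] += weight  # each classifier adds its weight to its vote
    return max(totals, key=totals.get)


# Hypothetical base classifiers: simple threshold rules on two features.
base = [
    lambda x: "A" if x[0] > 0.5 else "B",
    lambda x: "A" if x[1] > 0.5 else "B",
    lambda x: "B",
]
weights = [0.5, 0.3, 0.4]  # hypothetical weights, not taken from the thesis

label_mixed = weighted_vote(base, weights, [0.9, 0.2])
label_agree = weighted_vote(base, weights, [0.9, 0.9])
```

In a drift-aware setting the weights would typically be recomputed, or base classifiers replaced, whenever the drift detector fires. That update logic is outside the scope of this sketch.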
[Degree-granting institution]: Sichuan Normal University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP311.13