Effective Methods for Outlier Detection in Static and Streaming Data
Published: 2024-02-20 04:42
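The accessibility, convenience, and reliability of data are critical, and clean data in any form has become a new kind of wealth in today's society. In many fields, the sheer volume and velocity of data pose major challenges, so the ability to maintain high-quality data has become very important. Data gives enterprises in every industry a way to analyze the value of their activities, which helps them reach their full potential and gain a greater advantage over competitors. Enterprises are therefore investing heavily in data mining capabilities, hoping to uncover hidden value in different types of data. Outlier detection is a very important data mining task: its goal is to detect objects that deviate from the expected pattern of normal data, because outliers are highly likely to distort the results of data analysis. It is an important problem with wide application across different domains and data types. Outliers have many potential sources, and identifying them in large datasets requires effective methods. As the digital era advances, outlier detection is becoming increasingly challenging. For example, with the shift away from traditional batch-processed data, large volumes of data are now generated continuously at high speed and in a dynamic fashion. Such data may contain redundant information, which often degrades the efficiency and overall performance of outlier detection methods. Over the years, methods and techniques based on different algorithms have been proposed to address the challenges of outlier detection. Some common difficulties relate to the nature of the input data, the type of outlier, data labels, accuracy, and CPU time and memory consumption...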
[Pages]: 163
[Degree level]: Doctoral
[Table of contents]:
Abstract (in Chinese)
Abstract
Abbreviations
Chapter 1 Introduction
1.1 Motivation
1.2 Fundamental Concepts
1.2.1 The Definition of Outliers
1.2.2 Static Data
1.2.3 Streaming Data
1.2.4 Causes of Outliers, Identification Process and Handling Process
1.2.5 Application Areas of Outlier Detection
1.3 Research Goals and Contributions
1.4 Main Contents and Technological Route
Chapter 2 Related Work-Progress in Outlier Detection Techniques
2.1 Outlier Detection Methods
2.2 Statistical-Based Approaches
2.2.1 Parametric Methods
2.2.2 Non-Parametric Methods
2.2.3 Advantages, Disadvantages and Challenges
2.3 Distance-Based Approaches
2.3.1 K-Nearest Neighbor Methods
2.3.2 Pruning Methods
2.3.3 Data Stream Methods
2.3.4 Advantages, Disadvantages and Challenges
2.4 Clustering-Based Approaches
2.4.1 Partitioning and Hierarchical Clustering Methods
2.4.2 Density-based and Grid-based Clustering Methods
2.4.3 Advantages, Disadvantages and Challenges
2.5 Chapter Summary
Chapter 3 Parametric and Non-Parametric Approach for High-Accurate Outlier Detection in Static Data
3.1 Introduction
3.2 Parametric Approach
3.2.1 Gaussian Mixture Model for Outlier Detection (GMMOD)
3.2.2 Learning Model and Algorithms
3.2.3 The GMMOD Algorithm
3.3 Non-Parametric Approach
3.3.1 Kernel Density Estimation for Outlier Detection (KDEOD)
3.3.2 Bandwidth Selection
3.3.3 The KDEOD Algorithm
3.4 Experimental Evaluation
3.4.1 Experimental Setup
3.4.2 Data Description
3.4.3 Performance Evaluation
3.4.4 Experimental Results
3.4.5 Discussion
3.5 Chapter Summary
Chapter 4 An Effective Minimal Probing Approach Distance-Based Outlier Detection in Data Streams
4.1 Introduction
4.2 Definition of Key Terms
4.3 Problem Formulation
4.4 Methodology
4.4.1 Micro-Cluster with Minimal Probing
4.4.2 Data Points Within the Current Window
4.4.3 Processing the New Data Points and New Slide
4.4.4 Processing the Expired Window and Slide
4.4.5 Processing and Reporting Outliers
4.5 Experiments and Results
4.5.1 Varying Window Size, W
4.5.2 Varying the Nearest Neighbor Count, K
4.5.3 Varying the Distance Threshold, R
4.5.4 Complexity Analysis
4.5.5 The Advantage and Disadvantages of the Proposed Method
4.6 Chapter Summary
Chapter 5 CLODS: An Effective Clustering-Based Technique for Detecting Outliers in Data Streams
5.1 Introduction
5.2 Preliminaries and Problem Statement
5.3 Methodology
5.3.1 Fundamentals of the Proposed Method
5.3.2 The Proposed Framework
5.3.3 The Data Stream Stage
5.3.4 Data Preprocessing Stage
5.3.5 Sliding Window Based Outlier Detection Stage
5.3.6 The Clustering Process Stage
5.3.7 The Outlier Detection Stage
5.4 Experimental Setup and Results
5.4.1 Experimental Setup
5.4.2 Results and Discussions
5.4.2.1 CPU Time
5.4.2.2 Memory Usage
5.4.2.3 Space and Time Complexity
5.4.2.4 Data Points in Cluster
5.5 Chapter Summary
Conclusions and Future Work
References
List of Publications
Acknowledgements
Resume
Article ID: 3903857
Article link: http://sikaile.net/kejilunwen/shengwushengchang/3903857.html