Research on Classification Algorithms for Massive Inconsistent Data
Published: 2018-11-29 12:07
[Abstract]: In recent years, as the volume of real-world data has grown exponentially, inconsistent data have appeared more and more frequently. The traditional approach is to repair inconsistent data through manual correction. However, as the volume of inconsistent data grows exponentially, manual correction becomes increasingly time-consuming; moreover, at larger data volumes, manual correction inevitably introduces human errors, which in turn inject erroneous values into the data. This repair-based approach is therefore no longer feasible. The core question of this thesis is how to perform feature selection and classification directly on inconsistent data, without manual correction. The decision tree is a classification algorithm with strong performance: it tolerates erroneous and outlier data well, and the tree structure it builds is interpretable, allowing the classification subsets of the data to be read off directly; this thesis therefore chooses it for improvement. The mutual information algorithm measures the degree of correlation between a single feature and the target feature by computing influence factors from their co-occurrence probabilities; this thesis therefore chooses it as the basis for feature selection. The thesis first improves the decision tree algorithm so that it can classify inconsistent data directly, and obtains good results. It focuses on the functional dependencies among the constraints on inconsistent data: exploiting the different ways that antecedent ("pre") features and consequent ("post") features behave in the data, it designs separate computations for the two kinds of features, so that the improved algorithm treats them differently. By modifying the objective function of the decision tree and changing the split computation for features that appear in the constraints, the improved algorithm partitions inconsistent data; it measures a constrained feature's influence on the classification result from several aspects and adjusts that feature's influence factor accordingly, making the node splits of the decision tree more accurate. As the volume of inconsistent data grows exponentially, so does the dimensionality of the feature space. High-dimensional features make building a classification model time-consuming, while features weakly correlated with the target feature contribute little to the model's quality. This thesis improves the mutual information algorithm for feature selection so that it can assess feature importance on inconsistent data sets and select the features with the greatest influence on the target feature for model building. By splitting the features of a functional dependency in the constraints into pre- and post-features, the improvements are tailored to how each kind behaves in inconsistent data. Comparative experiments show that the improved decision tree and mutual information algorithms clearly outperform the baseline algorithms in classification quality.
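The functional dependencies that the abstract refers to can be made concrete. The sketch below is not the thesis's algorithm; it is a minimal illustration, under assumed names, of how a functional dependency X → Y divides attributes into antecedent ("pre") features and consequent ("post") features, and how tuples violating the dependency (the inconsistent data) can be detected. The `zip → city` dependency and the toy rows are hypothetical examples.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Return the indices of tuples that violate the functional dependency lhs -> rhs.

    rows: list of dicts (tuples of a relation)
    lhs:  attribute names on the left-hand side (the "pre" features)
    rhs:  attribute names on the right-hand side (the "post" features)
    """
    rhs_seen = defaultdict(set)    # LHS value -> distinct RHS values observed
    members = defaultdict(list)    # LHS value -> indices of rows with that value
    for i, row in enumerate(rows):
        key = tuple(row[a] for a in lhs)
        rhs_seen[key].add(tuple(row[a] for a in rhs))
        members[key].append(i)
    violating = set()
    for key, values in rhs_seen.items():
        if len(values) > 1:        # one LHS value maps to >1 RHS value: inconsistency
            violating.update(members[key])
    return violating

# Hypothetical relation with the dependency zip -> city.
rows = [
    {"zip": "150001", "city": "Harbin"},
    {"zip": "150001", "city": "Haerbin"},   # conflicts with the row above
    {"zip": "100080", "city": "Beijing"},
]
print(sorted(fd_violations(rows, ["zip"], ["city"])))
```

The two conflicting tuples (indices 0 and 1) are exactly the inconsistent data that, per the abstract, the improved algorithms classify directly instead of repairing first.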
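As background for the feature-selection part of the abstract, the following is a minimal, textbook computation of mutual information from empirical co-occurrence probabilities, the quantity the thesis adapts to inconsistent data (the thesis's actual modification is not reproduced here). The toy features and labels are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical I(X;Y) in nats, estimated from co-occurrence counts."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        # p(x,y) * log( p(x,y) / (p(x) * p(y)) ), with probabilities as counts/n
        mi += p_joint * math.log(p_joint * n * n / (px[x] * py[y]))
    return mi

# Feature A predicts the label perfectly; feature B is independent of it,
# so A should rank strictly higher for selection.
feature_a = [0, 0, 1, 1]
feature_b = [0, 1, 0, 1]
label     = [0, 0, 1, 1]
print(mutual_information(feature_a, label) > mutual_information(feature_b, label))
```

Ranking features by this score and keeping only the top-scoring ones is the standard filter-style selection the abstract describes: weakly correlated features are dropped before building the classification model.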
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year conferred]: 2017
[CLC number]: TP311.13
Article ID: 2364954
Link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2364954.html