Research on Time Series Classification Methods Based on Diversified Top-k Shapelets
Published: 2018-10-08 20:09
[Abstract]: A time series is a sequence formed by arranging the values of a statistical indicator of some phenomenon, observed at different times, in chronological order. Because real systems and phenomena are usually influenced by many internal factors, the time series they produce exhibit complex characteristics: high dimensionality, complex structure, noise, and similarity distortions. Traditional time series analysis models the series with statistical methods, but these complex characteristics make it hard for the resulting models to meet the requirements of real systems. Data-mining-based approaches emerged in response, and time series mining has become an active research field. Time series classification is an important topic within time series data mining; its task is to build a classifier that assigns a class label to a given time series. As a classification approach based on local shape features, shapelets can distinguish subtle differences between subsequences and therefore achieve good classification results. They have been applied in fields such as medical diagnosis and posture recognition, but several problems remain to be solved. The main contributions of this thesis are as follows.

(1) To address redundancy in the optimal shapelet set produced by existing shapelet-based classification methods, a time series classification method based on the diversified top-k shapelets transform (Div Top KShapelet) is proposed. Drawing on diversified top-k queries from the data retrieval field, the thesis introduces the concept of diversified top-k shapelets and a corresponding diversified top-k shapelets graph, which are used to process the candidate shapelets and select those that are most discriminative and mutually dissimilar. In addition, SAX (Symbolic Aggregate approXimation) is used to reduce the dimensionality of the original time series dataset. Experimental results show that the method achieves higher accuracy than traditional classification methods. Compared with the clustering-based filtering method (Cluster Shapelet) and the shapelet coverage method (Shapelet Selection), classification accuracy improves by up to 48.43% and 32.61%, respectively. Computational efficiency also improves on all 15 datasets, with a speedup of at least 1.09 times and up to 287.8 times.

(2) To address the inability of existing shapelet classification methods to handle imbalanced time series classification, a time series classification method based on the diversified top-k shapelets transform (Div IMShapelet+SMOTE) is proposed. AUC, an evaluation metric for imbalanced data classification, replaces the traditional information entropy as the criterion for measuring shapelet quality; the training set is then transformed using diversified top-k shapelets; finally, SMOTE (Synthetic Minority Over-sampling Technique) is applied to oversample the transformed training set. Because AUC is insensitive to class imbalance, shapelet features selected this way reflect classification quality more accurately. The method thus not only extracts time series features effectively but also balances the dataset in the feature space. Experiments show that Div IMShapelet+SMOTE performs best when compared with Div Top KShapelet and INOS+SVM: classification accuracy improves by up to 38.8% and 10.2%, AUC by up to 0.37 and 0.08, and F-measure by up to 0.35 and 0.15, respectively, so the method can effectively handle imbalanced time series classification.
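The pipeline outlined in the abstract (score candidate shapelets, keep the best ones that are not similar to each other, then re-express each series as distances to the kept shapelets) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration rather than the thesis's implementation: the function names, the distance-threshold notion of "similar", the greedy selection (a simple stand-in for the graph-based diversified top-k procedure the thesis describes), and the symmetric `max(auc, 1 - auc)` convention in the AUC score. SAX dimensionality reduction and SMOTE oversampling are omitted.

```python
from math import sqrt


def subsequence_distance(series, shapelet):
    """Smallest Euclidean distance between the shapelet and any
    equal-length window of the series."""
    m = len(shapelet)
    best = float("inf")
    for start in range(len(series) - m + 1):
        d = sqrt(sum((series[start + i] - shapelet[i]) ** 2 for i in range(m)))
        best = min(best, d)
    return best


def auc_quality(distances, labels, positive):
    """AUC obtained when a small distance to the shapelet is taken as
    evidence for the positive class (pairwise Mann-Whitney form).
    Returning max(auc, 1 - auc) is an assumption of this sketch."""
    pos = [d for d, y in zip(distances, labels) if y == positive]
    neg = [d for d, y in zip(distances, labels) if y != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p < n) + 0.5 * (p == n) for p in pos for n in neg)
    auc = wins / (len(pos) * len(neg))
    return max(auc, 1.0 - auc)


def make_similarity(threshold):
    """Hypothetical similarity test: two shapelets are redundant when the
    shorter one matches some window of the longer one within threshold."""
    def similar(a, b):
        short, long_ = (a, b) if len(a) <= len(b) else (b, a)
        return subsequence_distance(long_, short) <= threshold
    return similar


def diversified_top_k(candidates, k, similar):
    """Greedy stand-in for diversified top-k: walk the (score, shapelet)
    pairs in descending score order and keep a candidate only if it is
    not similar to any shapelet already kept."""
    chosen = []
    for score, shapelet in sorted(candidates, key=lambda c: c[0], reverse=True):
        if all(not similar(shapelet, kept) for _, kept in chosen):
            chosen.append((score, shapelet))
        if len(chosen) == k:
            break
    return chosen


def shapelet_transform(dataset, shapelets):
    """Map every series to a feature vector of distances to each shapelet."""
    return [[subsequence_distance(s, sh) for sh in shapelets] for s in dataset]


if __name__ == "__main__":
    candidates = [(0.9, [0, 1, 0]), (0.85, [0, 1.05, 0]),
                  (0.6, [1, 0, 1]), (0.5, [3, 3, 3])]
    picked = diversified_top_k(candidates, 2, make_similarity(0.2))
    shapelets = [sh for _, sh in picked]
    print(shapelets)
    print(shapelet_transform([[0, 0, 1, 0, 0], [1, 1, 0, 1, 1]], shapelets))
```

In the demo, the second-ranked candidate `[0, 1.05, 0]` is skipped because it lies within the threshold of the top-ranked `[0, 1, 0]`, so the lower-scored but dissimilar `[1, 0, 1]` is kept instead. Any standard classifier can then be trained on the transformed feature vectors.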
[Degree-granting institution]: China University of Mining and Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP311.13; O211.61
Article ID: 2258104
Article link: http://sikaile.net/kejilunwen/yysx/2258104.html