Parallel Mining of Contextual Outlier Data Based on Relevant Subspaces
Topics: outlier data + contextual information. Source: Taiyuan University of Science and Technology, master's thesis, 2017
[Abstract]: Outlier mining is an important research topic in data mining. Outliers are data objects in a given data set whose characteristics are inconsistent with, and clearly different from, those of most other data. With the explosive growth of data volume and dimensionality, the low efficiency of traditional outlier mining algorithms has become prominent, making them hard to apply to massive high-dimensional data sets. In addition, traditional outlier mining generally focuses only on the efficiency and accuracy of mining; the interpretability and comprehensibility of the mining results have received comparatively little study, which makes the detected outliers hard to understand. This thesis uses relevant subspaces to study the parallel mining of contextual outlier data in some depth. The main results are as follows. (1) A contextual outlier mining algorithm under the MapReduce programming model is presented. The algorithm uses the local sparsity difference to determine the relevant subspace of each data object and computes the object's outlier factor in that subspace; the outlier factor, together with the set of relevant attribute dimensions of the relevant subspace, is defined as the object's context information; the N data objects with the largest outlier factors are selected as contextual outliers; and a parallel contextual outlier mining algorithm is given using the MapReduce programming model. Finally, experiments on UCI data sets verify that the context information provided by the algorithm effectively improves the interpretability and comprehensibility of outliers. (2) Using the Spark in-memory computing platform, a parallel contextual outlier mining algorithm based on relevant subspaces is presented. By means of resilient distributed datasets (RDDs), the algorithm keeps the K-nearest-neighbor sets, the local sparsity matrix, and the local sparsity difference matrix in memory, which effectively improves mining efficiency and reduces I/O cost. Experiments on an astronomical spectral data set verify that the algorithm has good scalability and extensibility on the Spark in-memory computing platform.
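The selection step described in the abstract (pick the N objects with the largest outlier factors and report each one with its context information) can be sketched in a map/reduce style. This is only an illustrative sketch, not the thesis's algorithm: the names `ScoredObject`, `map_local_top_n`, `reduce_global_top_n`, and `top_n_contextual_outliers` are hypothetical, the outlier factors and relevant attribute sets are assumed to have been computed already, and the local sparsity difference itself is not implemented here because the abstract does not define it.

```python
import heapq
from typing import Iterable, List, NamedTuple, Sequence, Tuple


class ScoredObject(NamedTuple):
    """A data object with hypothetical, already-computed context information."""
    object_id: str
    outlier_factor: float                  # assumed precomputed in the relevant subspace
    relevant_attributes: Tuple[str, ...]   # attribute dimensions of that subspace


def map_local_top_n(partition: Iterable[ScoredObject], n: int) -> List[ScoredObject]:
    """Map step: keep only the n highest-scoring objects of one partition."""
    return heapq.nlargest(n, partition, key=lambda obj: obj.outlier_factor)


def reduce_global_top_n(local_results: Iterable[List[ScoredObject]], n: int) -> List[ScoredObject]:
    """Reduce step: merge per-partition candidates into the global top n."""
    merged = [obj for local in local_results for obj in local]
    return heapq.nlargest(n, merged, key=lambda obj: obj.outlier_factor)


def top_n_contextual_outliers(partitions: Sequence[Iterable[ScoredObject]], n: int) -> List[ScoredObject]:
    """Run the map step on every partition, then the reduce step once."""
    return reduce_global_top_n((map_local_top_n(p, n) for p in partitions), n)


if __name__ == "__main__":
    # Made-up scores and attribute names, purely for illustration.
    partitions = [
        [ScoredObject("a", 0.2, ("x1",)), ScoredObject("b", 3.1, ("x2", "x5"))],
        [ScoredObject("c", 1.7, ("x3",)), ScoredObject("d", 0.4, ("x1", "x4"))],
    ]
    for obj in top_n_contextual_outliers(partitions, n=2):
        print(obj.object_id, obj.outlier_factor, obj.relevant_attributes)
```

Because each partition forwards at most N candidates, the reduce step handles at most N times the number of partitions, regardless of data set size. Each returned object carries its relevant attribute set, which is what makes the outlier interpretable in the sense the abstract describes.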
[Degree-granting institution]: Taiyuan University of Science and Technology
[Degree level]: Master's
[Year conferred]: 2017
[CLC number]: TP311.13
Article ID: 1802706
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1802706.html