A Randomized Parallel Association Analysis Algorithm Based on the PAC Model
Published: 2018-11-26 14:37
【Abstract】: With the arrival of the big data era and the rapid growth of data set sizes, frequent pattern mining based on parallel/distributed computing has a clear advantage over traditional techniques, constrained by memory and by a single node, when processing massive data sets. Against this background, this thesis proposes an efficient association analysis algorithm based on random sampling for discovering approximate frequent itemsets in large-scale data sets. The core of the algorithm is to mine approximate frequent itemsets from a random sample of the data set. The standard Apriori and FP-Growth algorithms usually need to scan the entire data set several times to find the exact frequent itemsets, which inevitably incurs high storage and computation costs on very large data sets. To mine large data sets more efficiently, this thesis proposes two new random-sampling techniques that extract high-quality approximate frequent itemsets from a sample of a transactional data set and guarantee, with high probability, that the approximate collection is a superset of the true frequent itemsets of the data set. The first technique applies the empirical Rademacher complexity from computational learning theory, combined with concentration inequalities, to derive an approximate upper bound on the empirical Rademacher complexity of the sample, and then uses results from statistical learning theory to obtain an approximate upper bound on the supremum of the absolute frequency error of the itemsets. The second technique bounds the absolute frequency error of the itemsets directly through a variant of the classical Bernstein inequality. The PAC learning framework is then applied to the analysis of both sampling techniques: via the (ε, δ)-approximation of the frequent itemsets, an approximate sample bound satisfying PAC sample completeness is derived for each technique, and the algorithm draws a sample of the corresponding size to mine the approximate frequent itemsets.

Since the frequent itemset discovery problem was first posed, much of the literature has studied sampling-based approaches to it, and with the emergence of the single-machine serial computing bottleneck, more and more work has moved frequent itemset discovery onto parallel/distributed computing platforms. The extended experiments in this thesis use both the real retail data set and the synthetic data set T10I4D100K. The correctness of the methods is evaluated by the recall, the precision, and the estimated frequency error of the approximate frequent itemsets returned from mining the samples; the running time is evaluated both in serial execution (on a single machine) and in parallel execution (on a pseudo-distributed Hadoop platform); finally, the performance and the sample-size requirements of the proposed theoretical methods are compared. To the best of our knowledge, this is the first work to apply the empirical Rademacher complexity and the Bernstein inequality to sampling-based frequent itemset discovery. Using results from statistical learning, it derives comparatively tight theoretical sample bounds, tighter than the existing VC-dimension-based sample bound for frequent itemset discovery proposed by Riondato and Upfal, and therefore more efficient in execution. Moreover, the sample complexity suggested by our methods is a polynomial function of 1/ε and 1/δ, so the problem is PAC-learnable. We apply classical concepts and techniques from computational learning theory to a practical problem, an approach that may also be useful in other areas of data mining.
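To give a concrete sense of the first technique, the Python sketch below estimates, by Monte Carlo over random sign vectors, the empirical Rademacher complexity of the family of itemset indicator functions on a small sample of transactions. The toy transactions, the candidate itemset family, and the number of sign draws are illustrative assumptions only; the thesis bounds this quantity analytically rather than estimating it this way.

import random
from itertools import combinations

def empirical_rademacher(transactions, itemsets, n_draws=200, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of the
    function family {t -> 1[X is a subset of t] : X in itemsets} on the sample:
        R_hat = E_sigma[ sup_X (1/m) * sum_i sigma_i * 1[X subset of t_i] ],
    where each sigma_i is an independent uniform sign in {-1, +1}."""
    rng = random.Random(seed)
    m = len(transactions)
    total = 0.0
    for _ in range(n_draws):
        sigma = [rng.choice((-1, 1)) for _ in range(m)]
        # Supremum over the itemset family of the signed average support.
        total += max(
            sum(s for s, t in zip(sigma, transactions) if X <= t) / m
            for X in itemsets
        )
    return total / n_draws

if __name__ == "__main__":
    # Toy sample of six transactions over four items, and the family of all
    # non-empty itemsets over those items (purely illustrative).
    sample = [frozenset(t) for t in (["a", "b"], ["a", "c", "d"], ["b", "c"],
                                     ["a", "b", "d"], ["c"], ["a", "b", "c"])]
    items = sorted({i for t in sample for i in t})
    family = [frozenset(c) for r in range(1, len(items) + 1)
              for c in combinations(items, r)]
    print(round(empirical_rademacher(sample, family), 4))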
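The second technique rests on a Bernstein-type tail bound. Purely as an indication of the flavor of the argument (the exact variant and constants used in the thesis may differ): for an itemset X with true frequency f(X), sample frequency f_S(X) over m uniformly sampled transactions, variance at most σ² ≤ 1/4, and a union bound over d candidate itemsets, a textbook Bernstein inequality gives

\Pr\Bigl[\exists X :\ |f_S(X) - f(X)| \ge \varepsilon\Bigr]
  \le 2d \exp\!\Bigl(-\frac{m\varepsilon^2}{2\sigma^2 + \frac{2}{3}\varepsilon}\Bigr)
  \le \delta
  \quad\text{whenever}\quad
  m \ge \frac{\bigl(2\sigma^2 + \frac{2}{3}\varepsilon\bigr)\,\ln(2d/\delta)}{\varepsilon^2},

so a sample size that is polynomial in 1/ε (and logarithmic, hence polynomial, in 1/δ) suffices, consistent with the PAC-learnability claim above.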
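Combining the sampling bound with the mining step, the sketch below shows one plausible shape of the serial (single-machine) variant of such an algorithm: it computes a sample size from a Bernstein-style bound, draws a uniform random sample of transactions, and mines the sample level by level with the support threshold lowered by ε, so that with probability at least 1 − δ the output is a superset of the truly frequent itemsets. The function names, parameters, and naive Apriori-style counting are assumptions made for illustration; this is not the thesis's implementation, nor its Hadoop-based parallel version.

import math
import random

def sample_size(eps, delta, d, var=0.25):
    """Bernstein-style tail bound plus a union bound over d candidate itemsets."""
    return math.ceil((2 * var + 2 * eps / 3) * math.log(2 * d / delta) / eps ** 2)

def mine_sample(transactions, theta, eps, delta, max_len=3, seed=0):
    """Mine approximate frequent itemsets from a uniform random sample.

    Keeps every itemset whose sample frequency is at least theta - eps, so that
    whenever all sample frequencies are within eps of the true ones (which the
    sample size is chosen to guarantee with probability >= 1 - delta), the
    result is a superset of the itemsets with true frequency >= theta."""
    rng = random.Random(seed)
    items = sorted({i for t in transactions for i in t})
    # Crude count of candidate itemsets of length <= max_len for the union bound.
    d = sum(math.comb(len(items), k) for k in range(1, max_len + 1))
    m = min(sample_size(eps, delta, d), len(transactions))
    sample = [transactions[i] for i in rng.sample(range(len(transactions)), m)]

    result = {}
    frontier = [frozenset([i]) for i in items]
    for _ in range(max_len):
        freqs = {X: sum(1 for t in sample if X <= t) / m for X in frontier}
        kept = {X: f for X, f in freqs.items() if f >= theta - eps}
        result.update(kept)
        # Apriori-style candidate generation from the surviving itemsets.
        frontier = list({X | Y for X in kept for Y in kept if len(X | Y) == len(X) + 1})
        if not frontier:
            break
    return result

if __name__ == "__main__":
    # Tiny synthetic data set standing in for retail / T10I4D100K.
    gen = random.Random(1)
    universe = ["i%d" % k for k in range(20)]
    data = [frozenset(gen.sample(universe, gen.randint(2, 6))) for _ in range(5000)]
    approx = mine_sample(data, theta=0.10, eps=0.02, delta=0.05)
    for X, f in sorted(approx.items(), key=lambda kv: -kv[1])[:10]:
        print(sorted(X), round(f, 3))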
【Degree-granting institution】: Yunnan University of Finance and Economics
【Degree level】: Master's
【Year of degree conferral】: 2017
【CLC number】: TP311.13
Article ID: 2358832
Link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2358832.html