A Randomized Parallel Association Analysis Algorithm Based on the PAC Model
Published: 2018-11-26 14:37
【Abstract】: With the arrival of the big data era and the rapid growth of data set sizes, frequent pattern mining based on parallel/distributed computing has a clear advantage over traditional techniques, constrained by memory and by a single node, when processing massive data sets. Against this background, this thesis proposes an efficient association analysis algorithm based on random sampling for discovering approximate frequent itemsets in large-scale data sets. The core of the algorithm is to mine approximate frequent itemsets from a random sample of the data set. The standard Apriori and FP-Growth algorithms usually need to scan the entire data set several times to find the exact frequent itemsets, which inevitably incurs high storage and computation costs on very large data sets. To mine large data sets more efficiently, this thesis proposes two new random-sampling techniques that extract high-quality approximate frequent itemsets from a sample of a transactional data set and guarantee, with high probability, that the approximate collection is a superset of the true frequent itemsets of the data set. The first technique applies the empirical Rademacher complexity from computational learning theory, combined with concentration inequalities, to derive an approximate upper bound on the empirical Rademacher complexity of the sample, and then uses results from statistical learning theory to obtain an approximate upper bound on the supremum of the absolute frequency error of the itemsets. The second technique bounds the absolute frequency error of the itemsets directly through a variant of the classical Bernstein inequality. The PAC learning framework is then applied to the analysis of both sampling techniques: via the (ε, δ)-approximation of the frequent itemsets, an approximate sample bound satisfying PAC sample completeness is derived for each technique, and the algorithm draws a sample of the corresponding size to mine the approximate frequent itemsets.

Since the frequent itemset discovery problem was first posed, much of the literature has studied sampling-based approaches to it, and with the emergence of the single-machine serial computing bottleneck, more and more work has moved frequent itemset discovery onto parallel/distributed computing platforms. The extended experiments in this thesis use both the real retail data set and the synthetic data set T10I4D100K. The correctness of the methods is evaluated by the recall, the precision, and the estimated frequency error of the approximate frequent itemsets returned from mining the samples; the running time is evaluated both in serial execution (on a single machine) and in parallel execution (on a pseudo-distributed Hadoop platform); finally, the performance and the sample-size requirements of the proposed theoretical methods are compared. To the best of our knowledge, this is the first work to apply the empirical Rademacher complexity and the Bernstein inequality to sampling-based frequent itemset discovery. Using results from statistical learning, it derives comparatively tight theoretical sample bounds, tighter than the existing VC-dimension-based sample bound for frequent itemset discovery proposed by Riondato and Upfal, and therefore more efficient in execution. Moreover, the sample complexity suggested by our methods is a polynomial function of 1/ε and 1/δ, so the problem is PAC-learnable. We apply classical concepts and techniques from computational learning theory to a practical problem, an approach that may also be useful in other areas of data mining.
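To give a concrete sense of the first technique, the Python sketch below estimates, by Monte Carlo over random sign vectors, the empirical Rademacher complexity of the family of itemset indicator functions on a small sample of transactions. The toy transactions, the candidate itemset family, and the number of sign draws are illustrative assumptions only; the thesis bounds this quantity analytically rather than estimating it this way.

import random
from itertools import combinations

def empirical_rademacher(transactions, itemsets, n_draws=200, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of the
    function family {t -> 1[X is a subset of t] : X in itemsets} on the sample:
        R_hat = E_sigma[ sup_X (1/m) * sum_i sigma_i * 1[X subset of t_i] ],
    where each sigma_i is an independent uniform sign in {-1, +1}."""
    rng = random.Random(seed)
    m = len(transactions)
    total = 0.0
    for _ in range(n_draws):
        sigma = [rng.choice((-1, 1)) for _ in range(m)]
        # Supremum over the itemset family of the signed average support.
        total += max(
            sum(s for s, t in zip(sigma, transactions) if X <= t) / m
            for X in itemsets
        )
    return total / n_draws

if __name__ == "__main__":
    # Toy sample of six transactions over four items, and the family of all
    # non-empty itemsets over those items (purely illustrative).
    sample = [frozenset(t) for t in (["a", "b"], ["a", "c", "d"], ["b", "c"],
                                     ["a", "b", "d"], ["c"], ["a", "b", "c"])]
    items = sorted({i for t in sample for i in t})
    family = [frozenset(c) for r in range(1, len(items) + 1)
              for c in combinations(items, r)]
    print(round(empirical_rademacher(sample, family), 4))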
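The second technique rests on a Bernstein-type tail bound. Purely as an indication of the flavor of the argument (the exact variant and constants used in the thesis may differ): for an itemset X with true frequency f(X), sample frequency f_S(X) over m uniformly sampled transactions, variance at most σ² ≤ 1/4, and a union bound over d candidate itemsets, a textbook Bernstein inequality gives

\Pr\Bigl[\exists X :\ |f_S(X) - f(X)| \ge \varepsilon\Bigr]
  \le 2d \exp\!\Bigl(-\frac{m\varepsilon^2}{2\sigma^2 + \frac{2}{3}\varepsilon}\Bigr)
  \le \delta
  \quad\text{whenever}\quad
  m \ge \frac{\bigl(2\sigma^2 + \frac{2}{3}\varepsilon\bigr)\,\ln(2d/\delta)}{\varepsilon^2},

so a sample size that is polynomial in 1/ε (and logarithmic, hence polynomial, in 1/δ) suffices, consistent with the PAC-learnability claim above.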
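Combining the sampling bound with the mining step, the sketch below shows one plausible shape of the serial (single-machine) variant of such an algorithm: it computes a sample size from a Bernstein-style bound, draws a uniform random sample of transactions, and mines the sample level by level with the support threshold lowered by ε, so that with probability at least 1 − δ the output is a superset of the truly frequent itemsets. The function names, parameters, and naive Apriori-style counting are assumptions made for illustration; this is not the thesis's implementation, nor its Hadoop-based parallel version.

import math
import random

def sample_size(eps, delta, d, var=0.25):
    """Bernstein-style tail bound plus a union bound over d candidate itemsets."""
    return math.ceil((2 * var + 2 * eps / 3) * math.log(2 * d / delta) / eps ** 2)

def mine_sample(transactions, theta, eps, delta, max_len=3, seed=0):
    """Mine approximate frequent itemsets from a uniform random sample.

    Keeps every itemset whose sample frequency is at least theta - eps, so that
    whenever all sample frequencies are within eps of the true ones (which the
    sample size is chosen to guarantee with probability >= 1 - delta), the
    result is a superset of the itemsets with true frequency >= theta."""
    rng = random.Random(seed)
    items = sorted({i for t in transactions for i in t})
    # Crude count of candidate itemsets of length <= max_len for the union bound.
    d = sum(math.comb(len(items), k) for k in range(1, max_len + 1))
    m = min(sample_size(eps, delta, d), len(transactions))
    sample = [transactions[i] for i in rng.sample(range(len(transactions)), m)]

    result = {}
    frontier = [frozenset([i]) for i in items]
    for _ in range(max_len):
        freqs = {X: sum(1 for t in sample if X <= t) / m for X in frontier}
        kept = {X: f for X, f in freqs.items() if f >= theta - eps}
        result.update(kept)
        # Apriori-style candidate generation from the surviving itemsets.
        frontier = list({X | Y for X in kept for Y in kept if len(X | Y) == len(X) + 1})
        if not frontier:
            break
    return result

if __name__ == "__main__":
    # Tiny synthetic data set standing in for retail / T10I4D100K.
    gen = random.Random(1)
    universe = ["i%d" % k for k in range(20)]
    data = [frozenset(gen.sample(universe, gen.randint(2, 6))) for _ in range(5000)]
    approx = mine_sample(data, theta=0.10, eps=0.02, delta=0.05)
    for X, f in sorted(approx.items(), key=lambda kv: -kv[1])[:10]:
        print(sorted(X), round(f, 3))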
【Degree-granting institution】: Yunnan University of Finance and Economics
【Degree level】: Master's
【Year of degree conferral】: 2017
【CLC number】: TP311.13
Article ID: 2358832
Link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2358832.html