多數(shù)據(jù)流頻繁項(xiàng)集挖掘算法研究
本文選題:數(shù)據(jù)挖掘 + 多數(shù)據(jù)流; 參考:《山東師范大學(xué)》2017年碩士論文
【摘要】:隨著互聯(lián)網(wǎng)技術(shù)在眾多領(lǐng)域飛速地發(fā)展,網(wǎng)絡(luò)數(shù)據(jù)的存在形式也呈現(xiàn)出多樣化的趨勢(shì)。其中,數(shù)據(jù)流作為一種新型的數(shù)據(jù)形式已在眾多應(yīng)用領(lǐng)域廣泛地出現(xiàn)。例如,傳感器網(wǎng)絡(luò)環(huán)境中的數(shù)據(jù)、金融應(yīng)用中的財(cái)務(wù)數(shù)據(jù)和GPS定位系統(tǒng)所獲取的地理位置等數(shù)據(jù)。面對(duì)無(wú)限、連續(xù)和高速的海量數(shù)據(jù),傳統(tǒng)的數(shù)據(jù)挖掘技術(shù)難以直接應(yīng)用于發(fā)現(xiàn)海量數(shù)據(jù)流中的有效信息。因此,數(shù)據(jù)流挖掘問(wèn)題具有重要的研究意義。本文將多數(shù)據(jù)流頻繁項(xiàng)集挖掘算法作為研究對(duì)象。首先,闡述了課題的研究背景以及研究意義,同時(shí)概括總結(jié)了國(guó)內(nèi)外關(guān)于該課題的研究現(xiàn)狀。其次,闡述了在數(shù)據(jù)處理過(guò)程中所應(yīng)用的相關(guān)技術(shù)。最后,提出了兩種基于多數(shù)據(jù)流環(huán)境的頻繁項(xiàng)集挖掘算法。本文的主要工作可分為以下三個(gè)方面:(1)研究了多數(shù)據(jù)流頻繁項(xiàng)集挖掘算法的數(shù)據(jù)存儲(chǔ)結(jié)構(gòu),設(shè)計(jì)了一種基于FP-Tree的壓縮頻繁模式樹(shù)。本文對(duì)數(shù)據(jù)流的特點(diǎn)和表現(xiàn)形式進(jìn)行了深入地分析研究,設(shè)計(jì)了一種基于字典序列的前綴樹(shù)存儲(chǔ)結(jié)構(gòu),并在該結(jié)構(gòu)中引入了對(duì)數(shù)傾斜時(shí)間窗口模型。該窗口模型能夠增量地更新、保留頻繁項(xiàng)集的計(jì)數(shù)值,在一定程度上提高了內(nèi)存空間的利用率以及算法的空間復(fù)雜度。(2)研究了多數(shù)據(jù)流協(xié)同頻繁項(xiàng)集挖掘問(wèn)題,改進(jìn)了一種基于滑動(dòng)窗口模型的多數(shù)據(jù)流協(xié)同頻繁項(xiàng)集挖掘算法。本文引入了多數(shù)據(jù)流協(xié)同頻繁項(xiàng)集挖掘問(wèn)題,多數(shù)據(jù)流協(xié)同頻繁項(xiàng)集是指一組對(duì)象在很短的時(shí)間內(nèi)以伴隨的狀態(tài)頻繁地出現(xiàn)在一條數(shù)據(jù)流或多條數(shù)據(jù)流中。首先,通過(guò)基于字節(jié)序列的滑動(dòng)窗口挖掘算法發(fā)現(xiàn)數(shù)據(jù)流中的潛在頻繁項(xiàng)集和頻繁項(xiàng)集;其次,構(gòu)建頻繁模式樹(shù)用以存儲(chǔ)多數(shù)據(jù)流中的潛在頻繁項(xiàng)集和頻繁項(xiàng)集,并增量地更新樹(shù)結(jié)構(gòu)中對(duì)數(shù)傾斜時(shí)間表內(nèi)對(duì)應(yīng)項(xiàng)集出現(xiàn)的頻數(shù);最后,通過(guò)匯總分析得出多數(shù)據(jù)流中的協(xié)同頻繁項(xiàng)集。(3)研究了分布式環(huán)境中的多數(shù)據(jù)流協(xié)同頻繁項(xiàng)集挖掘算法,將多數(shù)據(jù)流協(xié)同頻繁項(xiàng)集挖掘算法并行化計(jì)算。在當(dāng)前的大數(shù)據(jù)背景下,數(shù)據(jù)流的規(guī)模呈現(xiàn)急劇增長(zhǎng)的趨勢(shì),其到達(dá)速度非?烨覍(duì)處理結(jié)果的實(shí)時(shí)性要求非常高。單個(gè)計(jì)算節(jié)點(diǎn)的計(jì)算能力難以承受規(guī)模如此巨大的數(shù)據(jù)。因此,傳統(tǒng)的集中式頻繁項(xiàng)集挖掘算法無(wú)法應(yīng)對(duì)規(guī)模日益劇增的數(shù)據(jù)流。為了解決這一問(wèn)題,本文采用了并行計(jì)算模型這一有效的途徑,還設(shè)計(jì)了能夠分布到不同計(jì)算節(jié)點(diǎn)上的分布式索引結(jié)構(gòu),能夠高效地發(fā)現(xiàn)存在于分布式環(huán)境中多數(shù)據(jù)流的協(xié)同頻繁項(xiàng)集。
[Abstract]:With the rapid development of Internet technology in many fields, the existing form of network data also presents a trend of diversification. As a new form of data, data flow has been widely used in many applications. For example, data in sensor network environment, financial data in financial applications and GPS positioning system In the face of infinite, continuous and high-speed mass data, the traditional data mining technology is difficult to directly apply to the discovery of effective information in the mass data stream. Therefore, the data stream mining problem has an important research significance. In this paper, the frequent item set mining algorithm of multi data flow is used as the research object. First, the topic is expounded. The research background and significance of the research are summarized, and the research status about this topic at home and abroad is summarized. Secondly, the related technologies used in the process of data processing are expounded. Finally, two kinds of frequent itemset mining algorithms based on multi data stream environment are proposed. The main work of this paper can be divided into three aspects: (1) many studies are made. The data stream frequent itemset mining algorithm is a data storage structure, and a FP-Tree based compression frequent pattern tree is designed. In this paper, the characteristics and forms of the data flow are deeply analyzed and studied. A prefix tree storage structure based on the dictionary sequence is designed, and the log skew time window model is introduced in this structure. The window model can be updated incrementally, retain the number of frequent itemsets, improve the utilization of memory space and the spatial complexity of the algorithm to a certain extent. (2) the problem of multi data stream co frequent itemset mining is studied, and a multi data stream cooperative frequent itemset mining algorithm based on sliding window model is improved. Multi data stream co frequent itemsets mining, multi data stream synergetic frequent itemsets are the frequent occurrence of a group of objects in a very short time in a data stream or multiple data streams. First, the potential frequent itemsets and frequent itemsets in the data stream are found through the sliding window mining algorithm based on the byte sequence. Secondly, the frequent pattern tree is constructed to store the potential frequent itemsets and frequent itemsets in the multi data stream, and incrementally update the frequency of the corresponding item set in the log sloping timetable in the tree structure. Finally, the synergetic frequent itemsets in the multi data stream are obtained by the summary analysis. (3) the multi data flow Association in the distributed environment is studied. With the frequent itemset mining algorithm, the multi data stream co frequent itemset mining algorithm is parallelized. In the current large data background, the scale of the data flow presents a rapid growth trend, its arrival speed is very fast and the real-time performance of the processing results is very high. The computing ability of single computing nodes is difficult to bear the size of such a huge amount. Therefore, the traditional centralized frequent itemset mining algorithm can not cope with the increasing scale of data flow. In order to solve this problem, this paper uses the parallel computing model as an effective way, and designs a distributed index structure that can be distributed to different computing nodes, and can efficiently find the distributed loop. Synergetic frequent itemsets of multiple data streams in the border.
【學(xué)位授予單位】:山東師范大學(xué)
【學(xué)位級(jí)別】:碩士
【學(xué)位授予年份】:2017
【分類(lèi)號(hào)】:TP311.13
【參考文獻(xiàn)】
相關(guān)期刊論文 前10條
1 王鑫;劉方愛(ài);;改進(jìn)的多數(shù)據(jù)流協(xié)同頻繁項(xiàng)集挖掘算法[J];計(jì)算機(jī)應(yīng)用;2016年07期
2 米允龍;米春橋;劉文奇;;海量數(shù)據(jù)挖掘過(guò)程相關(guān)技術(shù)研究進(jìn)展[J];計(jì)算機(jī)科學(xué)與探索;2015年06期
3 江雨燕;李平;;基于PFP-Growth算法的海量頻繁項(xiàng)集挖掘[J];計(jì)算機(jī)技術(shù) 與發(fā)展;2013年09期
4 毛伊敏;陳志剛;;在線(xiàn)挖掘數(shù)據(jù)流閉頻繁項(xiàng)集的高效算法[J];計(jì)算機(jī)科學(xué);2013年02期
5 李海峰;章寧;朱建明;曹懷虎;;時(shí)間敏感數(shù)據(jù)流上的頻繁項(xiàng)集挖掘算法[J];計(jì)算機(jī)學(xué)報(bào);2012年11期
6 王爽;王國(guó)仁;;基于滑動(dòng)窗口的Top-K概率頻繁項(xiàng)查詢(xún)算法研究[J];計(jì)算機(jī)研究與發(fā)展;2012年10期
7 李建江;崔健;王聃;嚴(yán)林;黃義雙;;MapReduce并行編程模型研究綜述[J];電子學(xué)報(bào);2011年11期
8 彭高輝;王志良;;數(shù)據(jù)挖掘中的數(shù)據(jù)預(yù)處理方法[J];華北水利水電學(xué)院學(xué)報(bào);2008年06期
9 李國(guó)徽;陳輝;;挖掘數(shù)據(jù)流任意滑動(dòng)時(shí)間窗口內(nèi)頻繁模式[J];軟件學(xué)報(bào);2008年10期
10 孫玉芬;盧炎生;;流數(shù)據(jù)挖掘綜述[J];計(jì)算機(jī)科學(xué);2007年01期
相關(guān)博士學(xué)位論文 前5條
1 于自強(qiáng);海量流數(shù)據(jù)挖掘相關(guān)問(wèn)題研究[D];山東大學(xué);2015年
2 李秋虹;基于MapReduce的大規(guī)模數(shù)據(jù)挖掘技術(shù)研究[D];復(fù)旦大學(xué);2013年
3 王樂(lè);數(shù)據(jù)流模式挖掘算法及應(yīng)用研究[D];大連理工大學(xué);2013年
4 倪萍;流數(shù)據(jù)挖掘關(guān)鍵技術(shù)研究[D];北京郵電大學(xué);2010年
5 屠莉;流數(shù)據(jù)的頻繁項(xiàng)挖掘及聚類(lèi)的關(guān)鍵技術(shù)研究[D];南京航空航天大學(xué);2009年
相關(guān)碩士學(xué)位論文 前8條
1 劉士佳;基于MapReduce框架的頻繁項(xiàng)集挖掘算法研究[D];哈爾濱理工大學(xué);2015年
2 白川平;數(shù)據(jù)流頻繁項(xiàng)集挖掘算法的研究[D];蘭州理工大學(xué);2014年
3 劉宇;基于云計(jì)算的聚類(lèi)挖掘算法及其應(yīng)用研究[D];南京郵電大學(xué);2014年
4 呂春陽(yáng);面向數(shù)據(jù)流的Top-k頻繁閉項(xiàng)集挖掘算法研究[D];哈爾濱工程大學(xué);2012年
5 李彥偉;基于關(guān)聯(lián)規(guī)則的數(shù)據(jù)挖掘方法研究[D];江南大學(xué);2011年
6 白云龍;基于Hadoop的數(shù)據(jù)挖掘算法研究與實(shí)現(xiàn)[D];北京郵電大學(xué);2011年
7 姜文;基于Hadoop平臺(tái)的數(shù)據(jù)分析和應(yīng)用[D];北京郵電大學(xué);2011年
8 姜軍曉;一種流數(shù)據(jù)頻繁模式挖掘算法的研究與實(shí)現(xiàn)[D];大連理工大學(xué);2007年
,本文編號(hào):1894465
本文鏈接:http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1894465.html