基于MapReduce的分布式快速聚類算法研究
本文關(guān)鍵詞: 海量數(shù)據(jù) 聚類 MapReduce 并行計算 數(shù)據(jù)挖掘 出處:《東北電力大學(xué)》2017年碩士論文 論文類型:學(xué)位論文
【摘要】:隨著信息技術(shù)的高速發(fā)展,數(shù)據(jù)規(guī)模呈現(xiàn)指數(shù)級增長態(tài)勢,傳統(tǒng)聚類算法面臨巨大的挑戰(zhàn)。一是海量數(shù)據(jù)內(nèi)的噪聲雜、冗余度高、價值密度低,聚類算法的準(zhǔn)確率不高;二是串行聚類算法面對海量數(shù)據(jù)時,搜索鄰域代價巨大,執(zhí)行效率無法適應(yīng)實際需求。針對上述問題,本文充分分析數(shù)據(jù)特點,基于MapReduce大數(shù)據(jù)處理框架,設(shè)計了分布式快速聚類算法,實現(xiàn)了高效、高精度的并行數(shù)據(jù)聚類。針對海量數(shù)據(jù)中冗余度高,無價值數(shù)據(jù)繁多的問題,本文提出一種基于MapReduce的分布式數(shù)據(jù)約減算法。通過一種新的抽樣算法計算數(shù)據(jù)點的矩形域和抽樣域,并在抽樣域中確定樣本數(shù)據(jù),然后對樣本數(shù)據(jù)進(jìn)行擴(kuò)展抽樣來達(dá)到約減原始數(shù)據(jù)集的目的,最后提出一種代表性驗證策略來檢驗樣本集,從而解決海量數(shù)據(jù)聚類產(chǎn)生巨大I/O開銷和網(wǎng)絡(luò)開銷的問題。針對搜索最近鄰代價消耗大,聚類執(zhí)行效率低的問題,本文利用Map任務(wù)對樣本數(shù)據(jù)集進(jìn)行相等大小的數(shù)據(jù)劃分,Reduce任務(wù)對數(shù)據(jù)子集進(jìn)行局部密度聚類,因此針對單節(jié)點提出基于擴(kuò)展區(qū)域查詢的密度聚類算法。首先通過基于固定網(wǎng)格的擴(kuò)展區(qū)域查詢方法,確定數(shù)據(jù)點最近鄰和反最近鄰的鄰域關(guān)系,建立每個數(shù)據(jù)點的影響空間域,然后提出異常點判定函數(shù),使算法能夠準(zhǔn)確地識別噪聲點和邊界點。Reduce聚類任務(wù)結(jié)束后輸出局部聚類結(jié)果,為得到面向整個數(shù)據(jù)集的全局聚類結(jié)果,本文提出一種基于簇間距離的局部類簇合并算法,通過簇間距離的計算確定局部類簇間的分布關(guān)系,得到可以兩兩合并的局部類簇對,然后根據(jù)連通子圖發(fā)現(xiàn)方法合并局部類簇對,最后輸出全局聚類結(jié)果。實驗結(jié)果表明,本文提出的算法有效地將海量數(shù)據(jù)進(jìn)行約減,保證了樣本數(shù)據(jù)與原始數(shù)據(jù)分布的一致性,在信息量無損失的前提下降低了數(shù)據(jù)冗余,并且該算法能夠快速處理任意形狀的類簇,大幅度提高了算法的執(zhí)行效率和聚類質(zhì)量。
[Abstract]:With the rapid development of information technology, the scale of data is increasing exponentially. The traditional clustering algorithm is faced with great challenges. One is the noise clutter, high redundancy and low value density in massive data. The accuracy of clustering algorithm is not high; Second, the serial clustering algorithm in the face of massive data, the search cost is huge, the execution efficiency can not meet the actual needs. In view of the above problems, this paper fully analyzes the characteristics of the data. Based on the MapReduce big data processing framework, a distributed fast clustering algorithm is designed to achieve efficient and accurate parallel data clustering. In this paper, a distributed data reduction algorithm based on MapReduce is proposed, and a new sampling algorithm is proposed to calculate the rectangular and sampling regions of data points. The sample data is determined in the sampling domain, and then the sample data is extended to reduce the original data set. Finally, a representative validation strategy is proposed to test the sample set. In order to solve the problem of huge I / O overhead and network overhead caused by massive data clustering, aiming at the problem of high cost of searching nearest neighbor and low efficiency of clustering execution. In this paper, the Map task is used to partition the sample data set with equal size and reduce task to cluster the local density of the data subset. Therefore, a density clustering algorithm based on extended region query is proposed for single node. Firstly, the neighborhood relationship between nearest neighbor and anti-nearest neighbor of data point is determined by the extended region query method based on fixed grid. The influence spatial domain of each data point is established, and the outlier decision function is proposed, which enables the algorithm to accurately identify the noise point and the boundary point. In order to obtain the global clustering results for the whole data set, this paper proposes a local cluster merging algorithm based on the distance between clusters, which determines the distribution of local clusters by calculating the distance between clusters. The local cluster pairs which can be merged by pairwise are obtained, and the local cluster pairs are merged according to the connected subgraph discovery method. Finally, the global clustering results are outputted. The experimental results show that. The algorithm proposed in this paper can effectively reduce the mass data, ensure the consistency of the distribution between the sample data and the original data, and reduce the data redundancy without loss of information. Moreover, the algorithm can deal with arbitrary clusters quickly, which greatly improves the efficiency and clustering quality of the algorithm.
【學(xué)位授予單位】:東北電力大學(xué)
【學(xué)位級別】:碩士
【學(xué)位授予年份】:2017
【分類號】:TP311.13
【參考文獻(xiàn)】
相關(guān)期刊論文 前10條
1 韓巖;李曉;;加速大數(shù)據(jù)聚類K-means算法的改進(jìn)[J];計算機(jī)工程與設(shè)計;2015年05期
2 于彥偉;王歡;王沁;趙金東;;面向海量數(shù)據(jù)流的基于密度的簇結(jié)構(gòu)挖掘算法[J];軟件學(xué)報;2015年05期
3 程學(xué)旗;靳小龍;王元卓;郭嘉豐;張鐵贏;李國杰;;大數(shù)據(jù)系統(tǒng)和分析技術(shù)綜述[J];軟件學(xué)報;2014年09期
4 陸穎華;馬廷淮;鐘水明;曹杰;王新;Abdullah Al-Dhelaane;;Improved locality-sensitive hashing method for the approximate nearest neighbor problem[J];Chinese Physics B;2014年08期
5 崔穎安;李雪;王志曉;張德運;;在線社交媒體數(shù)據(jù)抽樣方法的比較研究[J];計算機(jī)學(xué)報;2014年08期
6 顧榮;嚴(yán)金雙;楊曉亮;袁春風(fēng);黃宜華;;Hadoop MapReduce短作業(yè)執(zhí)行性能優(yōu)化[J];計算機(jī)研究與發(fā)展;2014年06期
7 崔穎安;李雪;王志曉;張德運;;社會化媒體大數(shù)據(jù)多階段整群抽樣方法[J];軟件學(xué)報;2014年04期
8 張引;陳敏;廖小飛;;大數(shù)據(jù)應(yīng)用的現(xiàn)狀與展望[J];計算機(jī)研究與發(fā)展;2013年S2期
9 劉淑芬;孟冬雪;王曉燕;;基于網(wǎng)格單元的DBSCAN算法[J];吉林大學(xué)學(xué)報(工學(xué)版);2014年04期
10 劉義;景寧;陳犖;熊偉;;MapReduce框架下基于R-樹的k-近鄰連接算法[J];軟件學(xué)報;2013年08期
,本文編號:1457393
本文鏈接:http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1457393.html