基于MapReduce的分布式快速聚類算法研究
本文關鍵詞: 海量數據 聚類 MapReduce 并行計算 數據挖掘 出處:《東北電力大學》2017年碩士論文 論文類型:學位論文
【摘要】:隨著信息技術的高速發(fā)展,數據規(guī)模呈現指數級增長態(tài)勢,傳統(tǒng)聚類算法面臨巨大的挑戰(zhàn)。一是海量數據內的噪聲雜、冗余度高、價值密度低,聚類算法的準確率不高;二是串行聚類算法面對海量數據時,搜索鄰域代價巨大,執(zhí)行效率無法適應實際需求。針對上述問題,本文充分分析數據特點,基于MapReduce大數據處理框架,設計了分布式快速聚類算法,實現了高效、高精度的并行數據聚類。針對海量數據中冗余度高,無價值數據繁多的問題,本文提出一種基于MapReduce的分布式數據約減算法。通過一種新的抽樣算法計算數據點的矩形域和抽樣域,并在抽樣域中確定樣本數據,然后對樣本數據進行擴展抽樣來達到約減原始數據集的目的,最后提出一種代表性驗證策略來檢驗樣本集,從而解決海量數據聚類產生巨大I/O開銷和網絡開銷的問題。針對搜索最近鄰代價消耗大,聚類執(zhí)行效率低的問題,本文利用Map任務對樣本數據集進行相等大小的數據劃分,Reduce任務對數據子集進行局部密度聚類,因此針對單節(jié)點提出基于擴展區(qū)域查詢的密度聚類算法。首先通過基于固定網格的擴展區(qū)域查詢方法,確定數據點最近鄰和反最近鄰的鄰域關系,建立每個數據點的影響空間域,然后提出異常點判定函數,使算法能夠準確地識別噪聲點和邊界點。Reduce聚類任務結束后輸出局部聚類結果,為得到面向整個數據集的全局聚類結果,本文提出一種基于簇間距離的局部類簇合并算法,通過簇間距離的計算確定局部類簇間的分布關系,得到可以兩兩合并的局部類簇對,然后根據連通子圖發(fā)現方法合并局部類簇對,最后輸出全局聚類結果。實驗結果表明,本文提出的算法有效地將海量數據進行約減,保證了樣本數據與原始數據分布的一致性,在信息量無損失的前提下降低了數據冗余,并且該算法能夠快速處理任意形狀的類簇,大幅度提高了算法的執(zhí)行效率和聚類質量。
[Abstract]:With the rapid development of information technology, the scale of data is increasing exponentially. The traditional clustering algorithm is faced with great challenges. One is the noise clutter, high redundancy and low value density in massive data. The accuracy of clustering algorithm is not high; Second, the serial clustering algorithm in the face of massive data, the search cost is huge, the execution efficiency can not meet the actual needs. In view of the above problems, this paper fully analyzes the characteristics of the data. Based on the MapReduce big data processing framework, a distributed fast clustering algorithm is designed to achieve efficient and accurate parallel data clustering. In this paper, a distributed data reduction algorithm based on MapReduce is proposed, and a new sampling algorithm is proposed to calculate the rectangular and sampling regions of data points. The sample data is determined in the sampling domain, and then the sample data is extended to reduce the original data set. Finally, a representative validation strategy is proposed to test the sample set. In order to solve the problem of huge I / O overhead and network overhead caused by massive data clustering, aiming at the problem of high cost of searching nearest neighbor and low efficiency of clustering execution. In this paper, the Map task is used to partition the sample data set with equal size and reduce task to cluster the local density of the data subset. Therefore, a density clustering algorithm based on extended region query is proposed for single node. Firstly, the neighborhood relationship between nearest neighbor and anti-nearest neighbor of data point is determined by the extended region query method based on fixed grid. The influence spatial domain of each data point is established, and the outlier decision function is proposed, which enables the algorithm to accurately identify the noise point and the boundary point. In order to obtain the global clustering results for the whole data set, this paper proposes a local cluster merging algorithm based on the distance between clusters, which determines the distribution of local clusters by calculating the distance between clusters. The local cluster pairs which can be merged by pairwise are obtained, and the local cluster pairs are merged according to the connected subgraph discovery method. Finally, the global clustering results are outputted. The experimental results show that. The algorithm proposed in this paper can effectively reduce the mass data, ensure the consistency of the distribution between the sample data and the original data, and reduce the data redundancy without loss of information. Moreover, the algorithm can deal with arbitrary clusters quickly, which greatly improves the efficiency and clustering quality of the algorithm.
【學位授予單位】:東北電力大學
【學位級別】:碩士
【學位授予年份】:2017
【分類號】:TP311.13
【參考文獻】
相關期刊論文 前10條
1 韓巖;李曉;;加速大數據聚類K-means算法的改進[J];計算機工程與設計;2015年05期
2 于彥偉;王歡;王沁;趙金東;;面向海量數據流的基于密度的簇結構挖掘算法[J];軟件學報;2015年05期
3 程學旗;靳小龍;王元卓;郭嘉豐;張鐵贏;李國杰;;大數據系統(tǒng)和分析技術綜述[J];軟件學報;2014年09期
4 陸穎華;馬廷淮;鐘水明;曹杰;王新;Abdullah Al-Dhelaane;;Improved locality-sensitive hashing method for the approximate nearest neighbor problem[J];Chinese Physics B;2014年08期
5 崔穎安;李雪;王志曉;張德運;;在線社交媒體數據抽樣方法的比較研究[J];計算機學報;2014年08期
6 顧榮;嚴金雙;楊曉亮;袁春風;黃宜華;;Hadoop MapReduce短作業(yè)執(zhí)行性能優(yōu)化[J];計算機研究與發(fā)展;2014年06期
7 崔穎安;李雪;王志曉;張德運;;社會化媒體大數據多階段整群抽樣方法[J];軟件學報;2014年04期
8 張引;陳敏;廖小飛;;大數據應用的現狀與展望[J];計算機研究與發(fā)展;2013年S2期
9 劉淑芬;孟冬雪;王曉燕;;基于網格單元的DBSCAN算法[J];吉林大學學報(工學版);2014年04期
10 劉義;景寧;陳犖;熊偉;;MapReduce框架下基于R-樹的k-近鄰連接算法[J];軟件學報;2013年08期
,本文編號:1457393
本文鏈接:http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1457393.html