天堂国产午夜亚洲专区-少妇人妻综合久久蜜臀-国产成人户外露出视频在线-国产91传媒一区二区三区

當(dāng)前位置:主頁 > 科技論文 > 軟件論文 >

基于MapReduce的分布式快速聚類算法研究

發(fā)布時間:2018-01-23 11:05

  本文關(guān)鍵詞: 海量數(shù)據(jù) 聚類 MapReduce 并行計算 數(shù)據(jù)挖掘 出處:《東北電力大學(xué)》2017年碩士論文 論文類型:學(xué)位論文


【摘要】:隨著信息技術(shù)的高速發(fā)展,數(shù)據(jù)規(guī)模呈現(xiàn)指數(shù)級增長態(tài)勢,傳統(tǒng)聚類算法面臨巨大的挑戰(zhàn)。一是海量數(shù)據(jù)內(nèi)的噪聲雜、冗余度高、價值密度低,聚類算法的準(zhǔn)確率不高;二是串行聚類算法面對海量數(shù)據(jù)時,搜索鄰域代價巨大,執(zhí)行效率無法適應(yīng)實際需求。針對上述問題,本文充分分析數(shù)據(jù)特點,基于MapReduce大數(shù)據(jù)處理框架,設(shè)計了分布式快速聚類算法,實現(xiàn)了高效、高精度的并行數(shù)據(jù)聚類。針對海量數(shù)據(jù)中冗余度高,無價值數(shù)據(jù)繁多的問題,本文提出一種基于MapReduce的分布式數(shù)據(jù)約減算法。通過一種新的抽樣算法計算數(shù)據(jù)點的矩形域和抽樣域,并在抽樣域中確定樣本數(shù)據(jù),然后對樣本數(shù)據(jù)進(jìn)行擴(kuò)展抽樣來達(dá)到約減原始數(shù)據(jù)集的目的,最后提出一種代表性驗證策略來檢驗樣本集,從而解決海量數(shù)據(jù)聚類產(chǎn)生巨大I/O開銷和網(wǎng)絡(luò)開銷的問題。針對搜索最近鄰代價消耗大,聚類執(zhí)行效率低的問題,本文利用Map任務(wù)對樣本數(shù)據(jù)集進(jìn)行相等大小的數(shù)據(jù)劃分,Reduce任務(wù)對數(shù)據(jù)子集進(jìn)行局部密度聚類,因此針對單節(jié)點提出基于擴(kuò)展區(qū)域查詢的密度聚類算法。首先通過基于固定網(wǎng)格的擴(kuò)展區(qū)域查詢方法,確定數(shù)據(jù)點最近鄰和反最近鄰的鄰域關(guān)系,建立每個數(shù)據(jù)點的影響空間域,然后提出異常點判定函數(shù),使算法能夠準(zhǔn)確地識別噪聲點和邊界點。Reduce聚類任務(wù)結(jié)束后輸出局部聚類結(jié)果,為得到面向整個數(shù)據(jù)集的全局聚類結(jié)果,本文提出一種基于簇間距離的局部類簇合并算法,通過簇間距離的計算確定局部類簇間的分布關(guān)系,得到可以兩兩合并的局部類簇對,然后根據(jù)連通子圖發(fā)現(xiàn)方法合并局部類簇對,最后輸出全局聚類結(jié)果。實驗結(jié)果表明,本文提出的算法有效地將海量數(shù)據(jù)進(jìn)行約減,保證了樣本數(shù)據(jù)與原始數(shù)據(jù)分布的一致性,在信息量無損失的前提下降低了數(shù)據(jù)冗余,并且該算法能夠快速處理任意形狀的類簇,大幅度提高了算法的執(zhí)行效率和聚類質(zhì)量。
[Abstract]:With the rapid development of information technology, the scale of data is increasing exponentially. The traditional clustering algorithm is faced with great challenges. One is the noise clutter, high redundancy and low value density in massive data. The accuracy of clustering algorithm is not high; Second, the serial clustering algorithm in the face of massive data, the search cost is huge, the execution efficiency can not meet the actual needs. In view of the above problems, this paper fully analyzes the characteristics of the data. Based on the MapReduce big data processing framework, a distributed fast clustering algorithm is designed to achieve efficient and accurate parallel data clustering. In this paper, a distributed data reduction algorithm based on MapReduce is proposed, and a new sampling algorithm is proposed to calculate the rectangular and sampling regions of data points. The sample data is determined in the sampling domain, and then the sample data is extended to reduce the original data set. Finally, a representative validation strategy is proposed to test the sample set. In order to solve the problem of huge I / O overhead and network overhead caused by massive data clustering, aiming at the problem of high cost of searching nearest neighbor and low efficiency of clustering execution. In this paper, the Map task is used to partition the sample data set with equal size and reduce task to cluster the local density of the data subset. Therefore, a density clustering algorithm based on extended region query is proposed for single node. Firstly, the neighborhood relationship between nearest neighbor and anti-nearest neighbor of data point is determined by the extended region query method based on fixed grid. The influence spatial domain of each data point is established, and the outlier decision function is proposed, which enables the algorithm to accurately identify the noise point and the boundary point. In order to obtain the global clustering results for the whole data set, this paper proposes a local cluster merging algorithm based on the distance between clusters, which determines the distribution of local clusters by calculating the distance between clusters. The local cluster pairs which can be merged by pairwise are obtained, and the local cluster pairs are merged according to the connected subgraph discovery method. Finally, the global clustering results are outputted. The experimental results show that. The algorithm proposed in this paper can effectively reduce the mass data, ensure the consistency of the distribution between the sample data and the original data, and reduce the data redundancy without loss of information. Moreover, the algorithm can deal with arbitrary clusters quickly, which greatly improves the efficiency and clustering quality of the algorithm.
【學(xué)位授予單位】:東北電力大學(xué)
【學(xué)位級別】:碩士
【學(xué)位授予年份】:2017
【分類號】:TP311.13

【參考文獻(xiàn)】

相關(guān)期刊論文 前10條

1 韓巖;李曉;;加速大數(shù)據(jù)聚類K-means算法的改進(jìn)[J];計算機(jī)工程與設(shè)計;2015年05期

2 于彥偉;王歡;王沁;趙金東;;面向海量數(shù)據(jù)流的基于密度的簇結(jié)構(gòu)挖掘算法[J];軟件學(xué)報;2015年05期

3 程學(xué)旗;靳小龍;王元卓;郭嘉豐;張鐵贏;李國杰;;大數(shù)據(jù)系統(tǒng)和分析技術(shù)綜述[J];軟件學(xué)報;2014年09期

4 陸穎華;馬廷淮;鐘水明;曹杰;王新;Abdullah Al-Dhelaane;;Improved locality-sensitive hashing method for the approximate nearest neighbor problem[J];Chinese Physics B;2014年08期

5 崔穎安;李雪;王志曉;張德運;;在線社交媒體數(shù)據(jù)抽樣方法的比較研究[J];計算機(jī)學(xué)報;2014年08期

6 顧榮;嚴(yán)金雙;楊曉亮;袁春風(fēng);黃宜華;;Hadoop MapReduce短作業(yè)執(zhí)行性能優(yōu)化[J];計算機(jī)研究與發(fā)展;2014年06期

7 崔穎安;李雪;王志曉;張德運;;社會化媒體大數(shù)據(jù)多階段整群抽樣方法[J];軟件學(xué)報;2014年04期

8 張引;陳敏;廖小飛;;大數(shù)據(jù)應(yīng)用的現(xiàn)狀與展望[J];計算機(jī)研究與發(fā)展;2013年S2期

9 劉淑芬;孟冬雪;王曉燕;;基于網(wǎng)格單元的DBSCAN算法[J];吉林大學(xué)學(xué)報(工學(xué)版);2014年04期

10 劉義;景寧;陳犖;熊偉;;MapReduce框架下基于R-樹的k-近鄰連接算法[J];軟件學(xué)報;2013年08期

,

本文編號:1457393

資料下載
論文發(fā)表

本文鏈接:http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1457393.html


Copyright(c)文論論文網(wǎng)All Rights Reserved | 網(wǎng)站地圖 |

版權(quán)申明:資料由用戶f71f9***提供,本站僅收錄摘要或目錄,作者需要刪除請E-mail郵箱bigeng88@qq.com
日本不卡在线视频你懂的| 色丁香一区二区黑人巨大| 国产av一区二区三区四区五区| 久久99热成人网不卡| 日韩精品人妻少妇一区二区| 欧美午夜一级特黄大片| 亚洲精品国产第一区二区多人| 精品人妻一区二区三区免费| 亚洲国产欧美久久精品| 亚洲黑人精品一区二区欧美| 国产成人精品一区二三区在线观看 | 九九视频通过这里有精品| 白白操白白在线免费观看| 在线观看免费视频你懂的| 99久久婷婷国产亚洲综合精品| 国产精品久久久久久久久久久痴汉| 国产又粗又长又大的视频| 亚洲精选91福利在线观看| av在线免费观看一区二区三区| 国产欧美日韩精品成人专区| 欧美乱视频一区二区三区| 成人欧美精品一区二区三区| 欧美区一区二区在线观看| 九九视频通过这里有精品| 亚洲中文字幕亲近伦片| 国产精品一区二区三区黄色片| 成人免费观看视频免费| 免费大片黄在线观看国语| 欧洲一区二区三区自拍天堂| 日本妇女高清一区二区三区| 激情综合五月开心久久| 激情少妇一区二区三区| 精品久久少妇激情视频| 成人精品一区二区三区在线| 国产a天堂一区二区专区| 亚洲最大福利在线观看| 91播色在线免费播放| 丝袜诱惑一区二区三区| 成人免费在线视频大香蕉| 字幕日本欧美一区二区| 国产又大又硬又粗又湿|