基于Hadoop的社區(qū)發(fā)現算法并行化研究

發(fā)布時間：2018-04-22 08:04

本文選題：社區(qū)發(fā)現 + Hadoop　；參考：《江西理工大學》2017年碩士論文

【摘要】：社區(qū)發(fā)現算法是研究復雜網絡中社區(qū)結構的主要方法。隨著網絡規(guī)模爆發(fā)式的增長,傳統(tǒng)單機、串行的社區(qū)發(fā)現算法已不適用于處理當前大規(guī)模的網絡。Hadoop作為新興的一種大數據處理技術,因其高擴展、高可靠、編程模型簡單受到許多開發(fā)者的青睞。針對當前串行社區(qū)發(fā)現算法處理網絡規(guī)模有限問題,本文結合Hadoop框架在大數據處理方面的優(yōu)勢提出兩種串行社區(qū)發(fā)現算法并行化改造方案。針對Fast-Newman算法計算節(jié)點模塊度復雜度比較高問題,本文提出基于Hadoop框架調度的Fast-Newman并行算法。并行化Fast-Newman算法存在的主要難點為在map函數中各節(jié)點無法獲取其鄰居節(jié)點的信息,故還需借助Web服務提供全局圖信息。改進后的Fast-Newman算法將在map函數中并行的計算每個節(jié)點與其鄰居節(jié)點合并后的模塊度增量,在reduce函數中匯總找出模塊度增量最大的兩個節(jié)點并將該兩節(jié)點合并,此為一次合并過程。合并后的結果重新作為map函數的輸入,迭代執(zhí)行map、reduce過程直到所有節(jié)點合并成一個社區(qū)。采用Ego-Facebook作為數據集,在仿真環(huán)境下實驗結果表明并行化后的Fast-Newman算法具有較高的加速比。針對處理大規(guī)模網絡難問題,本文提出基于Hadoop的Fast-Unfolding并行化算法PFU(Parallel-Fast-Unfolding)。該算法主要采用“分而治之”的思想,首先將大規(guī)模網絡分區(qū)并各自合并,然后根據各分區(qū)合并結果重構網絡,最后迭代合并重構網絡直到社區(qū)結構不再發(fā)生變化。該并行化方案存在兩個難點:一是如何保證分區(qū)后邊連接信息不會丟失;二是分區(qū)完成后如何重構網絡。針對上述兩個難點,本文通過改進數據存儲方式以及設計重構方案有效地解決了該問題。在真實網絡和生成網絡兩種數據集上實驗結果表明,PFU算法在保證準確率的基礎上明顯的提高了算法運行的效率,具有較好的擴展性。最后,根據map、reduce階段輸出的中間文件,用gephi軟件對結果進行可視化提高了PFU算法的應用價值。
[Abstract]:Community discovery algorithm is the main method to study the community structure in complex networks. With the explosive growth of network scale, the traditional single-machine, serial community discovery algorithm is no longer suitable to deal with the current large-scale network. Hadoop as a new big data processing technology, because of its high expansion, high reliability, Simple programming models are popular with many developers. In view of the limited network size of the current serial community discovery algorithm, this paper proposes two parallel transformation schemes of serial community discovery algorithm based on the advantages of Hadoop framework in big data processing. Aiming at the high complexity of computing node modules in Fast-Newman algorithm, a Fast-Newman parallel algorithm based on Hadoop framework scheduling is proposed in this paper. The main difficulty of parallelization Fast-Newman algorithm is that each node can not get the information of its neighbor nodes in the map function, so it is necessary to provide global graph information with the help of Web service. The improved Fast-Newman algorithm will calculate the modular degree increment of each node and its neighbor nodes in parallel in map function, and find out the two nodes with the largest module degree increment in the reduce function and merge the two nodes together, which is a merging process. The merged results are reused as input to the map function, iterating through the map-reduce process until all nodes are merged into one community. Using Ego-Facebook as data set, the simulation results show that the parallel Fast-Newman algorithm has a high speedup. In order to deal with the problem of large scale network, this paper proposes a parallel Fast-Unfolding algorithm based on Hadoop, PFUU Parallel-Fast-Unfolding algorithm (PFU / Parallel-Fast-Unfolding). The algorithm mainly adopts the idea of "divide and conquer". Firstly, the large-scale network is partitioned and merged separately, then the network is reconstructed according to the results of each partition merging. Finally, the network is merged and reconstructed iteratively until the community structure is no longer changed. There are two difficulties in the parallelization scheme: one is how to ensure that the connection information behind the partition will not be lost; the other is how to reconstruct the network after the partition is completed. In view of the above two difficulties, this paper solves the problem effectively by improving the data storage method and designing the refactoring scheme. Experimental results on two data sets of real network and generated network show that the PFU algorithm can obviously improve the efficiency and scalability of the algorithm on the basis of ensuring the accuracy. Finally, according to the intermediate file output from map reduce phase, the application value of PFU algorithm is improved by using gephi software to visualize the result.
【學位授予單位】：江西理工大學
【學位級別】：碩士
【學位授予年份】：2017
【分類號】：TP311.13;O157.5

【參考文獻】

相關期刊論文前10條

1 吳衛(wèi)江;李沐南;李國和;;Louvain算法的并行化處理[J];計算機與數字工程;2016年08期

2 陳羽中;施松;朱偉平;於志勇;郭昆;;一種基于鄰域跟隨關系的增量社區(qū)發(fā)現算法[J];計算機學報;2017年03期

3 王昊宇;吳斌;;基于MapReduce的二分圖社團發(fā)現[J];計算機應用與軟件;2015年06期

4 孫霞;張敏超;馮筠;張蕾;何緋娟;;Hadoop框架下的多標簽傳播算法[J];西安交通大學學報;2015年05期

5 辛宇;楊靜;謝志強;;基于隨機游走的語義重疊社區(qū)發(fā)現算法[J];計算機研究與發(fā)展;2015年02期

6 畢娟;秦志光;;基于概率主題模型的社交網絡層次化社區(qū)發(fā)現算法[J];電子科技大學學報;2014年06期

7 和亮;馮登國;蘇璞睿;應凌云;楊軼;;基于社團并行發(fā)現的在線社交網絡蠕蟲抑制[J];計算機學報;2015年04期

8 馬千里;張俊浩;;一種局部強化的多標簽傳播社區(qū)發(fā)現算法[J];計算機工程;2014年06期

9 武森;盧丹;馮小東;杜彥南;;基于大規(guī)模復雜網絡社區(qū)發(fā)現的科研合著網絡分析[J];中國科技論文;2014年04期

10 李冠辰;;一個基于hadoop的并行社交網絡挖掘系統(tǒng)[J];軟件;2013年12期

，

本文編號：1786300

資料下載

論文發(fā)表

支付寶下載

Download by Alipay
微信下載

Download by Wechat
會員下載

Download by Member

本文鏈接：http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1786300.html

上一篇：基于數據流模式表示的半懶惰式分類算法研究
下一篇：基于詞向量聚類的中文微博產品命名實體識別

論文發(fā)表

·知網|萬方|維普|龍源|省級|國家級|科技核心|北大核心|南大核心CSSCI|EI|SCI|SSCI|

天堂国产午夜亚洲专区-少妇人妻综合久久蜜臀-国产成人户外露出视频在线-国产91传媒一区二区三区

基于Hadoop的社區(qū)發(fā)現算法并行化研究