基于Hadoop的社區(qū)發(fā)現(xiàn)算法并行化研究
本文選題:社區(qū)發(fā)現(xiàn) + Hadoop; 參考:《江西理工大學(xué)》2017年碩士論文
【摘要】:社區(qū)發(fā)現(xiàn)算法是研究復(fù)雜網(wǎng)絡(luò)中社區(qū)結(jié)構(gòu)的主要方法。隨著網(wǎng)絡(luò)規(guī)模爆發(fā)式的增長(zhǎng),傳統(tǒng)單機(jī)、串行的社區(qū)發(fā)現(xiàn)算法已不適用于處理當(dāng)前大規(guī)模的網(wǎng)絡(luò)。Hadoop作為新興的一種大數(shù)據(jù)處理技術(shù),因其高擴(kuò)展、高可靠、編程模型簡(jiǎn)單受到許多開(kāi)發(fā)者的青睞。針對(duì)當(dāng)前串行社區(qū)發(fā)現(xiàn)算法處理網(wǎng)絡(luò)規(guī)模有限問(wèn)題,本文結(jié)合Hadoop框架在大數(shù)據(jù)處理方面的優(yōu)勢(shì)提出兩種串行社區(qū)發(fā)現(xiàn)算法并行化改造方案。針對(duì)Fast-Newman算法計(jì)算節(jié)點(diǎn)模塊度復(fù)雜度比較高問(wèn)題,本文提出基于Hadoop框架調(diào)度的Fast-Newman并行算法。并行化Fast-Newman算法存在的主要難點(diǎn)為在map函數(shù)中各節(jié)點(diǎn)無(wú)法獲取其鄰居節(jié)點(diǎn)的信息,故還需借助Web服務(wù)提供全局圖信息。改進(jìn)后的Fast-Newman算法將在map函數(shù)中并行的計(jì)算每個(gè)節(jié)點(diǎn)與其鄰居節(jié)點(diǎn)合并后的模塊度增量,在reduce函數(shù)中匯總找出模塊度增量最大的兩個(gè)節(jié)點(diǎn)并將該兩節(jié)點(diǎn)合并,此為一次合并過(guò)程。合并后的結(jié)果重新作為map函數(shù)的輸入,迭代執(zhí)行map、reduce過(guò)程直到所有節(jié)點(diǎn)合并成一個(gè)社區(qū)。采用Ego-Facebook作為數(shù)據(jù)集,在仿真環(huán)境下實(shí)驗(yàn)結(jié)果表明并行化后的Fast-Newman算法具有較高的加速比。針對(duì)處理大規(guī)模網(wǎng)絡(luò)難問(wèn)題,本文提出基于Hadoop的Fast-Unfolding并行化算法PFU(Parallel-Fast-Unfolding)。該算法主要采用“分而治之”的思想,首先將大規(guī)模網(wǎng)絡(luò)分區(qū)并各自合并,然后根據(jù)各分區(qū)合并結(jié)果重構(gòu)網(wǎng)絡(luò),最后迭代合并重構(gòu)網(wǎng)絡(luò)直到社區(qū)結(jié)構(gòu)不再發(fā)生變化。該并行化方案存在兩個(gè)難點(diǎn):一是如何保證分區(qū)后邊連接信息不會(huì)丟失;二是分區(qū)完成后如何重構(gòu)網(wǎng)絡(luò)。針對(duì)上述兩個(gè)難點(diǎn),本文通過(guò)改進(jìn)數(shù)據(jù)存儲(chǔ)方式以及設(shè)計(jì)重構(gòu)方案有效地解決了該問(wèn)題。在真實(shí)網(wǎng)絡(luò)和生成網(wǎng)絡(luò)兩種數(shù)據(jù)集上實(shí)驗(yàn)結(jié)果表明,PFU算法在保證準(zhǔn)確率的基礎(chǔ)上明顯的提高了算法運(yùn)行的效率,具有較好的擴(kuò)展性。最后,根據(jù)map、reduce階段輸出的中間文件,用gephi軟件對(duì)結(jié)果進(jìn)行可視化提高了PFU算法的應(yīng)用價(jià)值。
[Abstract]:Community discovery algorithm is the main method to study the community structure in complex networks. With the explosive growth of network scale, the traditional single-machine, serial community discovery algorithm is no longer suitable to deal with the current large-scale network. Hadoop as a new big data processing technology, because of its high expansion, high reliability, Simple programming models are popular with many developers. In view of the limited network size of the current serial community discovery algorithm, this paper proposes two parallel transformation schemes of serial community discovery algorithm based on the advantages of Hadoop framework in big data processing. Aiming at the high complexity of computing node modules in Fast-Newman algorithm, a Fast-Newman parallel algorithm based on Hadoop framework scheduling is proposed in this paper. The main difficulty of parallelization Fast-Newman algorithm is that each node can not get the information of its neighbor nodes in the map function, so it is necessary to provide global graph information with the help of Web service. The improved Fast-Newman algorithm will calculate the modular degree increment of each node and its neighbor nodes in parallel in map function, and find out the two nodes with the largest module degree increment in the reduce function and merge the two nodes together, which is a merging process. The merged results are reused as input to the map function, iterating through the map-reduce process until all nodes are merged into one community. Using Ego-Facebook as data set, the simulation results show that the parallel Fast-Newman algorithm has a high speedup. In order to deal with the problem of large scale network, this paper proposes a parallel Fast-Unfolding algorithm based on Hadoop, PFUU Parallel-Fast-Unfolding algorithm (PFU / Parallel-Fast-Unfolding). The algorithm mainly adopts the idea of "divide and conquer". Firstly, the large-scale network is partitioned and merged separately, then the network is reconstructed according to the results of each partition merging. Finally, the network is merged and reconstructed iteratively until the community structure is no longer changed. There are two difficulties in the parallelization scheme: one is how to ensure that the connection information behind the partition will not be lost; the other is how to reconstruct the network after the partition is completed. In view of the above two difficulties, this paper solves the problem effectively by improving the data storage method and designing the refactoring scheme. Experimental results on two data sets of real network and generated network show that the PFU algorithm can obviously improve the efficiency and scalability of the algorithm on the basis of ensuring the accuracy. Finally, according to the intermediate file output from map reduce phase, the application value of PFU algorithm is improved by using gephi software to visualize the result.
【學(xué)位授予單位】:江西理工大學(xué)
【學(xué)位級(jí)別】:碩士
【學(xué)位授予年份】:2017
【分類號(hào)】:TP311.13;O157.5
【參考文獻(xiàn)】
相關(guān)期刊論文 前10條
1 吳衛(wèi)江;李沐南;李國(guó)和;;Louvain算法的并行化處理[J];計(jì)算機(jī)與數(shù)字工程;2016年08期
2 陳羽中;施松;朱偉平;於志勇;郭昆;;一種基于鄰域跟隨關(guān)系的增量社區(qū)發(fā)現(xiàn)算法[J];計(jì)算機(jī)學(xué)報(bào);2017年03期
3 王昊宇;吳斌;;基于MapReduce的二分圖社團(tuán)發(fā)現(xiàn)[J];計(jì)算機(jī)應(yīng)用與軟件;2015年06期
4 孫霞;張敏超;馮筠;張蕾;何緋娟;;Hadoop框架下的多標(biāo)簽傳播算法[J];西安交通大學(xué)學(xué)報(bào);2015年05期
5 辛宇;楊靜;謝志強(qiáng);;基于隨機(jī)游走的語(yǔ)義重疊社區(qū)發(fā)現(xiàn)算法[J];計(jì)算機(jī)研究與發(fā)展;2015年02期
6 畢娟;秦志光;;基于概率主題模型的社交網(wǎng)絡(luò)層次化社區(qū)發(fā)現(xiàn)算法[J];電子科技大學(xué)學(xué)報(bào);2014年06期
7 和亮;馮登國(guó);蘇璞睿;應(yīng)凌云;楊軼;;基于社團(tuán)并行發(fā)現(xiàn)的在線社交網(wǎng)絡(luò)蠕蟲抑制[J];計(jì)算機(jī)學(xué)報(bào);2015年04期
8 馬千里;張俊浩;;一種局部強(qiáng)化的多標(biāo)簽傳播社區(qū)發(fā)現(xiàn)算法[J];計(jì)算機(jī)工程;2014年06期
9 武森;盧丹;馮小東;杜彥南;;基于大規(guī)模復(fù)雜網(wǎng)絡(luò)社區(qū)發(fā)現(xiàn)的科研合著網(wǎng)絡(luò)分析[J];中國(guó)科技論文;2014年04期
10 李冠辰;;一個(gè)基于hadoop的并行社交網(wǎng)絡(luò)挖掘系統(tǒng)[J];軟件;2013年12期
,本文編號(hào):1786300
本文鏈接:http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1786300.html