PageRank算法的改進(jìn)及在生物網(wǎng)絡(luò)數(shù)據(jù)上的應(yīng)用
發(fā)布時(shí)間:2018-08-16 11:17
【摘要】:作為世界上最成功的網(wǎng)頁(yè)排名算法,PageRank算法于1998年由Google的兩位創(chuàng)始人Larry Page和Sergey Brin開(kāi)發(fā),最初被用于Google的搜索引擎。該算法將用戶的瀏覽行為描述為一個(gè)Markov隨機(jī)沖浪模型,為了滿足網(wǎng)絡(luò)的強(qiáng)連通特性,算法中加入阻尼系數(shù)d,d的取值范圍為[0,1],因此可將其視為如何選取下一個(gè)節(jié)點(diǎn)的概率。從概率的角度,PageRank算法可理解為:以網(wǎng)絡(luò)中的某一節(jié)點(diǎn)為初始點(diǎn),用戶以d的概率沿著該節(jié)點(diǎn)某一條鏈出鏈接繼續(xù)瀏覽,以1-d的概率隨機(jī)瀏覽整個(gè)網(wǎng)絡(luò)中的某一點(diǎn)。PageRank算法的運(yùn)算過(guò)程須經(jīng)多次迭代,由于算法的轉(zhuǎn)移矩陣為一次馬爾科夫鏈,根據(jù)馬氏鏈的性質(zhì),算法最終能夠平穩(wěn)分布,而穩(wěn)定狀態(tài)下各網(wǎng)頁(yè)被訪問(wèn)的概率分布即為各節(jié)點(diǎn)的PageRank值,表達(dá)了各點(diǎn)在網(wǎng)絡(luò)中的重要程度。 PageRank算法單純基于網(wǎng)絡(luò)的拓?fù)浣Y(jié)構(gòu)來(lái)判定節(jié)點(diǎn)的重要程度,傳統(tǒng)PageRank算法將一個(gè)節(jié)點(diǎn)的所有鏈出節(jié)點(diǎn)視為同等重要,因此在計(jì)算轉(zhuǎn)移矩陣時(shí),如果節(jié)點(diǎn)i有n個(gè)鏈出節(jié)點(diǎn),那么i的每個(gè)鏈出節(jié)點(diǎn)平均分得i的PageRank值的1/n。這顯然是不合乎事實(shí)的,以Internet為例,假設(shè)有網(wǎng)頁(yè)i指向n個(gè)子網(wǎng)頁(yè),盡管這n個(gè)網(wǎng)頁(yè)同為i的后繼網(wǎng)頁(yè),但這其中有些網(wǎng)頁(yè)相對(duì)重要,客戶訪問(wèn)它的幾率就大,因此i的n個(gè)鏈出節(jié)點(diǎn)中有些權(quán)值要大于1/n,有些則很小。本文基于這一思想,對(duì)傳統(tǒng)的PageRank算法進(jìn)行了改進(jìn),理論上改進(jìn)算法更符合客觀事實(shí)。最后利用JAVA和MATLAB等技術(shù)將算法編程實(shí)現(xiàn)。 PageRank算法的易于理解性使得該算法廣為流傳,學(xué)者們已不滿足只在搜索引擎中使用該算法,,凡是具有大量數(shù)據(jù)的網(wǎng)絡(luò)環(huán)境都可以利用PageRank算法加以分析。生物信息學(xué)中的蛋白質(zhì)-蛋白質(zhì)交互網(wǎng)絡(luò)(PPI)中包含了大量的待挖掘的生物信息,與萬(wàn)維網(wǎng)(www)一樣,兩者都為Scale-Free網(wǎng)絡(luò),這為PPI網(wǎng)絡(luò)與PageRank算法的結(jié)合提供了前提。很多學(xué)者已經(jīng)在這一領(lǐng)域做過(guò)嘗試。本文在與前人數(shù)據(jù)相同的PPI網(wǎng)絡(luò)數(shù)據(jù)上運(yùn)行改進(jìn)的PageRank算法程序,利用已知的幾種在黑素瘤病人血液中測(cè)出的差異表達(dá)的蛋白質(zhì),找到了與這些蛋白相關(guān)的一組蛋白質(zhì),之后通過(guò)對(duì)比分析新舊兩組結(jié)果,驗(yàn)證了改進(jìn)算法的結(jié)果更符合生物意義。
[Abstract]:The PageRank algorithm, the world's most successful page ranking algorithm, was developed in 1998 by Larry Page and Sergey Brin, the founders of Google, and was originally used as a search engine for Google. In this algorithm, the browsing behavior of users is described as a Markov random surfing model. In order to satisfy the strong connectivity of the network, the value range of adding damping coefficient dbd in the algorithm is [0], so it can be regarded as the probability of how to select the next node. From the point of view of probability, the PageRank algorithm can be understood as: taking a node in the network as the initial point, the user continues to browse along a link out of the chain of the node with the probability of d. The operation process of the PageRank algorithm has to be iterated through several iterations with the probability of 1-d. Because the transfer matrix of the algorithm is a Markov chain, according to the properties of Markov chain, the algorithm can be distributed stably. In the stable state, the probability distribution of each web page being visited is the PageRank value of each node, which expresses the importance of each point in the network. The PageRank algorithm determines the importance of the node simply based on the topology of the network. In the traditional PageRank algorithm, all the nodes out of the chain of a node are considered as equally important, so when calculating the transfer matrix, if node I has n nodes out of the chain, then the average value of the PageRank value of I is 1 / n of the PageRank value of I for each out node of I. This is obviously not true. Let's say Internet, for example, has a web page I pointing to n subpages, and even though these n pages are successors to I, some of these pages are relatively important, and the chances of clients visiting them are high. As a result, some of the n chain out nodes of I have weights greater than 1 / n, while others are very small. Based on this idea, this paper improves the traditional PageRank algorithm, which is more in line with the objective facts in theory. Finally, using JAVA and MATLAB to program the algorithm. The easy-to-understand PageRank algorithm makes the algorithm popular, scholars are not satisfied to only use the algorithm in search engines. Any network environment with a large amount of data can be analyzed by using PageRank algorithm. The protein-protein interaction network (PPI) in bioinformatics contains a lot of biological information to be mined. Like the World wide Web (www), both of them are Scale-Free networks, which provide a prerequisite for the combination of PPI networks and PageRank algorithms. Many scholars have tried in this field. In this paper, we run the improved PageRank algorithm program on the PPI network data which is the same as the previous data, and find a set of proteins associated with these proteins using several known differentially expressed proteins in the blood of melanoma patients. By comparing the results of the two groups, it is proved that the improved algorithm is more suitable for biological significance.
【學(xué)位授予單位】:吉林大學(xué)
【學(xué)位級(jí)別】:碩士
【學(xué)位授予年份】:2012
【分類(lèi)號(hào)】:TP301.6;TP393.0
本文編號(hào):2185826
[Abstract]:The PageRank algorithm, the world's most successful page ranking algorithm, was developed in 1998 by Larry Page and Sergey Brin, the founders of Google, and was originally used as a search engine for Google. In this algorithm, the browsing behavior of users is described as a Markov random surfing model. In order to satisfy the strong connectivity of the network, the value range of adding damping coefficient dbd in the algorithm is [0], so it can be regarded as the probability of how to select the next node. From the point of view of probability, the PageRank algorithm can be understood as: taking a node in the network as the initial point, the user continues to browse along a link out of the chain of the node with the probability of d. The operation process of the PageRank algorithm has to be iterated through several iterations with the probability of 1-d. Because the transfer matrix of the algorithm is a Markov chain, according to the properties of Markov chain, the algorithm can be distributed stably. In the stable state, the probability distribution of each web page being visited is the PageRank value of each node, which expresses the importance of each point in the network. The PageRank algorithm determines the importance of the node simply based on the topology of the network. In the traditional PageRank algorithm, all the nodes out of the chain of a node are considered as equally important, so when calculating the transfer matrix, if node I has n nodes out of the chain, then the average value of the PageRank value of I is 1 / n of the PageRank value of I for each out node of I. This is obviously not true. Let's say Internet, for example, has a web page I pointing to n subpages, and even though these n pages are successors to I, some of these pages are relatively important, and the chances of clients visiting them are high. As a result, some of the n chain out nodes of I have weights greater than 1 / n, while others are very small. Based on this idea, this paper improves the traditional PageRank algorithm, which is more in line with the objective facts in theory. Finally, using JAVA and MATLAB to program the algorithm. The easy-to-understand PageRank algorithm makes the algorithm popular, scholars are not satisfied to only use the algorithm in search engines. Any network environment with a large amount of data can be analyzed by using PageRank algorithm. The protein-protein interaction network (PPI) in bioinformatics contains a lot of biological information to be mined. Like the World wide Web (www), both of them are Scale-Free networks, which provide a prerequisite for the combination of PPI networks and PageRank algorithms. Many scholars have tried in this field. In this paper, we run the improved PageRank algorithm program on the PPI network data which is the same as the previous data, and find a set of proteins associated with these proteins using several known differentially expressed proteins in the blood of melanoma patients. By comparing the results of the two groups, it is proved that the improved algorithm is more suitable for biological significance.
【學(xué)位授予單位】:吉林大學(xué)
【學(xué)位級(jí)別】:碩士
【學(xué)位授予年份】:2012
【分類(lèi)號(hào)】:TP301.6;TP393.0
【參考文獻(xiàn)】
相關(guān)期刊論文 前1條
1 孫宇;賈凌云;任軍;;蛋白質(zhì)相互作用的研究方法[J];分析化學(xué);2007年05期
相關(guān)碩士學(xué)位論文 前2條
1 張巍;基于PageRank算法的搜索引擎優(yōu)化策略研究[D];四川大學(xué);2005年
2 張永強(qiáng);基于轉(zhuǎn)移概率的PageRank算法研究[D];暨南大學(xué);2009年
本文編號(hào):2185826
本文鏈接:http://sikaile.net/kejilunwen/sousuoyinqinglunwen/2185826.html
最近更新
教材專著