Research on Genotype-Phenotype Association Techniques Based on In-Memory Computing

Published: 2018-02-28 23:28

Keywords: disease phenotype, pathogenic gene, prioritization, TrustRank, big data. Source: Harbin Institute of Technology, 2017 master's thesis. Document type: degree thesis.

[Abstract]: As biomedical data grows explosively, the fast-moving field of bioinformatics keeps uncovering the information hidden in that data, and the area has become a research hotspot. Identifying pathogenic genes is a fundamental challenge in human health research, and it requires understanding, through biological networks, how genotypes are associated with disease phenotypes. Massive amounts of biological data are stored in databases that share no unified standard. Biological networks are built from this data, and studying them is itself a way of exploring complex life processes. The association between disease phenotypes and genotypes matters both for predicting pathogenic genes and for finding the diseases that a gene causes. The modularity of disease suggests that functionally related proteins lead to similar diseases. Most disease-gene association methods are therefore network-based and integrate a protein-protein interaction network, a disease phenotype similarity network, and a disease-gene bipartite network. Online Mendelian Inheritance in Man (OMIM) is a database of human genetic diseases and their related genes. From OMIM data we compute a disease phenotype similarity network and a disease-gene association network, then combine them with a protein-protein interaction network to build a complex heterogeneous network. This thesis reviews the related random walk with restart algorithms and derives the YSearch method by improving the web-page ranking algorithm TrustRank. Given the constructed network, the algorithm first selects prior knowledge (a seed set) for the query disease or gene, then iterates a random walk over the global network to obtain TR scores, and finally ranks candidate genes and diseases by priority to make predictions. The algorithm is evaluated with leave-one-out cross-validation, and its results are compared with other methods using ROC curves, which show that it performs well.
On this basis we designed and developed YSearch, a search engine platform for genes and diseases. The whole system is built on Spark, an in-memory big data computing platform, with the data stored in HBase. The thesis also describes the system and its optimization. Both the algorithm and the platform can offer new ideas for clinical research such as disease diagnosis and treatment.
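The seed-based ranking step described in the abstract can be illustrated with a minimal sketch of TrustRank-style score propagation (a random walk with restart from a seed set). This is not the thesis's YSearch implementation: the function names, the damping value of 0.85, the convergence tolerance, and the six-node toy network are all assumptions made for illustration, and the sketch treats every edge type (phenotype similarity, disease-gene association, protein interaction) as an unweighted, undirected edge.

```python
def trustrank_scores(edges, seeds, damping=0.85, tol=1e-10, max_iter=1000):
    """Propagate trust from seed nodes over an undirected, unweighted graph.

    Hypothetical helper: iterates t = damping * T * t + (1 - damping) * d,
    where T spreads each node's score evenly over its neighbours and d is
    the uniform distribution over the seed set.
    """
    # Build an undirected adjacency list from the edge pairs.
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    nodes = sorted(neighbors)

    # Restart vector d: uniform over the seeds that exist in the graph.
    seed_set = {s for s in seeds if s in neighbors}
    if not seed_set:
        raise ValueError("no seed node is present in the network")
    restart = {n: (1.0 / len(seed_set) if n in seed_set else 0.0) for n in nodes}

    scores = dict(restart)
    for _ in range(max_iter):
        # Every node receives its restart mass first.
        new_scores = {n: (1.0 - damping) * restart[n] for n in nodes}
        # Each node then passes its damped score equally to its neighbours.
        for u in nodes:
            share = damping * scores[u] / len(neighbors[u])
            for v in neighbors[u]:
                new_scores[v] += share
        delta = sum(abs(new_scores[n] - scores[n]) for n in nodes)
        scores = new_scores
        if delta < tol:
            break
    return scores


def rank_candidates(scores, candidates):
    """Order candidate nodes from highest to lowest propagated score."""
    return sorted(candidates, key=lambda n: scores.get(n, 0.0), reverse=True)


# Toy heterogeneous network (invented for illustration only).
toy_edges = [
    ("D1", "D2"),  # disease phenotype similarity
    ("D1", "G1"),  # known disease-gene association
    ("D2", "G2"),  # known disease-gene association
    ("G1", "G2"),  # protein-protein interaction
    ("G2", "G3"),  # protein-protein interaction
    ("G3", "G4"),  # protein-protein interaction
]

# Query disease D1: the seeds are D1 itself and its known gene G1.
scores = trustrank_scores(toy_edges, seeds=["D1", "G1"])
ranking = rank_candidates(scores, ["G2", "G3", "G4"])
print(ranking)
```

In this toy network the candidate genes closer to the seeds receive higher scores, which is the behaviour the prioritization step relies on. A realistic version would weight edges by similarity or interaction confidence and, as the abstract indicates, run on Spark rather than in a single process.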
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: R3416;TP311.13

Article ID: 1549457