Research on Algorithms for Detecting Differentially Expressed Genes Based on RNA-Seq Data
Published: 2018-12-08 12:31
[Abstract]: RNA-Seq (Ribonucleic Acid Sequencing) is a routine experimental technique in modern bioinformatics research. Its main purpose is to screen sequencing data for differentially expressed genes, that is, genes whose expression levels differ between samples. Differential expression analysis studies how the same class of genes is expressed at different developmental stages or under different physiological conditions. It is significant both statistically and biologically, and it provides an important theoretical basis for understanding the nature of life processes and for studying the regulation of gene expression. This thesis analyzes the processing pipeline for detecting differentially expressed genes in RNA-Seq data. The main contents are as follows:
(1) Building on Trimmed Mean of M-values (TMM) normalization and geometric-mean normalization, an improved normalization algorithm based on the coefficient of variation and a median absolute deviation adjustment is presented. First, normalized data are obtained with the TMM method and with the geometric-mean method separately. The coefficient of variation of each gene (row) is computed in both normalized data sets, the two coefficients are compared to obtain the optimal one, and a new data set is formed accordingly. A median absolute deviation adjustment is then applied to the new data to complete the normalization. Experimental results show that the algorithm not only removes technical sequencing error and brings all sequenced samples to the same level, but also yields smaller error and higher precision.
(2) Building on the svaseq (Surrogate Variable Analysis Sequencing) algorithm, an improved svaseq algorithm for removing batch effects is presented. First, a regularized-logarithm transformation model and a logarithm transformation model are constructed according to the relevant significance parameters. The model parameters are then estimated by weighted least squares to obtain the residual matrix of the data, and this matrix is factorized to estimate the surrogate variables. Experimental results show that the algorithm removes batch effects from the data more effectively and also improves the differential expression results to some extent.
(3) Building on the DESeq (Differential Expression Sequencing) algorithm, an improved DESeq algorithm for detecting differentially expressed genes is presented. The data are assumed to follow a negative binomial distribution. First, the total sequencing count of each sample is estimated from the improved normalization factors, the mean and variance of the model are computed, and the dispersion parameter is estimated. An exact test is then used for differential expression analysis. Experimental results show that the algorithm detects differentially expressed genes more effectively, improving accuracy by 6.9%.
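The abstract gives no formulas for step (1), so the Python sketch below only illustrates the ingredients it names and should not be read as the thesis's algorithm. It rests on several assumptions: the geometric-mean method is taken to be the median-of-ratios size-factor estimate, library-size scaling is used as a simple stand-in for TMM, and "optimal coefficient of variation" is interpreted as keeping, per gene, the normalized row with the smaller coefficient of variation. All function names are hypothetical, and the form of the median absolute deviation adjustment is left open because the abstract does not specify it.

```python
import math
from statistics import mean, median, pstdev


def size_factors_geometric(counts):
    """Median-of-ratios size factors; counts is a list of gene rows (genes x samples)."""
    n_samples = len(counts[0])
    # Use only genes with no zero count, so the log geometric mean is finite.
    usable = [row for row in counts if all(c > 0 for c in row)]
    log_geo = [mean(math.log(c) for c in row) for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - g for row, g in zip(usable, log_geo)]
        factors.append(math.exp(median(log_ratios)))
    return factors


def size_factors_total(counts):
    """Library-size scaling, used here only as a simple stand-in for TMM."""
    totals = [sum(row[j] for row in counts) for j in range(len(counts[0]))]
    avg = mean(totals)
    return [t / avg for t in totals]


def normalize(counts, factors):
    """Divide every count by the size factor of its sample (column)."""
    return [[c / f for c, f in zip(row, factors)] for row in counts]


def coefficient_of_variation(row):
    """Population standard deviation divided by the mean of one gene row."""
    m = mean(row)
    return pstdev(row) / m if m > 0 else float("inf")


def pick_lower_cv(norm_a, norm_b):
    """Assumed selection rule: per gene, keep the row whose CV is smaller."""
    return [a if coefficient_of_variation(a) <= coefficient_of_variation(b) else b
            for a, b in zip(norm_a, norm_b)]


def median_absolute_deviation(values):
    """Median of absolute deviations from the median (a robust spread measure)."""
    med = median(values)
    return median(abs(v - med) for v in values)


# Toy data (made up for illustration): 4 genes x 2 samples.
counts = [[10, 20], [100, 200], [50, 100], [5, 30]]
norm_geo = normalize(counts, size_factors_geometric(counts))
norm_tot = normalize(counts, size_factors_total(counts))
selected = pick_lower_cv(norm_geo, norm_tot)
```

On this toy matrix the second sample is sequenced about twice as deeply as the first, so the geometric-mean size factors differ by a factor of two and the first three genes become flat across samples after normalization. The selected rows could then be passed to whatever adjustment the thesis builds on `median_absolute_deviation`.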
[Degree-granting institution]: Dalian Maritime University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: Q811.4
Article ID: 2368351
Article link: http://sikaile.net/shoufeilunwen/benkebiyelunwen/2368351.html