
Experimental Design and Data Analysis for Pooled Sequencing

Published: 2018-01-17 22:28

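  Keywords: Experimental design and data analysis for pooled sequencing. Source: Southeast University, 2017 doctoral dissertation. Document type: degree thesis.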


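  Related topics: pooled sequencing; group testing; combinatorial pooled sequencing; rare mutations; rare haplotypes; individual haplotype construction; single nucleotide polymorphisms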


[Abstract]: DNA sequencing can be traced back to the 1950s, when polyribonucleotide sequences were first determined by chemical degradation. After decades of effort the technology has advanced enormously and sequencing costs have fallen sharply. With the commercialization of second- and third-generation high-throughput sequencing, the cost of sequencing a human genome has dropped to about US$1,000, and the technology continues to move toward higher throughput, lower cost, and longer reads. Even so, whole-genome sequencing of a large number of individuals remains very expensive. The main obstacle is the cost of amplifying a large number of DNA samples and constructing a library for each one.

Pooled sequencing emerged to make full use of the very high throughput of current platforms: several samples are mixed and sequenced in a single run. Its main problem is that the reads from all samples are mixed together, so barcodes are needed to determine which sample each read came from. Because read length is limited, barcode sequences must be very short, which restricts the number of samples that can be encoded, and ligating a specific sequence to each of many samples is time-consuming and laborious. In 2009, Patterson et al. proposed a new pooled design, combinatorial pooled sequencing, in which a large number of samples are pooled combinatorially and then sequenced. Each sample is added to several pools, and its pooling pattern serves as a code that labels it. After sequencing, a decoding method uses the pooling patterns to recover the sequencing data belonging to each sample. Compared with ordinary pooled sequencing, combinatorial pooled sequencing therefore adds an encoding step and a decoding step. Encoding is the combinatorial pooling itself, that is, designing a pooling scheme in which every sample has a unique pooling pattern. Decoding is the recovery of each sample's data from the pooled sequencing results according to those patterns.

This thesis studies experimental design and data analysis for pooled sequencing, and for combinatorial pooled sequencing in particular. It first builds an optimized design for combinatorial pooled sequencing, then applies it to screening carriers of rare mutations, screening carriers of rare haplotypes, and constructing individual haplotypes. Finally, it develops a method for detecting single nucleotide polymorphisms (SNPs) in pooled samples based on real-time two-nucleotide sequencing-by-synthesis and validates it with data from a real pooled sequencing experiment. The main contents are as follows.

1. Design and optimization of combinatorial pooled sequencing schemes for screening carriers of rare mutations. We first build an optimal sequencing depth model for pooled sequencing and a cost model for combinatorial pooled sequencing. Using pooling matrix designs from the group testing field, we then choose the design parameters that minimize sequencing cost while preserving the accuracy of carrier identification. Given the limit on the number of samples per pool and the very high depth that pooled sequencing requires, dividing a large sample set into several groups and running combinatorial pooled sequencing independently on each group reduces the screening cost further. Simulations show that, for a target region of 30 Mb, screening 200 diploid samples for carriers of a rare mutation at 1% frequency with the optimized design reduces the cost to 52% of that of sequencing each individual separately.

To exploit the quantitative information in pooled sequencing results, namely the number of reads that carry the mutation, we draw on quantitative designs from group testing and propose a combinatorial pooled sequencing scheme that screens rare mutation carriers from large sample sets more efficiently. The scheme pools samples with a random k-set matrix, uses an indicator probability value to evaluate the performance of a pooling matrix, and identifies carriers with a heuristic Bayesian decoding algorithm. Using publicly available real reads and artificially simulated pooled results, we simulated combinatorial pooled sequencing to screen 200 Escherichia coli strains for strains carrying rare mutations. The scheme correctly identified 91.5%-97.9% of the carriers for mutation frequencies between 0.5% and 1.5%. Compared with combinatorial pooled sequencing based on ordinary group testing and with a published compressed sequencing method, the scheme based on quantitative group testing performs better, especially in reducing the amount of sequencing data required and the experimental cost.

2. Algorithms for estimating haplotype frequencies in pooled samples and identifying carriers of rare haplotypes. Given a prior database of known haplotypes, we propose Ehapp to estimate the proportion of each database haplotype from pooled sequencing results. Ehapp converts haplotype frequency estimation in a pooled sample into the problem of finding a sparse solution to a linear system and solves it with sparse signal reconstruction algorithms from compressed sensing. When a pooled sample containing 10 haplotypes is sequenced at 50× depth, the relative error of the proportions estimated by Ehapp is about 3%. Even when the pooled sample contains unknown haplotypes, Ehapp still accurately estimates the proportions of known haplotypes whose content in the pool exceeds 0.05. In simulations with both simulated and publicly available real sequencing results, Ehapp outperforms existing algorithms under many sequencing designs. By using Ehapp to estimate haplotype proportions in pooled samples, we also demonstrate that screening carriers of rare haplotypes by combinatorial pooled sequencing is feasible.

Building on Ehapp, we propose Ehapp2. Unlike Ehapp, Ehapp2 uses a local haplotype within a fixed length, rather than a single SNP, as its basic unit. It also estimates the proportions of local haplotypes with an expectation-maximization algorithm, which makes effective use of sequencing quality values to reduce the effect of sequencing errors. Extensive simulations show that Ehapp2 is insensitive to sequencing errors: even at a sequencing error rate of 0.05, when a pooled sample containing 10 haplotypes is sequenced at 50× depth, the error of the estimated proportions stays at about 3%. Because its basic computational unit is a local haplotype rather than a single SNP, Ehapp2 can also accurately estimate the proportions of recombinant haplotypes. Comparisons with Ehapp and the existing algorithm Harp show that Ehapp2 performs better and is better suited to current second-generation high-throughput sequencing. Ehapp and Ehapp2 for Linux can be downloaded from http://bioinfo.seu.edu.cn/Ehapp and http://bioinfo.seu.edu.cn/Ehapp2, respectively.

3. An individual haplotype construction scheme based on combinatorial pooled clone sequencing. After a clone library is built for an individual, a large number of clones are pooled combinatorially according to a random matrix design and sequenced. The pooling pattern of each clone is then used to recover all clones that carry each allele, and from that all alleles carried by each clone, which reconstructs the clone sequences. Finally, the individual haplotype assembly software HapCUT links the clones to reconstruct the individual's haplotype sequences. Based on the diploid genome of individual NA12878, we simulated the assembly of the haplotypes of chromosome 1. The final assembly contains 112 contigs with an N50 of 3.4 Mb and no flip errors. Compared with existing methods, our method is more accurate. To make the method easier to use, we have written a corresponding pipeline, available at http://bioinfo.seu.edu.cn/OPShap.

4. A method for detecting SNPs in pooled samples based on real-time two-nucleotide sequencing. For real-time two-nucleotide sequencing, a sequencing technology patented by Southeast University, we propose Epds, a method for detecting SNPs in pooled DNA samples. Based on the five types of difference that occur between the signal profiles of the wild-type and mutant sequences, an enumeration algorithm infers the mutation position and estimates the proportion of the corresponding mutant sequence. With three two-nucleotide addition schemes, Epds can additionally identify the mutated base. Extensive simulations show that, with the coefficient of variation of the sequencing signal fixed at 0.0016, Epds detects SNPs present at a proportion above 0.02 in a pooled sample with an accuracy of more than 89% and a false discovery rate of only 3%. Epds performs better than existing pooled-sample SNP detection methods based on single-nucleotide addition sequencing. Finally, validation with a real pooled sequencing experiment shows that Epds can be applied effectively to detecting SNPs in pooled samples. The Epds code is publicly available at http://bioinfo.seu.edu.cn/Epds.
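The encoding and decoding steps of combinatorial pooled sequencing described in the abstract can be illustrated with a small sketch. It assumes noise-free, presence/absence pool results and uses a simple "all pools positive" decoder from the group testing literature; it is not the thesis's heuristic Bayesian decoder, and the function names and the parameters (200 samples, 24 pools, 4 pools per sample) are hypothetical choices made only for illustration.

```python
import math
import random


def random_kset_design(n_samples, n_pools, k, seed=0):
    """Encoding: give each sample a distinct random set of k pools."""
    if math.comb(n_pools, k) < n_samples:
        raise ValueError("not enough distinct k-subsets of pools for all samples")
    rng = random.Random(seed)
    seen = set()
    design = []
    while len(design) < n_samples:
        pools = frozenset(rng.sample(range(n_pools), k))
        if pools not in seen:  # the pooling pattern is the sample's unique code
            seen.add(pools)
            design.append(pools)
    return design


def positive_pools(design, carriers):
    """Simulated noise-free result: a pool is positive if it holds any carrier."""
    hit = set()
    for sample in carriers:
        hit |= design[sample]
    return hit


def decode(design, positives):
    """Decoding: keep every sample whose pools are all positive.

    True carriers are never missed under the noise-free assumption, but
    non-carriers can be reported when several carriers are present.
    """
    return [s for s, pools in enumerate(design) if pools <= positives]


# Illustrative parameters only: 200 samples, 24 pools, 4 pools per sample.
design = random_kset_design(n_samples=200, n_pools=24, k=4)
candidates = decode(design, positive_pools(design, carriers=[3, 150]))
print(candidates)
```

With a single carrier this decoder returns exactly that sample, because every pattern has the same size and is unique. With several carriers it can also return non-carriers, which is one reason a quantitative design that uses read counts, as in the thesis, can be more efficient.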

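The abstract states that Ehapp2 estimates haplotype proportions with an expectation-maximization algorithm that uses sequencing quality values. The sketch below shows the general idea under strong simplifying assumptions: each read covers a single biallelic SNP rather than a local haplotype, the haplotypes are known, and the per-read error probability comes from the standard Phred definition. It is not the Ehapp2 implementation, and all names and data are hypothetical.

```python
def phred_to_error(q):
    """Standard Phred scale: Q = -10 * log10(error probability)."""
    return 10 ** (-q / 10)


def em_haplotype_frequencies(haplotypes, reads, n_iter=200):
    """Estimate mixing proportions of known haplotypes from single-SNP reads.

    haplotypes: list of allele lists, haplotypes[h][j] is 0 or 1
    reads: list of (snp_index, observed_allele, phred_quality)
    """
    n_hap = len(haplotypes)
    freqs = [1.0 / n_hap] * n_hap  # start from uniform proportions
    for _ in range(n_iter):
        totals = [0.0] * n_hap
        for snp, allele, qual in reads:
            err = phred_to_error(qual)
            # E-step: posterior probability that the read came from each haplotype
            weights = [
                freqs[h] * ((1 - err) if haplotypes[h][snp] == allele else err)
                for h in range(n_hap)
            ]
            norm = sum(weights)
            for h in range(n_hap):
                totals[h] += weights[h] / norm
        # M-step: new proportions are the average posterior weights
        freqs = [t / len(reads) for t in totals]
    return freqs


# Toy data: three haplotypes mixed at 0.6 / 0.3 / 0.1, 100 reads per SNP, Q20.
haplotypes = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
reads = (
    [(0, 1, 20)] * 60 + [(0, 0, 20)] * 40
    + [(1, 1, 20)] * 30 + [(1, 0, 20)] * 70
    + [(2, 1, 20)] * 10 + [(2, 0, 20)] * 90
)
print(em_haplotype_frequencies(haplotypes, reads))
```

Low-quality reads contribute weights that are closer to the current proportions, so they pull the estimate less than high-quality reads do. This is the sense in which quality values dampen the effect of sequencing errors.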
[Degree-granting institution]: Southeast University
[Degree level]: Doctorate
[Year degree conferred]: 2017
[Classification number]: Q78


Article ID: 1438280


Article link: http://sikaile.net/shoufeilunwen/jckxbs/1438280.html

