Verifying the Independent Evolution Law and Biological Functions of 8-mers in the Yeast Genome Sequence
Published: 2018-08-02 17:39
[Abstract]: The usage of k-mers across whole-genome sequences is non-random, and different kinds of k-mers have different biological functions. Uncovering the rules of k-mer usage and the biological functions of k-mers is important for understanding the evolution of genome structure and for a systematic understanding of functional segments. Studies of k-mer spectra in hundreds of species have found that the k-mer spectra of tetrapods are multimodal, whereas those of other organisms are unimodal. The origin of the multimodal spectra is disputed: some studies attribute it mainly to different types of functional or structural elements, others hold that it is characterized by G+C content and CpG suppression, and still others argue that the multiple peaks are formed by two classes of rare k-mers. The cause of the genomic k-mer spectrum therefore remains open. Using statistical analysis and bioinformatics methods, and drawing on the distribution pattern of the human k-mer spectrum, this thesis studies the k-mer spectrum of the yeast genome sequence, explores the independent evolution mechanism of the CG-class 8-mer subsets, and proposes and tests theoretical conjectures about the biological functions of CG-class motifs. The main research contents are as follows.

(1) The distribution of the relative number of 8-mer motifs versus frequency of occurrence (the 8-mer spectrum for short) was computed for the sequence of human chromosome 1, and the spectrum was found to be trimodal. For each of the 16 XY dinucleotides, all 8-mers were divided into three subsets. Only under the CG dinucleotide classification did the three subsets each form an independent unimodal distribution: CG0 (8-mers containing no CG dinucleotide), CG_1 (8-mers containing exactly one CG) and CG_2 (8-mers containing two or more CGs). This is called the independent evolution law of CG-class motifs. The positions of the three CG subset distributions correspond exactly to the three peaks of the overall 8-mer distribution, so the distance between the three CG subset distributions is the direct cause of whether the overall spectrum is unimodal or multimodal. Compared with the 8-mer spectrum of a random sequence, the spectrum of CG0 motifs lies near the random center, while the spectra of CG_1 and CG_2 motifs lie far from it. This indicates that 8-mers containing the CG dinucleotide evolve directionally, whereas 8-mers without it evolve randomly. The three CG subset distributions have two features: (i) the most probable frequency of the CG_2 and CG_1 distributions is clearly lower than that of the CG0 distribution; (ii) the CG_2 and CG_1 distributions are clearly narrower than the CG0 distribution. These two features indicate that the usage of 8-mers in the CG_2 and CG_1 subsets is conserved. After analyzing the sequence characteristics of the three CG subsets, nucleosome center sequences (NCSs) and CpG islands (CGIs), two theoretical conjectures are proposed: CG_1 motifs are nucleosome-binding motifs, and CG_2 motifs are the motif units of CGIs.

(2) The 8-mer spectrum of the yeast genome sequence is unimodal. The distribution of the relative number of 8-mer motifs versus frequency was computed under all 16 dinucleotide classifications in yeast. Only the CG subset distributions show the two features of the human CG subset distributions, which indicates that the usage of 8-mers in the CG_2 and CG_1 subsets is also conserved in yeast, and that the unimodal distribution in yeast results from the superposition of three CG subset distributions that lie too close together. The conclusion is that the independent evolution law of CG motif usage is already present in yeast, the simplest eukaryote. Because the CG subsets contain a large number of motifs, the frequencies of m-mers (m = 2, 3, 4) within the three CG subsets are used to characterize the motif information of each subset. The analysis first shows that the motif information of the three CG subsets deviates from that of the overall 8-mer set to different degrees. The total deviation of m-mer usage in the yeast genome sequence, measured by the new symmetric relative entropy (NSRE), was then examined under the 16 XY dinucleotide classifications, and the deviation in motif usage was found to be largest under the CG classification. This leads to the conclusion that the CG dinucleotide is the "core" of the emergence and evolution of functional elements as genomes evolve from simple to complex.

(3) To test whether CG_1 motifs are nucleosome-binding motifs, the motif information of the CG0, CG_1 and CG_2 subsets was each assigned to yeast nucleosome center sequences and linker sequences for a binary classification assessment. The average area under the ROC curve (AUC) was largest for the CG_1 motif information, showing that CG_1 motifs prefer nucleosome center sequences more than CG0 and CG_2 motifs do. The NSRE distribution along nucleosome center sequences was then obtained from the CG_1 subset motif information, and it agrees with published results. The results show that abundant motifs determine the basic framework of the nucleosome, while rare motifs determine its fine structure. When the canonical histone octamer is unrolled into a one-dimensional arrangement along the DNA double strand, the regions of maximal NSRE correspond one-to-one with the positions of the eight histones. Together, these two results support the conjecture that CG_1 motifs are nucleosome-binding motifs.

(4) Statistical analysis of nucleosome position data at single-base resolution shows that some nucleosomes are in a squeezed state. According to where the squeezing occurs, nucleosomes are divided into four classes: standard nucleosomes, upstream-squeezed nucleosomes, downstream-squeezed nucleosomes, and nucleosomes squeezed at both ends. Based on the conclusion that CG_1 motifs are nucleosome-binding motifs, the NSRE distribution along the center sequences of the four classes was analyzed. Squeezed nucleosomes vary with the sequence structure of the squeezed and non-squeezed ends, and the sequence in the squeezed region of a nucleosome is more strongly organized. Nucleosome linker sequences were then grouped by increasing length into 11 length groups, and the MEME online tool was used to search for conserved motifs in each group. Four classes of conserved motifs were found, which points to the diversity of linker sequences.

(5) To test whether CG_2 motifs are the motif units of CGIs, the motif information of CG_2, CG_1 and CG0 was each assigned to yeast CGIs and the corresponding non-CpG-island sequences for ROC analysis. The average AUC values were 0.95, 0.80 and 0.02, respectively, showing that the CG_2 motif information closely matches the composition of CGIs. The optimal threshold was selected on the ROC curve, and the total accuracy (AAC) and the Matthews correlation coefficient (MCC) were computed at that threshold. The result further confirms that CG_2 motif information can characterize CGI sequences, thereby verifying that CG_2 motifs are the structural units of CGIs.
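The CG0 / CG_1 / CG_2 partition and the per-subset 8-mer spectrum described in point (1) can be illustrated with a short Python sketch. This is not the thesis's own code. The function names (`count_kmers`, `cg_class`, `spectrum_by_class`) are hypothetical, and two choices are assumptions, since the abstract does not specify them: the "relative motif number" is taken to be the fraction of a subset's k-mers that occur at a given frequency, and k-mers that never occur are counted at frequency 0.

```python
from collections import Counter
from itertools import product

K = 8  # the thesis works with 8-mers; smaller k is handy for toy checks


def count_kmers(seq, k=K):
    """Count overlapping k-mers, skipping windows with non-ACGT characters."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def cg_class(kmer):
    """Assign a k-mer to CG0, CG_1 or CG_2 by its number of CG dinucleotides."""
    n = kmer.count("CG")  # "CG" cannot overlap itself, so this count is exact
    if n == 0:
        return "CG0"
    return "CG_1" if n == 1 else "CG_2"


def spectrum_by_class(counts, k=K):
    """Return {class: {frequency: fraction of that class's k-mers}}.

    Assumption: normalisation is by the size of each subset, and k-mers
    absent from the sequence contribute to frequency 0.
    """
    hist = {"CG0": Counter(), "CG_1": Counter(), "CG_2": Counter()}
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        hist[cg_class(kmer)][counts.get(kmer, 0)] += 1
    return {
        cls: {f: n / sum(h.values()) for f, n in sorted(h.items())}
        for cls, h in hist.items()
    }


if __name__ == "__main__":
    # Toy sequence only; a real analysis would read a chromosome from FASTA.
    toy = "ACGCGTTTACGATCGCGCGAAATTT" * 4
    for cls, spec in spectrum_by_class(count_kmers(toy)).items():
        print(cls, list(spec.items())[:3])
```

On a real genome, plotting each subset's spectrum on a shared frequency axis would show whether the three distributions separate into distinct peaks, as reported for human chromosome 1, or overlap into a single peak, as reported for yeast.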
[Degree-granting institution]: Inner Mongolia University
[Degree level]: Doctoral
[Year degree granted]: 2017
[Classification number]: Q78
Article ID: 2160180
Article link: http://sikaile.net/shoufeilunwen/jckxbs/2160180.html