
Prediction of Escherichia coli sRNAs Based on Transcription Termination Signals or Conservation

Published: 2018-08-10 07:27
[Abstract]: Bacterial sRNAs are a class of small regulatory RNAs, 40 to 500 nucleotides long, that are widespread in bacteria. They are mainly located in intergenic regions, but some lie in the 5' and 3' untranslated regions of protein-coding genes. Unlike typical non-coding RNAs such as tRNA or rRNA, bacterial sRNAs vary widely in length and have no conserved secondary-structure features.
Current research shows that bacterial sRNAs act mainly by binding target mRNAs or target proteins, and that they take part in regulating many biological processes in response to environmental change, such as plasmid replication, phage development, stress response, quorum sensing, bacterial virulence, and iron homeostasis. Moreover, among the more than one thousand bacterial genomes sequenced so far, sRNAs have been studied fairly thoroughly in only a few, such as E. coli, and a large number of bacterial sRNAs remain to be discovered. Identifying bacterial sRNAs is therefore of great importance.
However, genome-scale experimental discovery of sRNAs has many drawbacks: the procedures are complex, slow, and of low accuracy, and some sRNAs are expressed only under specific conditions. The usual strategy today combines bioinformatics prediction with experimental validation. Bioinformatics prediction of sRNAs is therefore important and can speed up sRNA discovery. In addition, the completed sequencing of many kinds of bacterial genomes and the construction of various RNA databases provide a data foundation for developing bioinformatics methods that predict sRNA genes.
Unlike protein-coding genes, which have easily recognizable features, sRNA genes usually have no clear coding signature and are not affected by frameshift or nonsense mutations, so dedicated bioinformatics prediction methods are needed. Existing methods fall into three classes: comparative genomics methods, transcription signal search methods, and machine learning methods.
Comparative genomics methods rest on the premise that sRNA genes show some sequence and structure conservation across the genomes of closely related species. This approach is currently widely used, but it cannot predict an sRNA gene that is specific to a single bacterium; it requires genome information from closely related species; and a conserved intergenic region may be some other type of genetic element rather than an sRNA. It also cannot identify sRNA genes located on the antisense strand of coding regions.
Transcription signal methods search intergenic regions for putative promoters or transcription factor binding sites together with Rho-independent terminator structures. Because current predictions of promoters and transcription factor binding sites have a high false positive rate, the resulting sRNA predictions also have a high false positive rate. In addition, these methods cannot predict sRNAs that have Rho-dependent terminators.
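The terminator half of the transcription signal search can be illustrated with a deliberately naive scan. The sketch below is not the method of any published tool: it assumes a Rho-independent terminator can be approximated as a perfectly base-paired hairpin followed by a T-rich tail on the DNA strand, and the function name `find_terminator_like` and all parameter defaults are hypothetical choices made for illustration. Real terminator finders score stem stability and tolerate mismatches.

```python
# Naive hairpin + T-tail scan (illustrative assumption, not a published algorithm).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_terminator_like(seq, stem=5, loop_range=(3, 8), tail=6, min_t=4):
    """Return (start, end) spans, 0-based and end-exclusive, where a perfect
    stem of `stem` nt with a loop of loop_range[0]..loop_range[1] nt is
    followed by a `tail`-nt window containing at least `min_t` T bases."""
    seq = seq.upper()
    hits = []
    last_start = len(seq) - 2 * stem - loop_range[0] - tail
    for i in range(last_start + 1):
        left = seq[i:i + stem]
        target = reverse_complement(left)  # what the right arm must look like
        for loop in range(loop_range[0], loop_range[1] + 1):
            j = i + stem + loop            # start of the right arm
            right = seq[j:j + stem]
            tail_seq = seq[j + stem:j + stem + tail]
            if len(right) < stem or len(tail_seq) < tail:
                break                      # ran past the end of the sequence
            if right == target and tail_seq.count("T") >= min_t:
                hits.append((i, j + stem + tail))
                break                      # keep the shortest loop per start
    return hits


# Toy sequence: stem GCCGC / GCGGC, loop TTCG, then six T bases.
example = "TTGCCGCTTCGGCGGCTTTTTTAC"
print(find_terminator_like(example))
```

Because the scan demands exact pairing, it will miss many real terminators and should be read only as a picture of the search pattern.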
Machine learning methods assume that bacterial sRNA gene sequences can be distinguished from the rest of the genome. However, feature extraction usually requires cutting sequence fragments into windows (some studies use a 50 nt window), and because sRNA length varies so much, an optimal window size is hard to find.
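The windowing step mentioned above can be sketched as follows. The abstract does not say which features the cited studies extract, so the dinucleotide frequencies used here are an assumption, as are the helper names `windows` and `kmer_features` and the step size of 10 nt. Only the 50 nt window comes from the text.

```python
from collections import Counter
from itertools import product


def windows(seq, size=50, step=10):
    """Yield (start, subsequence) for fixed-size windows along seq."""
    for start in range(0, len(seq) - size + 1, step):
        yield start, seq[start:start + size]


def kmer_features(window, k=2):
    """Return k-mer frequencies in a fixed order (AA, AC, ..., TT for k=2)."""
    n_kmers = len(window) - k + 1
    counts = Counter(window[i:i + k] for i in range(n_kmers))
    total = max(n_kmers, 1)  # guard against windows shorter than k
    return [counts["".join(p)] / total for p in product("ACGT", repeat=k)]


# A 120 nt toy sequence gives windows starting at 0, 10, ..., 70.
toy = "ACGT" * 30
feature_rows = [kmer_features(w) for _, w in windows(toy)]
print(len(feature_rows), len(feature_rows[0]))
```

With sRNAs ranging from 40 to 500 nt, a single 50 nt window covers an entire short sRNA but only a tenth of a long one, which is the difficulty the paragraph describes.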
To overcome some of these shortcomings, we reconsidered sRNA prediction in depth in order to better support the experimental discovery of sRNAs. We propose two methods, one that predicts sRNAs from transcription termination signals and one that predicts them through conservation analysis, and we apply both to the Escherichia coli genome.
The method based on transcription end-point features assumes that, over long-term evolution, the start and end points of bacterial sRNAs have acquired characteristic sequence and structure patterns in the genome, and that these patterns are not randomly distributed. Analysis of known E. coli sRNAs showed that the 5' end signal is weak while the 3' end sequence signal is strong. We therefore propose a prediction model based on the transcription end-point features of bacterial sRNAs, which can predict sRNA end positions fairly accurately. The model describes the end-point features with a base frequency matrix and uses statistical methods to separate the positive and negative data sets. The model was built from 63 positive training samples and 100,000 randomly generated negative training samples. At a threshold of 28.9524, the sensitivity and specificity on the training set were 34.92% and 100.00%, and the positive predictive value (PPV) reached its maximum of 100.00%. On the test set (22 positive samples and 10,000 negative samples), the sensitivity and specificity were 4.30% and 99.99%, and the PPV was 90.90%. The model's high specificity and PPV can give good support to experimental validation.
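A base frequency matrix and the reported rates can be made concrete with a small sketch. The abstract does not give the scoring formula behind the 28.9524 threshold, so the log-odds score against a uniform background used here is an assumption, as are the pseudocount and the function names `build_frequency_matrix`, `log_odds_score`, and `rates`. The toy sequences are invented and are not thesis data.

```python
import math

BASES = "ACGT"


def build_frequency_matrix(seqs, pseudocount=1.0):
    """Per-position base frequencies from equal-length aligned sequences."""
    matrix = []
    for pos in range(len(seqs[0])):
        counts = {b: pseudocount for b in BASES}
        for s in seqs:
            counts[s[pos]] += 1
        total = sum(counts.values())
        matrix.append({b: counts[b] / total for b in BASES})
    return matrix


def log_odds_score(seq, matrix, background=0.25):
    """Sum of log2(frequency / background) over positions (assumed scoring)."""
    return sum(math.log2(matrix[i][b] / background) for i, b in enumerate(seq))


def rates(tp, fp, tn, fn):
    """Return (sensitivity, specificity, PPV) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, specificity, ppv


# Invented toy 3' end windows, only to show the scoring mechanics.
toy_ends = ["TTTT", "TTTA", "TTTT"]
pfm = build_frequency_matrix(toy_ends)
print(log_odds_score("TTTT", pfm) > log_odds_score("ACGA", pfm))  # prints "True"

# Training-set figures: 63 positives, 100,000 negatives, no false positives.
# tp=22 is inferred from 34.92% of 63; it is not stated in the abstract.
sens, spec, ppv = rates(tp=22, fp=0, tn=100_000, fn=41)
print(f"{sens:.2%} {spec:.2%} {ppv:.2%}")  # prints "34.92% 100.00% 100.00%"
```

A candidate end position would be called positive when its score exceeds the chosen threshold. Raising the threshold trades sensitivity for specificity, which matches the pattern of low sensitivity and very high PPV reported above.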
Conservation-based prediction rests on the observation that known sRNAs are evolutionarily conserved across closely related species, and that the Rho-independent terminator of an sRNA, being an important functional element, is also conserved to some degree. We therefore consider a region a candidate sRNA only if it has a Rho-independent terminator that is conserved in several species together with a conserved sequence segment of a certain length. Under this assumption, we searched the intergenic regions of E. coli for Rho-independent terminators and analyzed their conservation. Once a terminator was confirmed as conserved, we analyzed the conservation of the terminator and its upstream segment across 39 Enterobacteriaceae genomes, and called the region an sRNA if the conserved upstream segment was 20 nt or longer. With the number of conserving genomes set to 7, the method predicted 335 possible sRNAs in 6,340 intergenic regions and recovered 21 of the 65 known sRNAs, giving a sensitivity of 32.3% and a specificity of 94.4%. The specificity is comparable to that of sRNAPredict2, and the sensitivity is 12 percentage points higher. This indicates a further improvement in predicting sRNAs from sequence conservation and Rho-independent terminators.
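The decision rule in this paragraph can be written down as a short filter. This is only a sketch: it assumes the per-candidate conservation counts have already been computed by some alignment step that the abstract does not describe, and it reads "number of conserving genomes set to 7" as "at least 7 of the 39 genomes", which is an interpretation. The `Candidate` class, its field names, and the example records are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    genomes_with_terminator: int  # genomes (of 39) where the terminator is conserved
    conserved_upstream_nt: int    # length of the conserved segment upstream of it


def is_candidate_srna(c, min_genomes=7, min_upstream_nt=20):
    """Apply the two thresholds from the text (assumed to be inclusive)."""
    return (c.genomes_with_terminator >= min_genomes
            and c.conserved_upstream_nt >= min_upstream_nt)


# Invented records, only to exercise the rule.
records = [
    Candidate("igr_a", genomes_with_terminator=12, conserved_upstream_nt=85),
    Candidate("igr_b", genomes_with_terminator=3, conserved_upstream_nt=120),
    Candidate("igr_c", genomes_with_terminator=9, conserved_upstream_nt=14),
]
print([c.name for c in records if is_candidate_srna(c)])  # prints "['igr_a']"

# Sensitivity from the counts in the text: 21 of 65 known sRNAs recovered.
print(f"{21 / 65:.1%}")  # prints "32.3%"
```

Requiring both conditions is what separates this approach from a plain terminator scan: a conserved terminator with no conserved upstream segment is rejected, as is a conserved segment with no terminator.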
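[Degree-granting institution]: Academy of Military Medical Sciences of the Chinese People's Liberation Army
[Degree level]: Master's
[Year degree granted]: 2011
[Classification number]: R378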


Article ID: 2175381


Article link: http://sikaile.net/xiyixuelunwen/2175381.html

