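Research on siRNA Feature Selection Based on the Relief Algorithm
Published: 2018-03-23 01:32
Topic: siRNA. Focus: siRNA interference efficiency. Source: Jilin University, 2017 master's thesis. Document type: degree thesis.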
[Abstract]: RNA interference (RNAi) is a biotechnology that silences the expression of a target gene by introducing double-stranded RNA into an organism. Designing siRNAs with a high inhibition rate is an important prerequisite for RNA interference. Designing efficient siRNAs purely through biological experiments is expensive, time-consuming, and inefficient, so using computational methods to optimize the design of high-inhibition siRNAs in advance is a reliable route for RNA interference technology. Bioinformatics-based siRNA design applies machine learning to existing experimental data sets to build a prediction model: the user inputs a target mRNA sequence, the model outputs candidate siRNA sequences with high inhibition rates, and only a few biological experiments are then needed for validation. Several siRNA prediction tools already exist, but most rely only on sequence features of the siRNA itself, which limits prediction accuracy. Others use a more comprehensive feature set but skip feature selection, an important data preprocessing step, so building the prediction model is very time-consuming and accuracy also suffers.
In practical machine learning tasks, it is usually necessary to perform feature selection once the data have been obtained, and doing so also makes the later training of the learner more efficient. Filter-style feature selection first selects features on the data set and only then trains the learner, so the selection process is independent of the subsequent learner. A filter algorithm evaluates features by assigning a weight score to every feature of the data, without building a model in the process. Features whose weight falls below a set threshold are removed; features above the threshold are retained and then used for feature analysis, classification, or building a feature relationship model.
For the 107 siRNA features in common use, this thesis designs a concrete Relief feature selection procedure suited to the actual distribution of the experimental data set. The procedure selected 88 relevant features and removed 19 irrelevant ones. Training a random forest prediction model on the 88 relevant features raised the 10-fold cross-validation correlation coefficient from 0.629 to 0.640, while also making the random forest model more efficient to build and reducing the time complexity of running the siRNA software. The thesis also finds a statistically clear positive correlation between the siRNA inhibition rate and the energy difference at the 5' ends of the siRNA duplex: the higher the energy difference, the higher the inhibition rate, and the lower the energy difference, the lower the inhibition rate. A statistical analysis of the Dieter Huesken data set then gave the following results: (1) position 1 from the 5' end of the antisense strand should be A or U, not G or C; (2) position 2 should be A or U, not G or C; (3) position 7 should not be C; (4) position 14 should not be G;
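The filter-style scoring and thresholding described in the abstract can be illustrated with a minimal sketch of the classic two-class Relief algorithm. The abstract does not give the thesis's actual procedure, so everything here is an assumption made for illustration: the function names `relief_weights` and `select_features`, the binarization of inhibition rates into high (1) and low (0) classes, the range-scaled Manhattan distance, the toy data, and the choice to keep features whose weight equals the threshold.

```python
import random

def relief_weights(X, y, n_iter=100, seed=0):
    """Feature weights from classic two-class Relief (illustrative sketch).

    X: list of numeric feature vectors; y: list of 0/1 class labels.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    # Per-feature value range, used to scale differences into [0, 1].
    lo = [min(row[j] for row in X) for j in range(d)]
    hi = [max(row[j] for row in X) for j in range(d)]
    span = [(hi[j] - lo[j]) or 1.0 for j in range(d)]  # avoid division by zero

    def diff(a, b, j):
        return abs(a[j] - b[j]) / span[j]

    def dist(a, b):
        return sum(diff(a, b, j) for j in range(d))

    w = [0.0] * d
    for _ in range(n_iter):
        i = rng.randrange(n)  # randomly sampled instance
        hits = [k for k in range(n) if k != i and y[k] == y[i]]
        misses = [k for k in range(n) if y[k] != y[i]]
        if not hits or not misses:
            continue
        near_hit = min(hits, key=lambda k: dist(X[i], X[k]))
        near_miss = min(misses, key=lambda k: dist(X[i], X[k]))
        for j in range(d):
            # Reward features that differ across classes and
            # penalize features that differ within the same class.
            w[j] += (diff(X[i], X[near_miss], j)
                     - diff(X[i], X[near_hit], j)) / n_iter
    return w

def select_features(weights, threshold):
    """Indices of features kept; weights below the threshold are removed."""
    return [j for j, wj in enumerate(weights) if wj >= threshold]

# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
toy_X = [[0.0, 0.3], [0.1, 0.9], [0.2, 0.1],
         [0.9, 0.2], [1.0, 0.8], [0.8, 0.5]]
toy_y = [0, 0, 0, 1, 1, 1]
toy_w = relief_weights(toy_X, toy_y, n_iter=50, seed=1)
kept = select_features(toy_w, threshold=0.0)  # keeps only feature 0
```

No learner is trained at any point in this sketch, which is what makes the method a filter: the retained feature indices could afterwards be passed to any model, such as the random forest mentioned in the abstract.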
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: Q78;TP181