Research on siRNA Feature Selection Based on the Relief Algorithm

Published: 2018-03-23 01:32

Topic: siRNA　Focus: siRNA interference efficiency　Source: Jilin University, 2017 master's thesis　Type: degree thesis

[Abstract]: RNA interference (ribonucleic acid interference) is a biotechnology that silences the expression of a target gene by introducing double-stranded RNA into an organism. Designing siRNAs with a high inhibition rate is an important prerequisite for RNA interference. Designing efficient siRNAs purely through biological experiments is expensive, slow, and inefficient, so using computational methods to optimize the design of high-inhibition siRNAs in advance is a reliable route for RNA interference technology. Bioinformatics-based siRNA design applies machine learning to existing experimental data sets to build a prediction model: the user inputs a target mRNA sequence, the model outputs candidate siRNA sequences with a high predicted inhibition rate, and only a few biological experiments are then needed for validation.

Several siRNA prediction tools already exist, but most rely only on the sequence features of the siRNA itself, which limits prediction accuracy. Other tools use a more comprehensive feature set but skip feature selection, an important data preprocessing step, so building the prediction model is very time-consuming and accuracy also suffers. In practical machine learning tasks it is usually necessary to perform feature selection right after the data are obtained, and doing so also improves efficiency when the learner is trained later.

Filter-style feature selection first selects features from the data set and only then trains the learner, so the selection process is independent of the subsequent learner. A filter algorithm evaluates features by assigning a weight score to every feature of the data, without building a model in the process. Once the scores are assigned, features whose weight is below a set threshold are removed and features above the threshold are retained; the retained features are then used for feature analysis, classification, or building a feature-relationship model.

For the 107 siRNA features in common use today, this thesis designs a concrete Relief feature selection procedure based on the actual distribution of the experimental data set. The experiment selected 88 relevant features and removed 19 irrelevant ones. Training a random forest prediction model on the 88 relevant features raised the 10-fold cross-validation correlation coefficient from 0.629 to 0.640, while also making the random forest model more efficient to build and reducing the running time of the siRNA software.

The thesis also finds a statistically clear positive correlation between the siRNA inhibition rate and the energy difference between the 5' ends of the siRNA duplex: the higher this energy difference, the higher the inhibition rate, and the lower the energy difference, the lower the inhibition rate. A subsequent statistical analysis of the Dieter Huesken data set gave the following results: (1) position 1 from the 5' end of the antisense strand should be A or U, not G or C; (2) position 2 should be A or U, not G or C; (3) position 7 should not be C; (4) position 14 should not be G;
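The filter step described in the abstract (score every feature, then keep those above a threshold) can be sketched as follows. This is a minimal illustration of the textbook two-class Relief update, not the procedure designed in the thesis, which the abstract does not spell out. It assumes binary labels (for example, high versus low inhibition after a cutoff), features already scaled to [0, 1], and Manhattan distance for the nearest-neighbour search. The names `relief_weights` and `select_features` are hypothetical.

```python
import random


def relief_weights(X, y, n_iter=None, seed=0):
    """Textbook two-class Relief (an assumption, not the thesis's variant).

    X: list of feature rows, each value scaled to [0, 1].
    y: list of binary class labels, one per row.
    Returns one weight per feature; larger means more relevant.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    n_iter = n if n_iter is None else n_iter
    weights = [0.0] * d

    def dist(a, b):
        # Manhattan distance between two feature rows.
        return sum(abs(p - q) for p, q in zip(a, b))

    for _ in range(n_iter):
        i = rng.randrange(n)  # sample one instance with replacement
        hits = [j for j in range(n) if j != i and y[j] == y[i]]
        misses = [j for j in range(n) if y[j] != y[i]]
        if not hits or not misses:
            continue  # no same-class or other-class neighbour available
        hit = min(hits, key=lambda j: dist(X[i], X[j]))
        miss = min(misses, key=lambda j: dist(X[i], X[j]))
        for f in range(d):
            # Reward features that differ on the nearest miss,
            # penalise features that differ on the nearest hit.
            gain = abs(X[i][f] - X[miss][f]) - abs(X[i][f] - X[hit][f])
            weights[f] += gain / n_iter
    return weights


def select_features(weights, threshold):
    """Indices of features kept by the filter (weight >= threshold).

    How a weight exactly equal to the threshold is treated is a choice
    made here; the abstract only covers the below and above cases.
    """
    return [f for f, w in enumerate(weights) if w >= threshold]


if __name__ == "__main__":
    # Toy data: feature 0 tracks the label, feature 1 is noise.
    X = [[0.0, 0.2], [0.1, 0.9], [0.0, 0.5],
         [1.0, 0.3], [0.9, 0.8], [1.0, 0.4]]
    y = [0, 0, 0, 1, 1, 1]
    w = relief_weights(X, y)
    print(w, select_features(w, threshold=0.1))
```

Because the weights are computed without training any learner, the same selected subset could then be passed to whichever model comes next, which is the sense in which the filter is independent of the learner.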
[Degree-granting institution]: Jilin University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: Q78;TP181

