Machine Learning-Based Prediction of PolyA Sites in Microsporidia
Topic: microsporidia  Approach: SVM  Source: Southwest University, 2017 master's thesis  Document type: degree thesis
[Abstract]: Bioinformatics emerged with the launch and development of the Human Genome Project. The convergence of biology and information technology has advanced computer science and has greatly promoted applied research in biology. The State Key Laboratory of Silkworm Genome Biology at Southwest University is a leading silkworm research laboratory in China. Its current research covers the silkworm genome and functional genomics, silkworm genetic resources and modern sericulture industry technology, and sericultural pathogenic microorganisms and the utilization of microbial resources. Silkworm pathogens infect silkworms and impair their growth and development, causing considerable losses to the sericulture industry, so the topic has attracted a growing number of researchers. Organisms change continually and genomic information varies widely. Many machine learning algorithms have already been applied to gene prediction in human and rice, but computational studies of microsporidia, a pathogen that infects silkworms, remain scarce. This thesis uses machine learning algorithms to predict PolyA sites in microsporidia and studies the problem in depth. Compared with experimental biological methods, this approach improves working efficiency and offers a useful direction for microsporidian research in biology.
Machine learning uses computation to improve a system's own performance from experience. As new techniques and methods appear in computing, they are gradually being applied to bioinformatics, and their use in gene prediction is increasingly widespread. Polyadenylation is an important step in the formation of mature mRNA in eukaryotic cells, and predicting its sites is important for discovering coding genes in genomic sequences. After in-depth discussion with the silkworm microsporidia research group, this thesis takes as its study object the microsporidian Encephalitozoon cuniculi, a silkworm pathogen that lacks effective gene prediction methods. Features are extracted from Encephalitozoon cuniculi gene sequences using the Z curve, the position-specific scoring matrix and k-th order nucleotide frequencies. After the k-th order features are extracted, the k-th order nucleotide frequency features are combined, and the best combination is chosen by comparing experimental results. The best combination, together with the position-specific scoring matrix and the Z curve, forms the final input features. PCA is then applied to these features to reduce the dimension of the feature space and thereby the complexity of the algorithm. Finally, different classifiers are trained on the resulting features to obtain the PolyA site predictions for microsporidia. The method selects the best k-th order nucleotide frequency features according to the expression preference of microsporidian gene sequences, which contributes to the final extraction of PolyA site features and thereby affects the classification results. Choosing a suitable feature extraction method is therefore critical for the accuracy of the PolyA site prediction algorithm.
Support vector machines are widely used in different fields, with many results in text classification, license plate recognition and image retrieval. This thesis applies support vector machines, neural networks and the KNN algorithm to PolyA site prediction in microsporidia, and the experimental results show that the support vector machine classifies comparatively well. The kernel function is an important factor in support vector machine classification. Conditionally positive definite kernels have already been widely used in text classification and face recognition, and the experiments in this thesis show that the polynomial kernel classifies comparatively well. On that basis, the polynomial kernel and a conditionally positive definite kernel are linearly combined to form a new kernel function, and this mixed kernel is applied to PolyA site prediction in microsporidia. The experimental results show that, with the mixed kernel as the SVM kernel and with the model parameters tuned, classification performance improves substantially. The work facilitates future research on microsporidian biology, provides a basis for the effective prevention and control of silkworm diseases and pests, and has significant application value.
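The feature extraction and mixed-kernel ideas summarized in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation. The function names, the length normalization of the Z-curve coordinates, the choice of the power kernel -||u - v||^beta as the conditionally positive definite kernel, the convex weighting `lam`, and all default parameter values are assumptions made here for illustration. The position-specific scoring matrix, PCA and the classifier training are omitted.

```python
from itertools import product
from math import sqrt


def kmer_frequencies(seq, k):
    """Relative frequencies of all 4**k k-mers over ACGT, in lexicographic order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # windows containing other symbols (e.g. N) are skipped
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def z_curve(seq):
    """Z-curve coordinates at the end of the sequence, divided by its length.

    The division by length is an assumption made here so that sequences of
    different lengths are comparable.
    """
    a, c, g, t = (seq.count(b) for b in "ACGT")
    n = len(seq) or 1
    x = (a + g) - (c + t)  # purine vs. pyrimidine
    y = (a + c) - (g + t)  # amino vs. keto
    z = (a + t) - (g + c)  # weak vs. strong hydrogen bonding
    return [x / n, y / n, z / n]


def sequence_features(seq, ks=(1, 2)):
    """Concatenate k-mer frequencies for each k in ks with the Z-curve coordinates."""
    feats = []
    for k in ks:
        feats.extend(kmer_frequencies(seq, k))
    feats.extend(z_curve(seq))
    return feats


def poly_kernel(u, v, degree=2, coef0=1.0):
    """Polynomial kernel (u . v + coef0) ** degree."""
    return (sum(a * b for a, b in zip(u, v)) + coef0) ** degree


def cpd_power_kernel(u, v, beta=1.0):
    """Power kernel -||u - v|| ** beta, conditionally positive definite for 0 < beta <= 2."""
    dist = sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return -(dist ** beta)


def mixed_kernel(u, v, lam=0.5, degree=2, coef0=1.0, beta=1.0):
    """Linear combination lam * polynomial + (1 - lam) * CPD kernel (weights assumed)."""
    return (lam * poly_kernel(u, v, degree, coef0)
            + (1.0 - lam) * cpd_power_kernel(u, v, beta))
```

A kernel of this form could be passed to an SVM library that accepts user-defined kernels by evaluating `mixed_kernel` on every pair of feature vectors to build the Gram matrix. The weight `lam` and the kernel parameters would then be tuned by cross-validation, in the spirit of the parameter adjustment the abstract mentions.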
[Degree-granting institution]: Southwest University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: Q811.4; TP181