Research on Methods for Extracting Protein-Protein Interaction Relations Based on Biomedical Text Mining

Published: 2018-10-15 15:43
[Abstract]: In recent years, as the volume of biomedical literature has grown rapidly, using data mining techniques to extract biomedical knowledge from that literature has become a research hotspot in bioinformatics. Protein-protein interaction (PPI) is one of the most fundamental and important ways in which proteins carry out their biological functions, and a large amount of PPI information is recorded in the biomedical literature as unstructured data. Finding PPI information by manual review is time-consuming and laborious, so applying text mining to mine and analyze PPI relations in the biomedical literature, and thereby extract them accurately, is of great significance. Existing work treats PPI relation extraction from biomedical literature as a binary classification problem and mostly adopts statistical and machine learning algorithms: features are extracted from biomedical text to form feature vectors, from which a classification model is built, and this approach has achieved good extraction results. However, the machine learning methods used in existing work are usually supervised and need a large amount of labeled PPI relation data to build the classification model, while manually annotating PPI relation corpora in the biomedical domain costs considerable labor and time. To reduce the amount of labeled data needed to build a classification model, this thesis studies the following two aspects.

1. Extracting PPI relations based on distant supervision and transfer learning. The PPI relation dataset to be classified is treated as the target-domain dataset. To reduce the need for labeled data in the target domain, this study uses transfer learning: knowledge is transferred from a source-domain PPI relation dataset with a different distribution to build a relation extraction model, which then classifies the unclassified target-domain PPI samples. The source-domain PPI dataset is an artificially labeled corpus built following the idea of distant supervision. PPI data are first downloaded from the IntAct protein interaction database to serve as the relation knowledge base, and abstracts of biomedical articles are crawled from PubMed to serve as the raw corpus. The PPI pairs in the knowledge base are then mapped onto the raw corpus, with heuristic matching used to retrieve the sentences that contain each pair. PPIs for which a mapping exists in the raw corpus are taken as positive samples and the rest as negative samples, which yields the artificially labeled PPI dataset. The instance-based transfer learning method TrAdaBoost is then used to build a classification model on the constructed source-domain PPI dataset together with part of the target PPI dataset, and the model classifies the target-domain PPI samples. Experiments on three standard datasets show that the artificial dataset built by distant supervision helps the algorithm build the classification model well, and that, when few labeled samples are available in the target domain, extracting target-domain PPI relations by transferring knowledge from the artificial dataset performs well.

2. Extracting PPIs based on transfer learning and distant supervision in the PU (Positive Unlabeled) scenario. In practice, data are often unlabeled or only sparsely labeled, as is the case for the PPI datasets in this study. Because of experimental constraints, for many existing PPI relations it cannot be determined whether an interaction exists, so these data can be treated as an unlabeled dataset; only a small number of PPI relations have been experimentally verified to interact, and these can be treated as positive samples. In this situation, traditional supervised algorithms cannot build an effective classification model for identifying PPI relations in the biomedical literature. Building on distant supervision, this study approaches the problem from both transfer learning and PU learning and proposes TPAODE, a PPI relation extraction method based on transfer learning and distant supervision in the PU scenario. The method collects feature information from the target PPI dataset, uses a data gravitation method to assign weights to the source PPI samples for knowledge transfer, estimates probability parameters on the weighted source PPI dataset based on Bayesian theory, and uses a static classifier ensemble technique to build a weight-based PU learning algorithm. Experiments show that TPAODE needs no class labels for the target-domain PPI dataset; with only some of the interacting samples labeled in the source-domain PPI dataset, the classification model built from the source-domain and target-domain PPI datasets performs comparably to or better than traditional PU methods. To further reduce the model's need for labeled data, the artificial PPI dataset built earlier by distant supervision is used as the source-domain dataset: a model is learned from source and target datasets with only a small number of positive samples and is used to classify the target-domain PPI samples. The results show that, with the distant-supervision dataset, TPAODE still achieves better classification performance than the existing PU learning methods PNB and PTAN.
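
The distant-supervision labeling step described in the abstract (map knowledge-base PPI pairs onto raw sentences, then mark matched pairs positive and the rest negative) might look roughly like the sketch below. It is a minimal illustration only, not the thesis implementation: the function name `label_sentences`, the exact-name regular-expression matching, and the toy protein names are all assumptions, and the heuristic matching rules actually used are not given in the abstract.

```python
import re
from itertools import combinations


def label_sentences(sentences, proteins, known_pairs):
    """Label co-mentioned protein pairs by lookup in a PPI knowledge base.

    Hypothetical helper: a pair found in `known_pairs` is labeled 1
    (positive), any other co-mentioned pair is labeled 0 (negative).
    """
    # Store pairs as unordered sets so (A, B) and (B, A) match alike.
    kb = {frozenset(pair) for pair in known_pairs}
    # Assumed matching heuristic: whole-word, case-insensitive name match.
    patterns = {
        name: re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        for name in proteins
    }
    examples = []
    for sentence in sentences:
        mentioned = sorted(n for n, p in patterns.items() if p.search(sentence))
        # Every unordered pair of proteins mentioned together is a candidate.
        for a, b in combinations(mentioned, 2):
            label = 1 if frozenset((a, b)) in kb else 0
            examples.append((sentence, a, b, label))
    return examples


# Toy data with made-up protein names, for illustration only.
sentences = [
    "ProtA binds ProtB in the nucleus.",
    "ProtA and ProtC were both measured.",
    "No protein is mentioned here.",
]
examples = label_sentences(
    sentences,
    proteins=["ProtA", "ProtB", "ProtC"],
    known_pairs=[("ProtB", "ProtA")],
)
for example in examples:
    print(example)
```

Labels produced this way are noisy, since a sentence can mention two interacting proteins without asserting the interaction. That noise is one reason the resulting corpus is used as a source domain for transfer learning rather than as gold-standard training data.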
[Degree-granting institution]: Northwest A&F University
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: Q51;TP391.1


Article ID: 2273008


Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2273008.html

