Research on Methods for Extracting Protein-Protein Interaction Relations Based on Biomedical Text Mining
[Abstract]: In recent years, as the biomedical literature has grown rapidly, using data mining to acquire biomedical knowledge from it has become a hot topic in bioinformatics. Proteins carry out their most fundamental biological functions through protein-protein interaction (PPI), yet a large amount of PPI information is recorded in the biomedical literature as unstructured data, and finding it by manual review is very time-consuming. It is therefore important to use text mining techniques to analyze the literature and extract PPI relations accurately. In PPI relation extraction research, the task is usually treated as a binary classification problem. Statistical and machine learning algorithms are applied: features are extracted from biological text to form feature vectors, a classification model is built on them, and good extraction results have been obtained. However, the machine learning methods used in current research are usually supervised and need a large amount of labeled PPI relation data to build the classification model, and labeling data in the biomedical domain costs a great deal of manpower and time.
To reduce the amount of labeled data needed to build the classification model, this thesis studies the following two aspects. 1. Extracting protein interaction relations based on distant supervision and transfer learning. The PPI relation data set to be classified is regarded as the target-domain data set. To reduce the demand for labeled data in the target domain, this study uses transfer learning: it builds a relation extraction model by transferring knowledge from source-domain PPI relation data sets that follow a different distribution, and uses that model to classify the PPI samples in the target domain. Following the idea of distant supervision, an artificially labeled corpus is constructed as the source-domain PPI data set. PPI data are first downloaded from the IntAct protein interaction database as the relation knowledge base, and biomedical literature abstracts are crawled from the PubMed database as the raw corpus. The PPI pairs in the knowledge base are then mapped onto the raw corpus, and sentences containing protein pairs are retrieved by heuristic matching. Instances for which a knowledge-base PPI maps onto the raw corpus are taken as positive samples and the rest as negative samples, which yields the artificially labeled PPI data set. Using the instance-based transfer learning method TrAdaBoost, a classification model is trained on the constructed source-domain PPI data set together with part of the target-domain PPI data set, and it classifies the PPI samples in the target domain.
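The distant-supervision labeling step described above can be sketched roughly as follows. This is a minimal illustration and not the thesis's actual pipeline: the function name `label_sentences`, the whole-word matching heuristic, and the toy knowledge base and sentences are all assumptions made for the example, and no real IntAct or PubMed access is shown.

```python
import re
from itertools import combinations


def label_sentences(sentences, known_pairs, protein_names):
    """Label co-occurring protein pairs by lookup in a PPI knowledge base.

    Hypothetical helper: a pair found in `known_pairs` becomes a positive
    example (1), any other co-occurring pair a negative example (0).
    """
    # Store pairs as frozensets so that (A, B) and (B, A) are the same pair.
    kb = {frozenset(pair) for pair in known_pairs}
    examples = []
    for sentence in sentences:
        # Assumed heuristic: whole-word, case-sensitive mention matching.
        found = sorted(
            name
            for name in protein_names
            if re.search(r"\b" + re.escape(name) + r"\b", sentence)
        )
        for a, b in combinations(found, 2):
            label = 1 if frozenset((a, b)) in kb else 0
            examples.append((sentence, a, b, label))
    return examples


# Toy stand-ins for the knowledge base and the crawled abstracts.
known_pairs = [("BRCA1", "BARD1")]
protein_names = ["BRCA1", "BARD1", "TP53"]
sentences = [
    "BRCA1 forms a heterodimer with BARD1.",
    "TP53 and BARD1 were both measured in the assay.",
    "No protein is mentioned here.",
]
examples = label_sentences(sentences, known_pairs, protein_names)
for sentence, a, b, label in examples:
    print(label, a, b, "|", sentence)
```

Labels produced this way are noisy, because a sentence can mention two interacting proteins without describing their interaction. That noise is one reason a source corpus built like this is combined with some target-domain data through transfer learning rather than used on its own.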
Experiments on three standard data sets evaluate this approach, in which the classification model is built with the artificial data set constructed by distant supervision, for the case where few labeled samples are available in the target domain. 2. Extracting protein interactions based on transfer learning and distant supervision in a PU (Positive Unlabeled) scenario. Data are often unlabeled or only sparsely labeled, as with the PPI data sets involved in this study. Because of the limits of experimental conditions, for most candidate PPI relations it has not been determined whether the proteins interact, so these data can be treated as an unlabeled set; only a small number of PPI relations have been confirmed experimentally, and these can be treated as positive samples. In this case, traditional supervised algorithms cannot build effective classification models to identify PPI relations in the biological literature. Building on distant supervision, this thesis combines transfer learning and PU learning and proposes a method for extracting protein interaction relations based on transfer learning and distant supervision in the PU scenario. The method collects feature information from the target PPI data set, transfers knowledge into the weights of the source PPI data set samples using the data gravitation method, estimates probability parameters on the weighted source PPI data set based on Bayesian theory, and constructs a weight-based PU learning algorithm using static classifier ensemble techniques. The experimental results show that the proposed TPAODE algorithm needs no class labels for the target-domain PPI data set and only labels for the interacting samples in the source-domain PPI data set; with a classification model built on the source-domain and target-domain PPI data sets, it performs comparably to or better than conventional PU methods. To further reduce the model's need for labeled data, the artificial PPI data set constructed by distant supervision is used as the source-domain data set, and PPI samples in the target domain are classified with a model learned from this source data set and a target data set containing only a few positive samples. The results show that the proposed TPAODE algorithm still classifies better than the existing PU learning methods PNB and PTAN.
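The idea of weighting source samples by their attraction to the target data, then estimating probabilities on the weighted source set, might look something like the sketch below. The abstract does not give the thesis's data gravitation formula or the internals of TPAODE, so everything here is an assumption: the inverse-square "pull", the helper names `gravitation_weights` and `weighted_feature_prob`, the smoothing constants, and the toy arrays. It illustrates the general idea only and is not a reimplementation of TPAODE.

```python
import numpy as np


def gravitation_weights(source_X, target_X, eps=1.0):
    """Weight each source sample by its summed pull toward target samples.

    Assumed formula: pull = sum over targets of 1 / (squared distance + eps).
    Returns weights normalized to sum to 1.
    """
    # Pairwise differences via broadcasting: (n_source, n_target, n_features).
    diff = source_X[:, None, :] - target_X[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=2)
    pull = np.sum(1.0 / (sq_dist + eps), axis=1)
    return pull / pull.sum()


def weighted_feature_prob(column, weights, alpha=1.0):
    """Weighted, Laplace-smoothed estimate of P(feature = 1), binary column."""
    scaled = weights * len(weights)  # rescale so the weights average to 1
    return (np.sum(scaled * column) + alpha) / (np.sum(scaled) + 2 * alpha)


# Toy data: two source samples lie near the target samples, one lies far away.
source_X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
target_X = np.array([[0.0, 0.0], [1.0, 0.0]])
weights = gravitation_weights(source_X, target_X)

# A binary feature that is on only for the two source samples near the target.
feature = np.array([1.0, 1.0, 0.0])
prob = weighted_feature_prob(feature, weights)
print(weights, prob)
```

Source samples that resemble the target domain dominate the estimate, so the weighted probability moves toward what the nearby samples show. A Bayesian classifier trained on such weighted counts would lean on the source data most relevant to the target domain.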
[Degree-granting institution]: Northwest A&F University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: Q51; TP391.1