Research on Methods for Extracting Protein-Protein Interaction Relations Based on Biomedical Text Mining

Published: 2018-10-15 15:43

[Abstract]: In recent years, as the volume of biomedical literature has grown rapidly, using data mining techniques to obtain biomedical knowledge from that literature has become a research hotspot in bioinformatics. Protein-protein interaction (PPI) is one of the most fundamental and important ways in which proteins carry out their biological functions, and a large amount of PPI information is recorded in biomedical literature as unstructured data. Finding PPI information by manually reviewing the literature is time-consuming and laborious, so using text mining to mine and analyze the protein interaction relations in biomedical literature, and thereby extract PPI relations accurately, is of great significance.

Existing research treats the extraction of PPI relations from biomedical literature as a binary classification problem. Most PPI extraction work uses statistical and machine learning algorithms: features are extracted from biological text to form feature vectors, and a classification model is built on them, which has achieved good extraction results. However, the machine learning methods used in existing research are usually supervised and need a large amount of labeled PPI relation data to build the classification model, and in the biomedical field, manually annotating a PPI relation corpus costs a great deal of labor and time. To reduce the amount of labeled data needed to build a classification model, this thesis studies the following two aspects.

1. Extracting protein interaction relations based on distant supervision and transfer learning. The PPI relation data set to be classified is treated as the target-domain data set. To reduce the need for labeled data in target-domain PPI relation extraction, this study uses transfer learning: knowledge is transferred from a source-domain PPI relation data set with a different distribution to build a relation extraction model, which then classifies the unclassified PPI samples in the target domain. The source-domain PPI data set is an artificially labeled corpus built on the idea of distant supervision. PPI data are first downloaded from the IntAct protein interaction database to serve as the relation knowledge base, and abstracts of biomedical literature are crawled from the PubMed database to serve as the raw corpus. The PPI pairs in the knowledge base are mapped onto the raw corpus, and heuristic matching is used to obtain the sentences that contain each PPI. PPIs for which a mapping exists in the raw corpus are taken as positive samples and the rest as negative samples, which yields the artificially labeled PPI data set. The instance-based transfer learning method TrAdaboost is then used to build a classification model on the constructed source-domain PPI data set together with part of the target PPI data set, and the model classifies the PPI samples in the target domain. Experimental results on three standard data sets show that the artificial data set built with distant supervision helps the algorithm build its classification model well, and that when few labeled samples are available in the target domain, extracting target-domain PPI relations by transferring knowledge from the artificial data set performs well.

2. Extracting protein interactions based on transfer learning and distant supervision in the PU (Positive Unlabeled) scenario. In practical applications, data are often unlabeled or only sparsely labeled, as is the case for the PPI data sets involved in this study. Because of the constraints of experimental conditions, for many existing PPI relations it cannot be determined whether an interaction exists, so this part of the data can be treated as an unlabeled data set. Only a small number of PPI relations have been experimentally verified as true interactions, and this part of the data can be treated as positive samples. In this situation, traditional supervised algorithms cannot build an effective classification model to identify PPI relations in biological literature. Building on distant supervision, this study approaches the problem from the two perspectives of transfer learning and PU learning, and proposes the TPAODE algorithm, a method for extracting protein interaction relations based on transfer learning and distant supervision in the PU scenario. The method collects feature information from the target PPI data set, uses a data gravitation method to assign weights to the samples of the source PPI data set in order to transfer knowledge, estimates probability parameters on the weighted source PPI data set based on Bayesian theory, and uses a static classifier ensemble technique to build a weight-based PU learning algorithm. Experimental results show that the proposed TPAODE algorithm needs no class labels for the target-domain PPI data set. With only some of the interacting samples labeled in the source-domain PPI data set, it builds a classification model from the source-domain and target-domain PPI data sets and performs comparably to or better than traditional PU methods. To further reduce the model's need for labeled data, this study also uses the artificial PPI data set built earlier with distant supervision as the source-domain data set, learns a model from the source data set (which has only a few positive samples) and the target data set, and classifies the PPI samples in the target domain. The results show that, with the distantly supervised data set, the proposed TPAODE algorithm still achieves better classification performance than the existing PU learning methods PNB and PTAN.
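
The distant-supervision labeling step described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the function name `label_sentences`, the exact-token matching heuristic, and the placeholder protein names (`ProtA`, `ProtB`, `ProtC`) are all assumptions made for illustration. The thesis says only that PPI pairs from a knowledge base are mapped onto sentences of the raw corpus by heuristic matching, with mapped pairs taken as positive samples and the rest as negative samples.

```python
import re
from itertools import combinations


def label_sentences(sentences, known_pairs, protein_names):
    """Toy distant-supervision labeling (hypothetical helper).

    For every pair of known protein names that co-occur in a sentence,
    emit (sentence, protein_a, protein_b, label), where label is 1 if the
    pair is listed in the knowledge base and 0 otherwise.
    """
    # Store pairs as frozensets so that (A, B) and (B, A) are the same pair.
    kb = {frozenset(pair) for pair in known_pairs}
    names = set(protein_names)
    examples = []
    for sentence in sentences:
        # Assumed heuristic: exact match on alphanumeric tokens.
        tokens = set(re.findall(r"[A-Za-z0-9\-]+", sentence))
        mentioned = sorted(tokens & names)
        for a, b in combinations(mentioned, 2):
            label = 1 if frozenset((a, b)) in kb else 0
            examples.append((sentence, a, b, label))
    return examples


# Toy inputs: placeholder names, not real database records.
sentences = [
    "ProtA binds ProtB in the nucleus.",
    "ProtC and ProtA were both measured.",
]
known_pairs = [("ProtA", "ProtB")]  # stands in for pairs from the knowledge base
protein_names = {"ProtA", "ProtB", "ProtC"}

examples = label_sentences(sentences, known_pairs, protein_names)
for example in examples:
    print(example)
```

Labels produced this way are noisy: a sentence may mention two proteins that interact without actually describing the interaction. That noise is one reason such a corpus is better suited as a source domain for transfer learning than as direct training data for the target domain.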
[Degree-granting institution]: Northwest A&F University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: Q51;TP391.1


Article ID: 2273008


Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2273008.html
