Research on Drug Information Extraction Methods for Biomedical Text

Published: 2019-06-10 07:11

[Abstract]: With the development of biomedical research and Internet technology, the volume of biomedical literature available online has grown rapidly. This mass of unstructured literature contains rich and valuable knowledge. Drugs are among the most widely studied biomedical entities and are an important carrier of that knowledge. Extracting structured drug information from unstructured biomedical text serves researchers and medical professionals in the field, and it can also be used to expand and update existing drug knowledge bases. Drug information extraction from biomedical text has therefore attracted growing attention and has become an active research topic. Current work concentrates on two problems, drug name recognition and drug-drug interaction extraction, and the performance of existing methods does not yet meet the needs of practical applications. This thesis studies both problems in depth. The main contributions are as follows.

First, a drug name recognition method based on the fusion of multiple semantic features. Semantic features derived from drug name dictionaries help considerably in recognizing drug names and are widely used in machine-learning-based recognition methods. However, because dictionaries have limited coverage and are not updated promptly, dictionary-based semantic features have clear limitations. This thesis observes that large-scale unstructured biomedical literature contains many drug names that are absent from the dictionaries. To compensate for the weakness of dictionary-based features, the proposed method generates word-embedding-based semantic features from large-scale unstructured biomedical literature and uses them together with dictionary-based semantic features for drug name recognition. Experimental results show that the fused features outperform any single type of semantic feature.

Second, a drug name recognition method based on feature conjunction and feature selection. Feature conjunction combines several simple features of different types into one conjunction feature. Compared with a simple feature, a conjunction feature can represent multiple attributes of a word in a sentence at once. In drug name recognition there are many possible ways to combine features, and combining simple features directly produces a very large number of conjunction features, many of them noise, which degrades model performance. For this reason, existing drug name recognition methods generally use only simple features, apart from n-gram features. To make effective use of conjunction features, this thesis proposes a feature generation framework for drug name recognition. The framework has two modules: a feature conjunction module that combines simple features into conjunction features, and a feature selection module that removes the noise from the resulting feature set. Using this framework, word embedding features, dictionary features, and general features are combined, and the resulting features are fed to a conditional random field (CRF) model for drug name recognition. Experimental results show that this method outperforms drug name recognition using simple features alone.

Third, a drug-drug interaction extraction method based on a text-sequence convolutional neural network. The best-performing existing methods for drug-drug interaction extraction are based on support vector machines. These methods use many manually defined features and depend on various external natural language processing tools to generate them, so their performance is strongly affected by those tools. To reduce this dependence, this thesis proposes a drug-drug interaction extraction method based on a text-sequence convolutional neural network. The method takes as input only word embeddings learned by an unsupervised deep learning algorithm and randomly initialized position embeddings. It learns features automatically through convolution over the text sequence followed by max pooling, and passes them to a softmax classifier for relation extraction. Experimental results show that the method outperforms the traditional support-vector-machine-based methods.

Fourth, a drug-drug interaction extraction method based on a dependency-structure convolutional neural network. The text-sequence convolutional neural network ignores long-distance dependencies between words, which are important for drug-drug interaction extraction. This thesis therefore proposes a dependency-structure convolutional neural network that incorporates long-distance dependencies between words into the model. Experimental results show that introducing these dependencies improves drug-drug interaction extraction. Syntactic parsers make many errors when parsing long sentences, and these errors propagate into the dependency-structure model and hurt its performance. To avoid this error propagation, the text-sequence and dependency-structure convolutional neural network methods are combined according to sentence length. Experimental results show that this combination further improves drug-drug interaction extraction.
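
The data flow of the text-sequence convolutional neural network described in the third part (word vectors plus position vectors, convolution over the token sequence, max pooling, softmax) can be illustrated with a minimal pure-Python sketch. Everything specific here is an assumption made for illustration and is not taken from the thesis: the vector sizes, the window width, the tanh activation, the encoding of position as clipped distance to each of the two candidate drug mentions, and the example sentence. All weights are random and untrained, so the output probabilities carry no meaning; the sketch only shows how the pieces fit together.

```python
import math
import random

random.seed(0)

EMB_DIM = 4     # word-vector size (assumed)
POS_DIM = 2     # position-vector size (assumed)
WINDOW = 3      # convolution window width in tokens (assumed)
N_FILTERS = 5   # number of convolution filters (assumed)
N_CLASSES = 2   # e.g. "interaction" vs. "no interaction" (assumed)
MAX_DIST = 10   # relative distances are clipped to [-MAX_DIST, MAX_DIST]


def rand_vec(n):
    """Random initialization, as used for the position vectors."""
    return [random.uniform(-0.5, 0.5) for _ in range(n)]


def clip(distance):
    return max(-MAX_DIST, min(MAX_DIST, distance))


def encode(tokens, drug1, drug2, word_vecs, pos_vecs):
    """Represent each token as its word vector concatenated with two
    position vectors, one per candidate drug mention (token indices)."""
    rows = []
    for i, tok in enumerate(tokens):
        # Random vectors stand in for pretrained word vectors here.
        word = word_vecs.setdefault(tok, rand_vec(EMB_DIM))
        rows.append(word + pos_vecs[clip(i - drug1)] + pos_vecs[clip(i - drug2)])
    return rows


def conv_max_pool(rows, filters):
    """Apply each filter to every window of WINDOW consecutive tokens and
    keep the maximum response. Assumes len(rows) >= WINDOW."""
    pooled = []
    for weights, bias in filters:
        best = -math.inf
        for start in range(len(rows) - WINDOW + 1):
            window = [x for row in rows[start:start + WINDOW] for x in row]
            response = math.tanh(sum(w * x for w, x in zip(weights, window)) + bias)
            best = max(best, response)
        pooled.append(best)
    return pooled


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(tokens, drug1, drug2, params):
    rows = encode(tokens, drug1, drug2, params["word"], params["pos"])
    features = conv_max_pool(rows, params["filters"])
    scores = [sum(w * f for w, f in zip(ws, features)) + b for ws, b in params["out"]]
    return softmax(scores)


row_dim = EMB_DIM + 2 * POS_DIM
params = {
    "word": {},
    "pos": {d: rand_vec(POS_DIM) for d in range(-MAX_DIST, MAX_DIST + 1)},
    "filters": [(rand_vec(WINDOW * row_dim), 0.0) for _ in range(N_FILTERS)],
    "out": [(rand_vec(N_FILTERS), 0.0) for _ in range(N_CLASSES)],
}

# Made-up example sentence; the candidate drug pair sits at token indices 0 and 6.
sentence = "aspirin may increase the effect of warfarin".split()
probs = classify(sentence, 0, 6, params)
print(probs)
```

Because max pooling keeps one value per filter regardless of sentence length, the classifier always receives a fixed-size feature vector, which is what allows the model to work without hand-built features from external parsers or taggers.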
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Doctoral
[Year degree conferred]: 2016
[Classification number]: TP391.1

[Citation]: Liu Shengyu; Research on Drug Information Extraction Methods for Biomedical Text [D]; Harbin Institute of Technology; 2016


Article ID: 2496275


Link to this article: http://sikaile.net/shoufeilunwen/xxkjbs/2496275.html
