Research on Named Entity Extraction Methods for Symptom Phenotypes

Published: 2019-06-16 19:14

[Abstract]: Symptom phenotypes (symptoms and signs) are important entity-level information in clinical data and in medical bibliographic literature data, and they are the main basis for diagnosis and treatment in both Chinese and Western medicine. In medical data, however, symptom phenotype information is carried mainly by free-text clinical records (chiefly the chief complaint and the history of present illness) and by bibliographic records, so named entity extraction of symptom phenotypes is the first key step toward using that information. Named entity extraction from clinical records has become an active research direction in recent years, but most related work targets diseases, drugs and clinical problems; the more complex task of extracting symptom phenotype entities has received little attention. Given the importance of symptom phenotype information in traditional Chinese medicine (TCM) diagnosis and treatment, this thesis studies methods for extracting symptom phenotype named entities from TCM clinical records (mainly the history of present illness) and from PubMed bibliographic records. Using a relatively large annotated corpus together with unlabeled data, it investigates several approaches: Bootstrapping, classification learning (conditional random fields and structured support vector machines) and feature learning (word embedding and network embedding). The work covers the following three aspects.

(1) After manual review and data preprocessing, an annotated corpus of 1,200 TCM clinical records, consisting mainly of histories of present illness, was constructed. On this corpus, an unsupervised symptom phenotype entity extraction method based on Bootstrapping and a named entity extraction method based on conditional random fields (CRF) were developed. Their F1 values reached 64.73% and 95.03% respectively, indicating that CRF largely meets the requirements for extracting symptom phenotype entities from history-of-present-illness text in clinical records. To test fully open extraction performance, cross-test corpora were built across different diseases, across chief complaint versus history of present illness, and across first visits versus follow-up visits. CRF performance in these settings reached 82%, 58.21% and 81.18% respectively, which provides a reference for subsequent research on transferable named entity extraction.

(2) By introducing deep feature representations (word embedding and network embedding), combining them with structured support vector machine (SSVM) and CRF classification models, and integrating unlabeled clinical record data, several symptom phenotype entity extraction methods were developed (the WENER and GENER methods). The WENER method reached F1 values of 98.08% (SSVM) and 97.63% (CRF). The GENER method based on character features reached F1 values of 88.42% and 86.01%, while the GENER method based on word features reached 95.04% and 95.00%.

(3) For symptom phenotype entity extraction from medical literature, the WENER and GENER methods were applied experimentally to 1,200 PubMed bibliographic records. The WENER method reached F1 values of 93.58% and 93.23%, and the GENER method reached 93.57% and 92.04%.

These results show that symptom phenotype named entity extraction based on deep representations has clear advantages both in integrating unlabeled corpora and in performance, and that it already has some practical value for named entity extraction in Chinese and English. Integrating larger unlabeled corpora would provide a technical basis for high-performance extraction of all types of medical named entities, and would in turn promote the construction and development of large-scale medical knowledge graphs.
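The F1 values quoted in the abstract can be read more concretely with a small sketch of entity-level scoring. The abstract does not state the tagging scheme or the matching criterion, so the following is only an assumption: it uses character-level BIO tags for a single entity type and counts a predicted entity as correct only when its span exactly matches a gold span. The function names `bio_to_spans` and `entity_f1`, and the example sentence, are hypothetical and not taken from the thesis.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start, end) offsets into the tag sequence, end exclusive


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence (single entity type, assumed)."""
    spans: Set[Span] = set()
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:      # close the entity that was open
                spans.add((start, i))
            start = i
        elif tag == "I":
            if start is None:          # stray I without a B: treat it as an entity start
                start = i
        else:                          # "O" closes any open entity
            if start is not None:
                spans.add((start, i))
                start = None
    if start is not None:              # entity running to the end of the sequence
        spans.add((start, len(tags)))
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over all sentences (assumed criterion)."""
    tp = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        gold_spans, pred_spans = bio_to_spans(g), bio_to_spans(p)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented example sentence: 咳嗽三天伴发热 ("cough for three days with fever").
# Gold entities: 咳嗽 (chars 0-1) and 发热 (chars 5-6).
gold_tags = [["B", "I", "O", "O", "O", "B", "I"]]
# Hypothetical prediction that finds 咳嗽 but tags only 热 as the second entity.
pred_tags = [["B", "I", "O", "O", "O", "O", "B"]]

print(entity_f1(gold_tags, pred_tags))  # (0.5, 0.5, 0.5)
```

Under exact matching, a partially overlapping prediction such as the second entity above counts as both a false positive and a false negative, which is one reason entity-level scores can be noticeably lower than per-character accuracy.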
[Degree-granting institution]: Beijing Jiaotong University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP391.1



Article ID: 2500769


Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2500769.html
