Research on Named Entity Extraction Methods for Symptom Phenotypes
Published: 2019-06-16 19:14
[Abstract]: Symptom phenotypes (symptoms and signs) are important entity-level information in clinical data and medical bibliographic records, and they are the main basis for diagnosis and treatment in both traditional Chinese medicine (TCM) and Western medicine. In medical data, however, this information is carried mainly by free-text clinical records (chiefly the chief complaint and the history of present illness) and by bibliographic literature records, so named entity extraction of symptom phenotypes is the first key step toward using it. In recent years, named entity extraction from clinical records has become an active research direction, but most related work targets diseases, drugs, and clinical problems; the more complex task of extracting symptom phenotype entities has received comparatively little attention. Given the importance of symptom phenotype information in TCM diagnosis and treatment, this thesis studies extraction methods for symptom phenotype named entities on TCM clinical records (mainly the history of present illness) and PubMed bibliographic records. Using a relatively large annotated corpus together with unlabeled data, it investigates several approaches: Bootstrapping, classification learning (conditional random fields and structured support vector machines), and feature learning (word embedding and network embedding). The work covers three aspects.
(1) On the basis of manual review and data preprocessing, an annotated corpus of 1200 TCM clinical records, consisting mainly of history-of-present-illness text, was constructed. On this corpus, an unsupervised Bootstrapping-based symptom phenotype entity extraction method and a conditional random field (CRF) based named entity extraction method were developed, with F1 values of 64.73% and 95.03% respectively, indicating that CRF largely meets the requirements for extracting symptom phenotype entities from history-of-present-illness text. To test fully open extraction performance, cross-test corpora were constructed across different disease types, across chief complaint and history of present illness, and across first and follow-up visits; CRF performance reached 82%, 58.21%, and 81.18% respectively, which provides a reference for subsequent research on transferable named entity extraction.
(2) By introducing deep feature representations (word embedding and network embedding), combining them with structured support vector machine (SSVM) and CRF classification models, and integrating unlabeled clinical record data, several symptom phenotype entity extraction methods were developed (the WENER and GENER methods). The WENER method reached F1 values of 98.08% (SSVM) and 97.63% (CRF). The GENER method with character features reached F1 values of 88.42% and 86.01% respectively, and with word features 95.04% and 95.00%.
(3) For symptom phenotype entity extraction from medical literature, experiments applying the WENER and GENER methods to 1200 PubMed bibliographic records gave F1 values of 93.58% and 93.23% for WENER and 93.57% and 92.04% for GENER.
These results show that symptom phenotype named entity extraction based on deep representations has clear advantages both in integrating unlabeled corpora and in performance, and already has some practical value for Chinese and English named entity extraction. Integrating larger unlabeled corpora would provide a technical basis for high-performance extraction of all types of medical named entities, thereby promoting the construction and development of large-scale medical knowledge graphs.
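As a reading aid for the F1 values quoted in the abstract, here is a minimal sketch of how such a score is commonly computed for named entity extraction. The abstract does not state the scoring procedure, so this assumes entity-level exact-match scoring over BIO tags for a single entity type. The names `bio_to_spans` and `entity_f1` and the tag set are illustrative and are not taken from the thesis.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO tag sequence (assumed tag set: B, I, O)."""
    spans: Set[Span] = set()
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:      # close the entity that was open
                spans.add((start, i))
            start = i                  # open a new entity
        elif tag == "I":
            if start is None:          # stray I: treat it as opening an entity
                start = i
        else:                          # "O" closes any open entity
            if start is not None:
                spans.add((start, i))
                start = None
    if start is not None:              # entity running to the end of the sequence
        spans.add((start, len(tags)))
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1); a predicted entity counts only on exact span match."""
    tp = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        gold_spans, pred_spans = bio_to_spans(g), bio_to_spans(p)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical five-token sentence: two gold entities, one of them found exactly.
gold = [["B", "I", "O", "B", "O"]]
pred = [["B", "I", "O", "O", "B"]]
print(entity_f1(gold, pred))
```

Under exact-match scoring, a prediction that overlaps a gold entity but misses its boundary counts as both a false positive and a false negative, which is one reason boundary-heavy text such as symptom descriptions can score lower than token-level accuracy would suggest.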
[Degree-granting institution]: Beijing Jiaotong University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.1
Article number: 2500769
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2500769.html