Research on a Biomedical Named Entity Recognition Method Based on the Hadoop Platform and the Hidden Markov Model

Published: 2018-03-30 04:08

  Topic: biomedical named entity recognition. Approach: hidden Markov model. Source: Northwest A&F University, master's thesis, 2017


[Abstract]: Biomedicine is an interdisciplinary field that has developed steadily in recent years. Its body of specialist knowledge keeps growing, and so does the volume of related text. These massive text collections contain a great deal of valuable information and data, and the goal of current big-data-based biomedical text mining is to extract that information from massive data for researchers to use. Biomedical named entity recognition is a key step in biomedical text mining. Traditional centralized biomedical named entity recognition methods have difficulty handling massive text data, so this study uses distributed computing on the Hadoop platform to train named entity recognition models and to process large-scale data. The research consists of two main parts.
(1) The parameters of a hidden Markov model (HMM) are trained on the Hadoop platform. By counting, over the training corpus, the distribution of initial states, the number of transitions between states, and the distribution of observations emitted by each state, the three HMM parameters are obtained: the initial state probability distribution, the state transition probability matrix, and the symbol emission probability matrix. To evaluate the parameter training efficiency and the named entity recognition performance of the HMM on the Hadoop platform, a conditional random field (CRF) model is used for comparison. The gradient vector of the feature function weights in the CRF model is computed in parallel on the Hadoop platform, and the optimal model parameters are found iteratively. The comparison of the two models on the Hadoop platform shows that, with the same training data, the recognition performance of the CRF model is slightly higher than that of the HMM, but as the data volume grows, the training efficiency of the HMM on the Hadoop platform is far higher than that of the CRF model. This thesis selects the HMM for named entity recognition on large-scale biomedical text on the Hadoop platform.
(2) Biomedical named entity recognition is performed with the HMM on the Hadoop platform in two MapReduce jobs. The first job cleans the test data, removing useless information that introduces noise and producing new test data. In the second job, the Map phase performs sentence splitting, tokenization, and part-of-speech tagging, and sends the sentences with part-of-speech tags as output to the Reduce phase. The Reduce phase calls the Viterbi algorithm, using the HMM parameters trained in (1), to label the named entities in each sentence, and finally outputs sentences with biomedical named entity tags. Experimental results on the Hadoop platform show that, for large-scale biomedical text, named entity recognition on the Hadoop platform is far more efficient than single-machine processing and can save a great deal of processing time.
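
The abstract does not include the thesis code. As a rough illustration only, the two ideas it describes (estimating the three HMM parameters by counting, then labeling a sentence with the Viterbi algorithm) can be sketched in plain single-process Python. Everything specific here is an assumption made for the sketch: the functions `map_counts` and `reduce_counts` only imitate the shape of a map and a reduce step and do not use Hadoop, the two-tag set (`O`, `GENE`) and the toy corpus are invented, and add-one smoothing with an `<unk>` token is one common way to handle unseen events, not necessarily what the thesis used.

```python
import math
from collections import Counter

def map_counts(tagged_sentence):
    """Map-like step (hypothetical): emit (key, 1) pairs for one tagged sentence."""
    pairs, prev = [], None
    for i, (word, tag) in enumerate(tagged_sentence):
        if i == 0:
            pairs.append((("init", tag), 1))         # initial-state count
        else:
            pairs.append((("trans", prev, tag), 1))  # state-transition count
        pairs.append((("emit", tag, word), 1))       # emission count
        prev = tag
    return pairs

def reduce_counts(all_pairs):
    """Reduce-like step (hypothetical): sum the values emitted for each key."""
    counts = Counter()
    for key, value in all_pairs:
        counts[key] += value
    return counts

def estimate(counts, states, vocab, alpha=1.0):
    """Turn counts into log-probabilities; add-alpha smoothing is an assumption."""
    def log_dist(prefix, outcomes):
        total = sum(counts[prefix + (o,)] for o in outcomes) + alpha * len(outcomes)
        return {o: math.log((counts[prefix + (o,)] + alpha) / total) for o in outcomes}
    pi = log_dist(("init",), states)                    # initial state distribution
    A = {s: log_dist(("trans", s), states) for s in states}  # transition matrix
    B = {s: log_dist(("emit", s), vocab) for s in states}    # emission matrix
    return pi, A, B

def viterbi(words, states, pi, A, B, unk="<unk>"):
    """Return the most probable state sequence for a non-empty list of words."""
    obs = [w if w in B[states[0]] else unk for w in words]
    score = [{s: pi[s] + B[s][obs[0]] for s in states}]
    back = []
    for o in obs[1:]:
        prev, row, ptr = score[-1], {}, {}
        for s in states:
            best = max(states, key=lambda p: prev[p] + A[p][s])
            row[s] = prev[best] + A[best][s] + B[s][o]
            ptr[s] = best
        score.append(row)
        back.append(ptr)
    path = [max(states, key=lambda s: score[-1][s])]
    for ptr in reversed(back):          # follow back-pointers from the last word
        path.append(ptr[path[-1]])
    return path[::-1]

# Toy corpus, invented for illustration.
corpus = [
    [("the", "O"), ("BRCA1", "GENE"), ("gene", "O")],
    [("mutations", "O"), ("in", "O"), ("TP53", "GENE")],
    [("TP53", "GENE"), ("regulates", "O"), ("BRCA1", "GENE")],
]
states = ["O", "GENE"]
vocab = sorted({w for sent in corpus for w, _ in sent} | {"<unk>"})

counts = reduce_counts(pair for sent in corpus for pair in map_counts(sent))
pi, A, B = estimate(counts, states, vocab)
tags = viterbi(["the", "TP53", "gene"], states, pi, A, B)
print(list(zip(["the", "TP53", "gene"], tags)))
```

Because the counting step is a sum over independent sentences, it splits naturally across mappers, which is consistent with the abstract's observation that HMM training scales better than the iterative gradient computation needed for the CRF.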
[Degree-granting institution]: Northwest A&F University
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: R318

Article ID: 1684276



Article link: http://sikaile.net/yixuelunwen/swyx/1684276.html

