Research and Implementation of Several Key Text Mining Technologies
Topics: text mining + new word discovery; Source: Harbin Institute of Technology, master's thesis, 2017
[Abstract]: Text mining refers to computer-based processing of text, such as information mining, meaning analysis, classification and labeling, and association analysis, which extracts information, and even knowledge, that people can use. The growth of the Internet industry and the informatization of other industries have supplied text mining with abundant corpus resources, while also demanding continual improvement in the accuracy, effectiveness, computational efficiency, and degree of personalization of text mining systems. Text mining must extract valuable information from plain text and provide a foundation for informatization; new words belonging to a specific semantic category, text event types, text event elements, and document summaries are among the most widely used kinds of such information. This thesis studies and implements methods for several core text mining problems: new word discovery for a specific semantic category, event type recognition on the ACE2005 corpus together with event element recognition that builds on the event type information, and automatic summarization of single and multiple documents. The new word discovery, event recognition, and automatic summarization systems were each evaluated on their own annotated corpora and achieved fairly satisfactory results.

For new word discovery targeting a specific semantic category, the thesis takes into account the high cost of annotating a corpus with category labels and starts from the observation that new words of the same category share similar context. It designs a new word discovery method based on bootstrapping and soft pattern matching: each new word is split into several parts according to its semantic characteristics, and the sentence containing it is divided into multiple slots according to those parts. Candidate new words are scored by the word-vector similarity and word-frequency-vector similarity between their slots and those of already labeled new words, and high-scoring candidates are added to the labeled set. In experiments on an electronic medical record corpus, symptom new words were split into two parts, body part and property, and symptom new word discovery reached an F-score of 81.40%.

For event type recognition and event element recognition on the ACE2005 corpus, the thesis improves on other researchers' methods based on support vector machine classifiers. In event type recognition, it adds features related to candidate trigger words and candidate elements, drawing on the positions of the candidate trigger words in the same sentence and the events they trigger, and it optimizes the text preprocessing. The effectiveness of both methods is measured on the Chinese and English ACE2005 corpora, which carry event labels together with the corresponding entity, time, and value annotations; new features related to entity, value, and time labels are also added for event element recognition. Event type recognition reached an F-score of 64.2% and event element recognition 63.7%.

For the automatic summarization task, the thesis combines the TextRank algorithm with clustering. It uses the BM25 algorithm and several sentence similarity measures to set the edge weights in TextRank's undirected graph model, uses clustering to try to reduce redundant information in the summaries, and takes the relationships between sentences and documents as the basis for extracting a summary. The system was tested on the DUC2001 and DUC2002 corpora with single-document and multi-document summaries of several lengths and evaluated with the ROUGE toolkit, with good results.
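The summarization approach described in the abstract (TextRank over an undirected sentence graph whose edge weights come from BM25) can be illustrated with a minimal sketch. This is not the thesis implementation: the function names `bm25` and `textrank`, the parameter values `k1=1.5`, `b=0.75`, and `damping=0.85`, the smoothed IDF form, and the averaging of the two BM25 directions into one undirected weight are all assumptions made for illustration, and the clustering step for redundancy reduction is omitted.

```python
import math
from collections import Counter


def bm25(query, doc, doc_freq, n_docs, avg_len, k1=1.5, b=0.75):
    """Okapi BM25 score of token list `doc` for token list `query`."""
    tf = Counter(doc)
    score = 0.0
    for term in set(query):
        if term not in tf:
            continue
        df = doc_freq[term]
        # Smoothed IDF; the "1 +" inside the log keeps it non-negative.
        idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
        norm = tf[term] + k1 * (1.0 - b + b * len(doc) / avg_len)
        score += idf * tf[term] * (k1 + 1.0) / norm
    return score


def textrank(sentences, damping=0.85, iterations=50):
    """Return one TextRank score per tokenized sentence."""
    n = len(sentences)
    if n == 0:
        return []
    # Document frequency of each term, treating every sentence as a document.
    doc_freq = Counter(t for s in sentences for t in set(s))
    avg_len = sum(len(s) for s in sentences) / n

    # BM25 is asymmetric, so average both directions (an assumption)
    # to obtain a single weight for each undirected edge.
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            a = bm25(sentences[i], sentences[j], doc_freq, n, avg_len)
            c = bm25(sentences[j], sentences[i], doc_freq, n, avg_len)
            w[i][j] = w[j][i] = (a + c) / 2.0

    degree = [sum(row) for row in w]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Weighted PageRank update; sentences with no edges pass on no score.
        scores = [
            (1.0 - damping) / n
            + damping * sum(
                w[j][i] / degree[j] * scores[j]
                for j in range(n)
                if degree[j] > 0
            )
            for i in range(n)
        ]
    return scores


sentences = [
    "text mining extracts information from text".split(),
    "new word discovery supports text mining".split(),
    "the weather was pleasant yesterday".split(),
]
scores = textrank(sentences)
# Pick the index of the highest-scoring sentence as a one-sentence summary.
best = max(range(len(sentences)), key=lambda i: scores[i])
```

In this toy input the third sentence shares no terms with the others, so it has no edges and receives only the baseline score. A fuller system would pick several top-ranked sentences up to a length limit and filter near-duplicates, for example with the clustering the abstract mentions.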
[Degree-granting institution]: Harbin Institute of Technology
[Degree level]: Master's
[Year degree conferred]: 2017
[Classification number]: TP391.1
Article ID: 1830655
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1830655.html