Research and Implementation of Multiple Key Text Mining Techniques
Topic: text mining + new word discovery. Source: Harbin Institute of Technology, master's thesis, 2017
【Abstract】: Text mining is the computer-based processing of text, including information mining, meaning analysis, classification and labeling, and association analysis, with the goal of extracting usable information and even knowledge from text. The growth of the Internet industry and the digitization of other industries have supplied abundant text corpora for text mining, and at the same time demand continual improvement in the accuracy, effectiveness, computational efficiency, and personalization of text mining systems. Text mining must extract valuable information from plain text as a foundation for further information services. Widely used kinds of such information include new words belonging to a specific semantic category, text event types, text event elements, and document summaries.

This thesis studies and implements methods for several core text mining problems: new word discovery for a specific semantic category, event type recognition on the ACE2005 corpus together with event element recognition built on the event type information, and automatic summarization of single and multiple documents. The new word discovery, event recognition, and automatic summarization systems were each evaluated on their own annotated corpora and achieved satisfactory results.

For new word discovery in a specific semantic category, the thesis notes that annotating a corpus with category labels is expensive and starts from the observation that new words of the same category appear in similar contexts. It designs a new word discovery method based on bootstrapping and soft pattern matching. A new word is split into several parts according to its semantic characteristics, and the sentence containing it is divided into multiple slots around those parts. Each candidate new word is scored by the word-vector similarity and the word-frequency-vector similarity between its slots and the corresponding slots of already labeled new words, and high-scoring candidates are added to the labeled set. In experiments on an electronic medical record corpus, symptom new words were split into two parts, body part and characteristic, and symptom new word discovery reached an F-score of 81.40%.

For event type recognition and event element recognition on the ACE2005 corpus, the thesis improves on other researchers' methods based on support vector machine classifiers. In event type recognition, it adds features related to candidate trigger words and candidate elements, derived from the positions of the candidate trigger words in the same sentence and from information about the events they trigger, and it optimizes the text preprocessing. In event element recognition, it adds new features related to entity, value, and time labels. Both methods were evaluated on the Chinese and English ACE2005 corpora, which carry event labels and the corresponding entity, time, and value annotations. Event type recognition reached an F-score of 64.2% and event element recognition an F-score of 63.7%.

For automatic summarization, the thesis combines the TextRank algorithm with clustering. It uses the BM25 algorithm and several sentence similarity measures to set the edge weights of the undirected TextRank graph, uses clustering to reduce redundancy in the generated summaries, and uses the relations between sentences and documents as the basis for extracting the summary. The system was tested on single-document and multi-document summarization at several summary lengths on the DUC2001 and DUC2002 corpora and evaluated with the ROUGE toolkit, with good results.
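The summarization approach described in the abstract (BM25-weighted edges in an undirected TextRank graph) can be illustrated with a minimal sketch. This is not the thesis's implementation: the whitespace tokenizer, the averaging used to make the BM25 scores symmetric, the parameter values `k1`, `b`, and `damping`, and all function names are assumptions made for illustration, and the clustering step for redundancy reduction is omitted.

```python
import math
from collections import Counter


def bm25_matrix(sentences, k1=1.5, b=0.75):
    """w[i][j] = BM25 score of tokenized sentence j, using sentence i as the query."""
    n = len(sentences)
    avg_len = sum(len(s) for s in sentences) / n
    doc_freq = Counter(term for s in sentences for term in set(s))
    # Smoothed IDF; the "+ 1.0" inside the log keeps every weight non-negative.
    idf = {t: math.log((n - d + 0.5) / (d + 0.5) + 1.0) for t, d in doc_freq.items()}
    term_freqs = [Counter(s) for s in sentences]
    w = [[0.0] * n for _ in range(n)]
    for i, query in enumerate(sentences):
        for j in range(n):
            if i == j:
                continue  # no self-loops in the sentence graph
            norm = k1 * (1 - b + b * len(sentences[j]) / avg_len)
            w[i][j] = sum(
                idf[t] * term_freqs[j][t] * (k1 + 1) / (term_freqs[j][t] + norm)
                for t in set(query)
            )
    return w


def textrank(weights, damping=0.85, iterations=50):
    """Weighted PageRank-style iteration over a sentence graph."""
    n = len(weights)
    out_sum = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(sentences, num_sentences=2):
    """Return the top-ranked sentences, kept in their original order."""
    tokens = [s.lower().split() for s in sentences]  # assumed tokenizer
    w = bm25_matrix(tokens)
    n = len(sentences)
    # BM25 is asymmetric; averaging both directions yields an undirected graph
    # (an assumption, the abstract does not say how symmetry is obtained).
    sym = [[(w[i][j] + w[j][i]) / 2 for j in range(n)] for i in range(n)]
    scores = textrank(sym)
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:num_sentences]
    return [sentences[i] for i in sorted(top)]
```

Calling `summarize(list_of_sentences, num_sentences=3)` returns the three highest-ranked sentences in document order. A sentence that shares no terms with any other sentence receives only the baseline score `(1 - damping) / n`, so it is ranked last.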
【Degree-granting institution】: Harbin Institute of Technology
【Degree level】: Master's
【Year degree conferred】: 2017
【Classification number】: TP391.1