
Research on Methods and Key Technologies for Chinese Microblog Text Normalization

Posted: 2018-04-01 03:27

  Topic: Chinese microblogs. Focus: text normalization. Source: Wuhan University, doctoral dissertation, 2016.


[Abstract]: In recent years, microblogging has become one of the most important social media because of its short texts, immediacy, and viral propagation. It is now an important medium through which people follow news and current events, communicate with others, express themselves, and share and participate in society, and it is an important platform for public opinion, for brand and product promotion, and for the dissemination of traditional media. However, microblog text contains a large number of non-standard words, so traditional natural language processing tools perform poorly on it. Text normalization has therefore become an important preprocessing step in microblog text analysis. Unlike English non-standard words, which are usually out-of-vocabulary words, Chinese non-standard words take more complex forms, such as phonetic substitutions, abbreviations, paraphrases, and neologisms.

This dissertation studies the normalization of Chinese microblog text. Traditional methods usually treat a non-standard word as a spelling error and normalize it with a noisy channel model or a translation model. Other methods approach text normalization from a semantic perspective but still face key challenges. Based on the linguistic characteristics of Chinese microblog text, the dissertation studies three key problems in Chinese microblog text normalization: learning the senses of non-standard words, mining relations between non-standard and standard word pairs, and jointly handling text normalization and word segmentation. The specific work is as follows.

1. A word sense induction model based on a lexical chain hypergraph. Most non-standard words in microblogs carry new senses, so identifying a non-standard word can be viewed as a disambiguation task. Traditional dictionaries can no longer meet this need, and the key question is how to learn or induce word senses from microblog text itself. Word sense induction is an unsupervised task whose goal is to induce the senses of a target word from large-scale text. The dissertation proposes a word sense induction model based on a lexical chain hypergraph. The model uses lexical chains to represent the higher-order semantic relations among multiple instances of a target word and then builds a hypergraph from those chains, which captures complex higher-order semantic relations from a global perspective. Experimental results show that the proposed model is effective. The experiments also show how lexical chains affect system performance, and that the number of senses of a word and the semantic granularity have a considerable effect on the performance of a word sense induction system.

2. Mining non-standard/standard word pairs based on embedding representation learning. A non-standard word usually has a fixed standard word that corresponds to it, so building a dictionary of non-standard words helps text normalization. The key question is how to mine non-standard/standard word pairs from large-scale microblog text. Assuming that a non-standard word and its standard counterpart share the same sense, the dissertation proposes an embedding-based multi-sense learning model. In traditional multi-sense embeddings, the sense representations of different words are independent of each other; the proposed model overcomes this limitation by learning synonymy relations at the same time as it learns global multi-sense embeddings. By introducing window position information, the model effectively addresses the representation bias problem. Using this model together with post-processing steps such as filtering and classification, the dissertation proposes a framework for mining non-standard/standard word pairs from large-scale microblog corpora. Experimental results show that the method is effective.

3. A joint model for word segmentation, part-of-speech tagging, and text normalization. The dissertation explores text normalization and its applications. To address the word segmentation problem in Chinese microblogs, it proposes a joint model for word segmentation, part-of-speech (POS) tagging, and text normalization. The model builds on a transition-based joint segmentation and POS tagging model and adds transition actions that normalize the text. Segmentation is carried out on the normalized text, and good segmentation in turn helps to detect non-standard words, which benefits normalization. The model can be trained effectively on standard annotated corpora, which overcomes the shortage of annotated microblog data. Two kinds of features are used for scoring: features of the normalized text serve as common features, and features of the non-normalized text serve as domain features. This naturally yields feature augmentation and gives the model good domain adaptability. Experimental results show that the joint model lets the three tasks benefit from one another and that statistical language features help to improve their performance.
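As a rough illustration of the transition-based idea in the third contribution, the sketch below replays a hand-written sequence of actions over a character string. It is only a toy under stated assumptions: the abstract does not give the dissertation's action inventory, features, or scoring, so the action names `SEP`, `APP`, and `NORM`, the function `replay`, the one-entry lexicon, and the POS tags are all hypothetical. A real system would score candidate actions with a trained model and search over them rather than follow a fixed script.

```python
def replay(chars, actions, lexicon):
    """Replay transition actions over `chars` and return the resulting words.

    Hypothetical action set (not taken from the dissertation):
      ("SEP", pos)   start a new word with the next character and tag it `pos`
      ("APP", None)  append the next character to the current word
      ("NORM", None) map the current word to its standard form via `lexicon`
    """
    words = []  # each entry: {"raw": ..., "norm": ..., "pos": ...}
    i = 0       # index of the next unread character
    for name, arg in actions:
        if name == "SEP":
            if i >= len(chars):
                raise ValueError("SEP needs a remaining character")
            words.append({"raw": chars[i], "norm": chars[i], "pos": arg})
            i += 1
        elif name == "APP":
            if not words or i >= len(chars):
                raise ValueError("APP needs an open word and a remaining character")
            words[-1]["raw"] += chars[i]
            words[-1]["norm"] = words[-1]["raw"]
            i += 1
        elif name == "NORM":
            if not words:
                raise ValueError("NORM needs a word to normalize")
            raw = words[-1]["raw"]
            # Fall back to the raw form when the lexicon has no entry.
            words[-1]["norm"] = lexicon.get(raw, raw)
        else:
            raise ValueError(f"unknown action: {name}")
    if i != len(chars):
        raise ValueError("actions did not consume every character")
    return words


# Toy input: "神马" is a common phonetic substitution for "什么" ("what").
lexicon = {"神马": "什么"}
actions = [("SEP", "PN"), ("APP", None), ("NORM", None),
           ("SEP", "NN"), ("APP", None)]
result = replay("神马东西", actions, lexicon)
print(" ".join(f'{w["norm"]}/{w["pos"]}' for w in result))
```

Keeping both the raw and the normalized form of each word mirrors the abstract's point that features can be drawn from the normalized text (shared with standard corpora) and from the original microblog text (domain-specific).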

[Degree-granting institution]: Wuhan University
[Degree level]: Doctoral
[Year degree conferred]: 2016
[Classification number]: TP391.1

