Construction of a Chinese Dependency Graph Bank
Published: 2018-02-11 02:26
Keywords: syntax and semantics; dependency grammar; graph structure; annotation; graph bank. Source: Nanjing Normal University, master's thesis, 2015. Document type: degree thesis
[Abstract]: Natural language processing by computer has to recover the semantic relations between words from a linear sentence. Tree-shaped syntactic structures allow the main semantic relations among sentence constituents to be derived, and so play an important role in natural language processing. As corpora have grown in recent years, however, researchers have found that projective trees cannot describe syntactic structure completely, and that a considerable number of sentences require non-projective trees or graph structures. In addition, because of the characteristics of Chinese, the accuracy of Chinese syntactic parsing has long been low, and existing parsing techniques are poorly suited to certain special Chinese constructions (serial verb sentences, pivotal sentences, verb copying, long sentences, and so on). New technical means are urgently needed to solve this problem. Some researchers have proposed AMR (Abstract Meaning Representation), a graph-based representation of sentence meaning, for analyzing English. This thesis borrows that idea to explore an integrated syntactic-semantic annotation of Chinese based on dependency grammar (dependency graph annotation for short), and on that basis constructs a Chinese dependency graph bank.

The main content is organized in four steps. First, the thesis reviews the development of syntactic theories and of methods for representing syntactic structure. The review shows that phenomena exceeding tree structure arise frequently in both syntactic analysis and argument analysis, which is an important reason for introducing graph structures. A statistical analysis of the Chinese data from the CoNLL2009 evaluation then shows that not all semantic structures can be derived from the tree structure, so a new graph-based scheme for integrated syntactic-semantic annotation of Chinese sentences is needed. Second, building on this theoretical preparation, a tag set and detailed annotation guidelines for dependency graph annotation are developed step by step through actual annotation and repeated verification and revision; this is also one of the contributions of the study. Third, the tag set and guidelines from the second step are used to annotate part of the CoNLL2009 Chinese data as dependency graphs. A total of 1230 sentences are annotated, and the problems encountered during annotation are recorded. Fourth, the annotation results are counted and analyzed: of the 1230 annotated sentences, 795 form graph structures, which is 64.6% of the corpus. This part mainly analyzes the special linguistic phenomena that give rise to graph structures in the annotation, such as pivotal sentences, serial verb sentences, and bivalent nouns. The handling of these special sentences is exactly where dependency graphs have an advantage over dependency trees, and it is the key to building the dependency graph bank.

The contributions of the thesis are, first, the proposal to represent the results of Chinese syntactic-semantic analysis with graph structures; second, a new tag set and annotation guidelines for integrated syntactic-semantic annotation of Chinese; and third, the combination of dependency grammar with frame semantics in the analysis of Chinese. The study finds that Chinese contains a certain number of sentences whose syntactic and semantic relations can be fully revealed only by a graph structure, and that such sentences are often the key factor limiting the accuracy of Chinese parsing. The annotation practice and the statistical results also indicate that graph structures have a clear advantage over tree structures in revealing the syntactic and semantic relations of sentences.
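To make the tree-versus-graph distinction in the abstract concrete, here is a minimal Python sketch. It is not the thesis's annotation scheme: the edge representation, the function names (`is_tree`, `reentrant_tokens`, `graph_share`), and the relation labels (`root`, `subj`, `obj`, `comp`) are all assumptions made for illustration, since the abstract does not list the actual tag set. The example sentence is a pivotal sentence, in which one word serves both as the object of the first verb and as the subject of the second, so it needs two heads.

```python
from collections import defaultdict

# An edge is (head, dependent, label). Token indices start at 1;
# index 0 is an artificial root. Labels are hypothetical, not the thesis tag set.

def heads_per_token(edges):
    """Map each dependent to the set of its heads."""
    heads = defaultdict(set)
    for head, dep, _label in edges:
        heads[dep].add(head)
    return heads

def reentrant_tokens(edges):
    """Tokens with more than one head; any such token rules out a tree."""
    return sorted(dep for dep, hs in heads_per_token(edges).items() if len(hs) > 1)

def is_tree(n_tokens, edges):
    """True if every token 1..n has exactly one head and reaches root 0 without a cycle."""
    heads = heads_per_token(edges)
    if any(len(heads.get(i, ())) != 1 for i in range(1, n_tokens + 1)):
        return False
    parent = {i: next(iter(heads[i])) for i in range(1, n_tokens + 1)}
    for start in parent:
        seen, node = set(), start
        while node != 0:
            if node in seen or node not in parent:
                return False  # cycle, or a head outside the sentence
            seen.add(node)
            node = parent[node]
    return True

def graph_share(sentences):
    """Fraction of (n_tokens, edges) pairs that are not trees."""
    non_trees = sum(1 for n, edges in sentences if not is_tree(n, edges))
    return non_trees / len(sentences)

# Pivotal sentence: 我(1) 请(2) 他(3) 来(4), "I invite him to come".
tree_edges = [(0, 2, "root"), (2, 1, "subj"), (2, 3, "obj"), (2, 4, "comp")]
# Graph analysis: 他 is also the subject of 来, so token 3 gets a second head.
graph_edges = tree_edges + [(4, 3, "subj")]

if __name__ == "__main__":
    print(is_tree(4, tree_edges), is_tree(4, graph_edges))
    print(reentrant_tokens(graph_edges))
    # The share reported in the abstract: 795 of 1230 sentences.
    print(round(100 * 795 / 1230, 1))
```

Under this representation, the tree analysis has to drop the relation between 他 and 来, while the graph analysis keeps both relations. The last line simply rechecks the arithmetic behind the 64.6% figure stated in the abstract.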
[Degree-granting institution]: Nanjing Normal University
[Degree level]: Master's
[Year degree awarded]: 2015
[Classification number]: H146
Article ID: 1502001
Article link: http://sikaile.net/wenyilunwen/yuyanxuelw/1502001.html