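Research and Implementation of Automatic Grammar Checking and Correction for English Essays
Topic: automatic grammar checking and correction + corpus; Source: Beijing University of Posts and Telecommunications, 2016 master's thesis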
[Abstract]: As global integration deepens, English, as the world's common language, is receiving ever wider attention from learners. Of the four basic skills of listening, speaking, reading and writing, writing is considered the most practical, the one that draws on the broadest range of knowledge, and the hardest to train. For learners of English as a second language (ESL), differences in culture and ways of thinking, together with the influence of the mother tongue, make grammatical errors one of the most common and most persistent problems in writing. Automatic grammar checking and correction applies natural language processing techniques combined with machine learning so that a computer can decide whether an English sentence contains grammatical errors and correct them.

This thesis proposes a corpus-based method for extracting rules automatically and, building on it, a corpus-based grammatical error checking and correction algorithm for English essays that uses a limited back-off strategy. First, a large amount of English text is collected with a web crawler. After preprocessing (text cleaning, sentence segmentation and part-of-speech tagging), the text is indexed to build a corpus that can be queried in real time. Next, the rule extraction method is applied to the corpus together with the training set to obtain rules that describe erroneous grammar. Finally, the candidate errors that are detected are corrected using the limited back-off strategy.

On the CoNLL 2013 grammatical error correction evaluation data, the method achieves an overall F1 of 0.3196, above the first-place score of 0.3120. For article errors the F1 is 0.3345, above the best 2013 result of 0.3340, and for noun errors the F1 is 0.4531, above the best 2013 result of 0.4435. These results indicate that the proposed method is effective for checking and correcting grammatical errors.

The main contributions of this thesis are as follows:
1. A method for automatically extracting grammar rules from a training set and a corpus is proposed, and 41278 rules are extracted from the CoNLL2013 training set. Hand-written grammar rules are time-consuming and laborious to produce, may be incomplete, and are not targeted at the errors that ESL users actually make; automatic rule extraction addresses these problems.
2. A search method based on mixed word and part-of-speech queries is proposed, and a corpus that can be queried in real time is built. It contains 16618045 sentences drawn from the New York Times, student essays from Pigai (批改網), and the CoNLL2013 training set. The corpus supports searches by word, phrase, part of speech, and mixtures of words and parts of speech, which underpins both the extraction of erroneous grammar rules and the subsequent automatic checking and correction.
3. A method that filters text against a knowledge base is proposed to reduce the rate at which fixed collocations are misjudged as errors, and a list of fixed collocations is compiled to support grammatical error checking and correction. Fixed collocations that follow language habit but not necessarily grammar are easily overlooked during automatic checking and correction, which lowers the precision of the system; filtering against the fixed collocation list reduces these false alarms.
4. A corpus-based grammatical error checking and correction algorithm for English essays with a limited back-off strategy is proposed. The algorithm ties the back-off process to the window size and so controls the back-off more finely, which noticeably improves the performance of the whole system.
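The abstract does not spell out how the mixed word and part-of-speech query or the window-based back-off work internally. The Python sketch below is only one possible reading, under stated assumptions: the toy corpus, the `POS:` query prefix, the function names (`item_matches`, `count_matches`, `build_query`, `suggest`) and the back-off order (shrink the window first, then replace context words by their tags) are all hypothetical and are not taken from the thesis.

```python
from typing import Dict, List, Sequence, Tuple

Token = Tuple[str, str]  # (lowercased word, POS tag)

# Toy tagged corpus (invented data; the thesis indexes about 16.6M sentences).
CORPUS: List[List[Token]] = [
    [("she", "PRP"), ("bought", "VBD"), ("a", "DT"), ("book", "NN")],
    [("he", "PRP"), ("read", "VBD"), ("a", "DT"), ("book", "NN")],
    [("they", "PRP"), ("sold", "VBD"), ("the", "DT"), ("car", "NN")],
    [("i", "PRP"), ("ate", "VBD"), ("an", "DT"), ("apple", "NN")],
    [("we", "PRP"), ("saw", "VBD"), ("an", "DT"), ("old", "JJ"), ("apple", "NN")],
]


def item_matches(token: Token, item: str) -> bool:
    """A query item is either 'POS:<tag>' (match by tag) or a plain word."""
    word, tag = token
    if item.startswith("POS:"):
        return tag == item[4:]
    return word == item.lower()


def count_matches(query: Sequence[str]) -> int:
    """Count contiguous occurrences of a mixed word/POS query in CORPUS."""
    n = len(query)
    total = 0
    for sent in CORPUS:
        for start in range(len(sent) - n + 1):
            if all(item_matches(sent[start + k], query[k]) for k in range(n)):
                total += 1
    return total


def build_query(tokens: Sequence[Token], idx: int, candidate: str,
                window: int, use_pos: bool) -> List[str]:
    """Put `candidate` at position idx, with up to `window` context items per side."""
    lo = max(0, idx - window)
    hi = min(len(tokens), idx + window + 1)
    query = []
    for i in range(lo, hi):
        word, tag = tokens[i]
        if i == idx:
            query.append(candidate)
        elif use_pos:
            query.append("POS:" + tag)
        else:
            query.append(word.lower())
    return query


def suggest(tokens: Sequence[Token], idx: int,
            candidates: Sequence[str] = ("a", "an", "the"),
            max_window: int = 2) -> str:
    """Bounded back-off: word context from the widest window down to 1,
    then POS context over the same windows; keep the original if nothing matches."""
    for use_pos in (False, True):
        for window in range(max_window, 0, -1):
            counts: Dict[str, int] = {
                c: count_matches(build_query(tokens, idx, c, window, use_pos))
                for c in candidates
            }
            best = max(counts, key=counts.get)
            if counts[best] > 0:
                return best
    return tokens[idx][0]


if __name__ == "__main__":
    sentence = [("she", "PRP"), ("ate", "VBD"), ("a", "DT"), ("apple", "NN")]
    print(suggest(sentence, 2))
```

Because the number of back-off steps is fixed by `max_window` and the two context types, the search always terminates after a small, known number of corpus queries. A real system would replace the linear scan in `count_matches` with a lookup in an inverted index.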
[Degree-granting institution]: Beijing University of Posts and Telecommunications
[Degree level]: Master's
[Year conferred]: 2016
[Classification number]: TP391.1
Article ID: 2015206
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2015206.html