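Research and Implementation of Multi-Script Recognition Technology in Printed Mongolian Documents
Published: 2018-03-06 03:23
Topic: Mongolian; Focus: document images; Source: Inner Mongolia University, 2017 master's thesis; Paper type: degree thesis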
[Abstract]: Many optical character recognition (OCR) systems can currently recognize a single script. With increasing globalization, however, documents often contain several different scripts. Some existing Mongolian documents contain not only Mongolian but also a certain amount of Chinese and English, so a multi-script recognition system is needed. The multi-script recognition technique proposed in this thesis consists of two processes: document image preprocessing and script recognition. Preprocessing proceeds as follows: first, text regions are separated from image regions and the text regions are extracted; next, the text regions are divided into paragraphs; then, vertical projection and Gaussian smoothing are used for column segmentation to obtain text columns; finally, connected-component analysis is used for character segmentation. During preprocessing, the coordinates of each character image in the original document image are recorded so that the layout can be restored later. The proposed Mongolian-Chinese-English recognition technique has two stages: coarse classification and fine classification. In the coarse stage, character images are classified by features such as width and height into a Mongolian class, a Chinese class, and an English class. Besides Chinese, the Chinese class still contains some English and Mongolian, and besides English, the English class still contains some Chinese and Mongolian, so further classification is required. In the fine stage, based on the coarse results, the Chinese class, the English class, and the punctuation/English/digit class are each classified further with a convolutional neural network (CNN). On the experimental data set, column segmentation in the preprocessing stage reaches an accuracy of 99.13% and character segmentation reaches 97.87%. In the fine stage, the proposed method achieves an average recognition accuracy of 99.41% on the Chinese class, 98.86% on the English class, and 98.34% on the punctuation/English/digit class.
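The column segmentation step named in the abstract (vertical projection followed by Gaussian smoothing) can be sketched as below. This is only an illustrative sketch, not the thesis's implementation: the function names, the `sigma` and `threshold` values, and the tiny demo image are all assumptions, since the abstract gives no parameters. The idea is that traditional Mongolian text runs in vertical columns, so summing foreground pixels along each pixel column gives a profile whose valleys mark the gaps between text columns; smoothing the profile keeps narrow gaps inside a column from splitting it.

```python
import math


def gaussian_kernel(sigma):
    """Return a normalized 1-D Gaussian kernel with radius about 3 * sigma."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(profile, sigma):
    """Convolve a 1-D profile with a Gaussian kernel (zero padding at borders)."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    n = len(profile)
    smoothed = []
    for x in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = x + k - radius
            if 0 <= j < n:
                acc += w * profile[j]
        smoothed.append(acc)
    return smoothed


def vertical_projection(binary):
    """Count foreground pixels (value 1) in every pixel column of the image."""
    if not binary:
        return []
    width = len(binary[0])
    return [sum(row[x] for row in binary) for x in range(width)]


def split_columns(binary, sigma=1.0, threshold=0.5):
    """Return (start, end) x-ranges, end exclusive, of detected text columns.

    sigma and threshold are hypothetical values chosen for the demo image.
    """
    profile = smooth(vertical_projection(binary), sigma)
    columns = []
    start = None
    for x, value in enumerate(profile):
        if value > threshold and start is None:
            start = x                      # profile rises: a column begins
        elif value <= threshold and start is not None:
            columns.append((start, x))     # profile falls: the column ends
            start = None
    if start is not None:
        columns.append((start, len(profile)))
    return columns


def make_demo_image():
    """Build a 6 x 20 binary image with two text columns.

    The first column has a one-pixel gap at x = 4 to show the effect of smoothing.
    """
    height, width = 6, 20
    image = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in (2, 3, 5, 6, 12, 13, 14, 15, 16):
            image[y][x] = 1
    return image


if __name__ == "__main__":
    print(split_columns(make_demo_image()))
```

In the demo image the raw projection is zero at x = 4, so a plain threshold on the unsmoothed profile would cut the first column in two; after smoothing, the profile stays above the threshold there and the column is kept whole. The detected ranges are slightly wider than the ink because smoothing spreads the profile, which a real system would likely tighten afterwards.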
[Degree-granting institution]: Inner Mongolia University
[Degree level]: Master's
[Year degree granted]: 2017
[Classification number]: TP391.4
Article ID: 1573133
Article link: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/1573133.html