An Audio-Visual Information Consistency Evaluation Method Based on Specific Pronunciation Units and Its Implementation
Published: 2018-11-20 14:37
【Abstract】: As China's population ages, the disbursement of social security benefits faces increasingly severe fraud, such as fraudulent claiming of payments, and the identity authentication of legitimate beneficiaries has become a pressing problem. Lip-synching at large concerts is frequently reported but rarely backed by hard evidence, so suspected lip-synching needs to be detectable. The animation industry, a low-carbon industry encouraged by the state, likewise lacks objective techniques for evaluating dubbing quality. Because genuine speech is produced by the human articulatory organs, the speech signal is strictly consistent with the corresponding lip movements. Starting from audio-visual consistency analysis, this thesis investigates the authenticity of speech samples used in voice-based identity authentication, improving the accuracy of authenticating social security beneficiaries and effectively preventing fraudulent claims; it also provides a technical basis for objective dubbing-quality evaluation and for lip-synch detection.

This thesis proposes an audio-visual consistency analysis method based on specific pronunciation units. The core algorithm is co-inertia analysis (CoIA), which correlates the speech and the lip movements in a video and evaluates their consistency. The method comprises a training stage and a test stage: in training, features are extracted from the audio and from the lip images of the video, and the pair of mapping matrices relating the two is computed; in testing, the features are projected through the mapping matrices, and the mean covariance of the projected values is taken as the correlation coefficient. A larger CoIA correlation coefficient indicates stronger audio-visual correlation, and a smaller equal error rate (EER) indicates better consistency-evaluation performance. Experiments show that CoIA performs well for audio-visual consistency analysis.
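To make the two stages concrete, here is a minimal sketch of CoIA training and scoring, assuming frame-aligned audio features (e.g., MFCCs) and lip-image features supplied as NumPy arrays; the feature choices and the number of retained axes `k` are illustrative assumptions rather than the thesis's exact configuration.

```python
import numpy as np

def coia_train(X, Y, k=10):
    """Training stage of co-inertia analysis (CoIA).

    X: (n_frames, p) audio features; Y: (n_frames, q) lip features,
    frame-aligned with X. Returns mapping matrices A (p x k) and
    B (q x k) whose paired axes maximise the covariance between the
    projected audio and lip features.
    """
    Xc = X - X.mean(axis=0)                 # centre both modalities
    Yc = Y - Y.mean(axis=0)
    C = Xc.T @ Yc / len(Xc)                 # p x q cross-covariance matrix
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    return U[:, :k], Vt[:k].T               # left/right singular vectors

def coia_score(X, Y, A, B):
    """Test stage: project features, average the per-axis correlation."""
    Xp = (X - X.mean(axis=0)) @ A
    Yp = (Y - Y.mean(axis=0)) @ B
    corrs = [np.corrcoef(Xp[:, i], Yp[:, i])[0, 1] for i in range(A.shape[1])]
    return float(np.mean(corrs))            # consistency score in [-1, 1]
```

A genuine recording should yield a noticeably higher score than a dubbed or lip-synched one, which is the basis of the consistency decision.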
The selection of specific pronunciation units, and audio-visual consistency analysis based on them, are the innovations of this thesis: specific pronunciation units extracted from a sentence stand in for the whole sentence in the consistency analysis. The thesis first analyses the viseme (mouth-shape) characteristics of Mandarin initials and finals, clusters the finals by similarity of their viseme mouth shapes so that finals sharing the same mouth-shape parameters fall into one class, and obtains 16 classes of finals. Next, the pronunciation-unit classes with the highest CoIA correlation coefficients are selected as the specific pronunciation units, and consistent and inconsistent audio-visual pairs are constructed experimentally to verify that the selected units are reasonable. Finally, whole sentences are compared against the specific pronunciation units extracted from them in an audio-visual consistency analysis, contrasting the syllable-cluster-based approach with whole-sentence analysis. The experimental database contains 350 sentences, each about 3 to 10 seconds long; from every sentence, seven groups of specific pronunciation units, each about 0.3 to 0.8 seconds long, are located and identified using short-time energy, zero-crossing rate, and fundamental frequency.
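The unit-extraction step rests on standard short-time measures. The sketch below is a hypothetical reconstruction, not the thesis's exact procedure: it computes per-frame energy and zero-crossing rate and keeps voiced stretches of 0.3 to 0.8 s as candidate units; the thresholds and window sizes are illustrative assumptions, and the fundamental-frequency check the thesis also uses is omitted for brevity.

```python
import numpy as np

def frame_features(x, sr, frame_ms=25, hop_ms=10):
    """Short-time energy and zero-crossing rate of a mono signal x."""
    n, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    frames = [x[i:i + n] for i in range(0, len(x) - n, hop)]
    energy = np.array([np.sum(f.astype(float) ** 2) for f in frames])
    zcr = np.array([np.mean(np.abs(np.diff(np.sign(f)))) / 2 for f in frames])
    return energy, zcr

def candidate_units(energy, zcr, hop_ms=10, min_ms=300, max_ms=800):
    """Group voiced frames (high energy, low ZCR) into 0.3-0.8 s units."""
    voiced = (energy > 0.1 * energy.max()) & (zcr < np.median(zcr))
    units, start = [], None
    for i, v in enumerate(np.append(voiced, False)):  # sentinel closes last run
        if v and start is None:
            start = i
        elif not v and start is not None:
            if min_ms <= (i - start) * hop_ms <= max_ms:
                units.append((start * hop_ms / 1000.0, i * hop_ms / 1000.0))
            start = None
    return units  # list of (t_start, t_end) in seconds
```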
A sample of specific pronunciation units is extracted from the 350 sentences in this way; their total duration is roughly three quarters shorter than that of the whole sentences, which reduces the amount of data to be processed. CoIA consistency analysis is then applied to the specific pronunciation units and to the whole sentences separately; the experimental results show that the equal error rate of consistency evaluation on specific pronunciation units is 2.7% lower than on whole sentences.
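The equal error rate used above is the operating point at which the false-accept and false-reject rates coincide. Below is a minimal sketch of computing it from CoIA scores, assuming higher scores mean more consistent, with `genuine` holding scores of matched pairs and `impostor` holding scores of deliberately mismatched (e.g., dubbed) pairs; both names are illustrative.

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    """Sweep thresholds over all scores; return the EER where FAR ~= FRR."""
    genuine, impostor = np.asarray(genuine), np.asarray(impostor)
    best_gap, eer = np.inf, 1.0
    for t in np.sort(np.concatenate([genuine, impostor])):
        frr = np.mean(genuine < t)      # matched pairs wrongly rejected
        far = np.mean(impostor >= t)    # mismatched pairs wrongly accepted
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer
```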
【Degree-granting institution】: South China University of Technology
【Degree level】: Master's
【Year conferred】: 2013
【CLC number】: TN912.3
Article ID: 2345176
Link: http://sikaile.net/wenyilunwen/dongmansheji/2345176.html