Research on Speech-Driven Virtual Talking Heads
Published: 2018-09-05 18:22
[Abstract]: Speech-driven talking-head technology generates facial animation for a virtual human from an input speech signal. It not only improves the listener's comprehension of the speech, but also provides a realistic, friendly mode of human-computer interaction; as the technology matures, it promises richer interactive experiences in daily life. This thesis studies speech-driven talking-head animation synthesis with two schemes and compares them. The first is speech-driven articulator-movement synthesis based on a deep neural network; the second is speech-driven talking-head animation synthesis based on MPEG-4. Both schemes require a suitable corpus, from which audio-visual data matched to the problem studied here are extracted. Scheme 1: speech production is directly tied to the movement of the vocal-tract articulators, such as the position and motion of the lips, tongue, and soft palate. A deep neural network learns the mapping between acoustic feature parameters and articulator positions; given input speech, the system estimates the articulator trajectories and renders them on a three-dimensional virtual head. First, a conventional artificial neural network (ANN) and a deep neural network (DNN) are compared under a range of parameter settings to select the better network. Second, acoustic features with different context-window lengths are tested while the number of hidden units is adjusted, yielding the best context length. Finally, with the optimal network structure selected, the articulator trajectories it outputs drive articulator-motion synthesis, producing the talking-head animation. Scheme 2: speech-driven talking-head animation synthesis based on MPEG-4 is a data-driven method.
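The core of the first scheme is a learned regression from context-stacked acoustic frames to articulator coordinates. The sketch below is a minimal illustration of that idea, not the thesis's actual network: the feature dimensions, hidden-layer size, and training details are placeholder assumptions (the thesis tunes these experimentally, including the context-window length).

```python
import numpy as np

def stack_context(features, context):
    """Stack +/- `context` neighbouring frames onto each frame
    (edge frames are repeated as padding), giving the network a
    longer acoustic context window."""
    T, _ = features.shape
    padded = np.vstack([np.repeat(features[:1], context, axis=0),
                        features,
                        np.repeat(features[-1:], context, axis=0)])
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

class ArticulatoryDNN:
    """Minimal one-hidden-layer MLP mapping a context-stacked acoustic
    frame to articulator coordinates (e.g. lip and tongue points)."""

    def __init__(self, in_dim, hidden, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.standard_normal((in_dim, hidden)) * 0.01
        self.b1 = np.zeros(hidden)
        self.W2 = rng.standard_normal((hidden, out_dim)) * 0.01
        self.b2 = np.zeros(out_dim)

    def forward(self, X):
        h = np.tanh(X @ self.W1 + self.b1)
        return h @ self.W2 + self.b2          # linear output: positions

    def train_step(self, X, Y, lr=0.05):
        """One gradient-descent step on mean-squared error; returns the loss."""
        h = np.tanh(X @ self.W1 + self.b1)
        pred = h @ self.W2 + self.b2
        loss = float(np.mean((pred - Y) ** 2))
        g = 2.0 * (pred - Y) / Y.size          # dLoss/dpred
        dh = (g @ self.W2.T) * (1.0 - h ** 2)
        self.W2 -= lr * (h.T @ g); self.b2 -= lr * g.sum(axis=0)
        self.W1 -= lr * (X.T @ dh); self.b1 -= lr * dh.sum(axis=0)
        return loss
```

With 13-dimensional frames and a context of ±4, each input becomes a 117-dimensional vector; the predicted trajectories would then be smoothed and bound to the 3D articulator model.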
First, an audio-visual corpus suited to this work is extracted from the LIPS2008 database. Then a back-propagation (BP) neural network learns the mapping between acoustic feature parameters and the virtual face's Facial Animation Parameters (FAP). Finally, the predicted FAP sequence drives the facial model to synthesize mouth animation. The animations produced by both schemes are evaluated subjectively and objectively; the results confirm the effectiveness of both, and the synthesized animation looks natural and lifelike. Comparing the two: the first scheme requires a matching lip model, and while its accuracy is higher, it generalizes poorly and its corpus is hard to obtain. The second scheme conforms to the MPEG-4 standard and drives the facial model with FAP sequences, so it is more general and easier to apply widely.
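In the MPEG-4 scheme, each animation frame is a vector of FAP values, interpreted as displacements of facial feature points measured in FAPUs (Facial Animation Parameter Units) derived from the neutral face, which is what makes the same FAP stream portable across face models. A minimal sketch of that final step follows; the point names, FAP ids, and FAPU sizes below are placeholders for illustration, not the normative MPEG-4 tables.

```python
import numpy as np

def apply_faps(rest_points, fap_values, fap_table, fapus):
    """Displace facial feature points by FAP values for one frame.

    rest_points: name -> 3D rest position of a feature point
    fap_values:  fap_id -> predicted value for the current frame
    fap_table:   fap_id -> (point_name, axis, fapu_name, direction)
    fapus:       FAPU name -> size computed from the neutral face,
                 so the FAP stream adapts to different head geometries
    """
    animated = {name: pos.copy() for name, pos in rest_points.items()}
    for fap_id, value in fap_values.items():
        point, axis, fapu_name, direction = fap_table[fap_id]
        animated[point][axis] += direction * value * fapus[fapu_name]
    return animated

# Illustrative setup: one lip feature point, one FAP lowering it along y.
rest = {"bottom_midlip": np.array([0.0, -8.0, 2.0])}
table = {3: ("bottom_midlip", 1, "MNS", -1)}   # jaw opening lowers the lip
units = {"MNS": 0.01}                          # mouth-nose separation FAPU

frame = apply_faps(rest, {3: 200.0}, table, units)
```

In the second scheme, the per-frame `fap_values` would come from the BP network's prediction; the renderer then moves the mesh vertices bound to each feature point.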
[Degree-granting institution]: Southwest Jiaotong University
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TN912.3
Article ID: 2225080
Link: http://sikaile.net/kejilunwen/xinxigongchenglunwen/2225080.html