Research on Speech-Driven Virtual Speakers (語音驅(qū)動虛擬說話人研究)
[Abstract]: Speech-driven virtual speaker technology generates facial animation for a virtual human directly from input speech. It not only improves the listener's comprehension of speech but also offers a realistic and friendly mode of human-computer interaction; as the technology matures, it will bring new interaction experiences and enrich daily life. This thesis studies speech-driven virtual speaker animation synthesis with two schemes and analyzes and compares them. The first scheme is speech-driven articulatory motion synthesis based on a deep neural network. The second is speech-driven virtual speaker animation synthesis based on MPEG-4. Both schemes require a suitable corpus from which the audio-visual data used in this research are constructed.

The first scheme: speech production is directly related to the movement of the articulators, such as the positions and motions of the lips, tongue, and soft palate. A deep neural network is used to learn the mapping between speech feature parameters and articulator position information; the system estimates articulator trajectories from the input speech and renders them on a 3D virtual human. First, the optimal network is obtained by comparing experimental results of a traditional Artificial Neural Network (ANN) and a Deep Neural Network (DNN) over a range of parameters. Second, the length of the speech-feature context window and the number of hidden-layer units are varied to determine the optimal context length. Finally, the optimal network structure is selected, and the articulator trajectories it outputs drive articulator motion synthesis, realizing virtual human animation.

The second scheme: the MPEG-4-based speech-driven virtual speaker animation synthesis method is data-driven. First, an audio-visual corpus is constructed from the LIPS2008 database. Then, a Back Propagation (BP) neural network is used to learn the mapping between speech feature parameters and the face model's Facial Animation Parameters (FAP). Finally, the predicted FAP sequence controls the virtual human face model to synthesize talking-head animation.

The animations synthesized by the two schemes are evaluated both subjectively and objectively; the results show that both schemes are effective and that the synthesized animation is natural and lifelike. Comparing the two schemes: the first requires a dedicated lip model, and although its accuracy is high, it is less general and its corpus is harder to obtain. The second conforms to the MPEG-4 standard and drives a virtual human face model with FAP sequences, making it more versatile and easier to deploy widely.
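The first scheme is, at its core, a regression from windowed acoustic features to articulator positions. The thesis does not publish code, so the following is only a minimal PyTorch sketch of that mapping, assuming MFCC input frames, a symmetric context window, and articulograph-style coordinates (lip, tongue, soft palate) as targets; the feature dimensions, window size, layer widths, and the stand-in random data are all illustrative assumptions, not values from the thesis.

```python
# Minimal sketch (not the thesis code): a feed-forward DNN mapping a context
# window of acoustic features to articulator coordinates per frame.
import numpy as np
import torch
import torch.nn as nn

N_MFCC = 13          # acoustic features per frame (assumption)
CONTEXT = 5          # frames on each side -> window of 2*CONTEXT+1 frames
N_ARTIC = 12         # articulator coordinates per frame (assumption)

def stack_context(feats: np.ndarray, context: int = CONTEXT) -> np.ndarray:
    """Concatenate each frame with its +/- context neighbours (edge-padded)."""
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.stack(
        [padded[t : t + 2 * context + 1].ravel() for t in range(len(feats))]
    )

# Deep feed-forward regressor: windowed MFCCs -> articulator positions.
dnn = nn.Sequential(
    nn.Linear(N_MFCC * (2 * CONTEXT + 1), 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, N_ARTIC),
)

# Toy training loop on random stand-in data; real training data would be
# parallel speech / articulatory recordings from the corpus.
mfcc = np.random.randn(200, N_MFCC).astype(np.float32)
artic = np.random.randn(200, N_ARTIC).astype(np.float32)
x = torch.from_numpy(stack_context(mfcc))
y = torch.from_numpy(artic)

opt = torch.optim.Adam(dnn.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
for epoch in range(10):
    opt.zero_grad()
    loss = loss_fn(dnn(x), y)
    loss.backward()
    opt.step()

# At synthesis time the predicted trajectory would drive the 3D lip/tongue model.
trajectory = dnn(x).detach().numpy()
print(trajectory.shape)  # (200, N_ARTIC)
```

Varying CONTEXT and the hidden-layer sizes in such a sketch corresponds to the context-length and hidden-unit experiments described for the first scheme.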
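For the second scheme, the predicted FAP sequence drives an MPEG-4 face model. As a rough illustration of how FAP values, expressed in FAPU units derived from face proportions, displace facial feature points, the sketch below uses a hypothetical table of a few lip-related FAPs; the actual MPEG-4 standard defines 68 FAPs and several FAPU types, so this is a simplified stand-in rather than the thesis implementation.

```python
# Minimal sketch (assumptions only): applying one frame of MPEG-4-style FAP
# values to neutral feature points. Each FAP moves one feature point along
# one axis; the table and FAPU sizes below are illustrative placeholders.
from dataclasses import dataclass
import numpy as np

@dataclass
class FapSpec:
    point: int        # index of the controlled feature point
    axis: int         # 0 = x, 1 = y, 2 = z
    fapu: float       # model units per FAPU for this FAP (assumption)

# Hypothetical subset of lip-related FAPs -> (feature point, axis, FAPU size).
FAP_TABLE = {
    "open_jaw":            FapSpec(point=0, axis=1, fapu=0.001),
    "lower_t_midlip":      FapSpec(point=1, axis=1, fapu=0.001),
    "raise_b_midlip":      FapSpec(point=2, axis=1, fapu=0.001),
    "stretch_l_cornerlip": FapSpec(point=3, axis=0, fapu=0.0005),
    "stretch_r_cornerlip": FapSpec(point=4, axis=0, fapu=0.0005),
}

def apply_faps(neutral_points: np.ndarray, fap_frame: dict) -> np.ndarray:
    """Displace neutral feature points by one frame of FAP values (in FAPU)."""
    points = neutral_points.copy()
    for name, value in fap_frame.items():
        spec = FAP_TABLE[name]
        points[spec.point, spec.axis] += value * spec.fapu
    return points

# Example: one predicted FAP frame deforming five neutral feature points;
# in the full pipeline a BP network would output such a frame per time step.
neutral = np.zeros((5, 3))
frame = {"open_jaw": 120.0, "lower_t_midlip": -30.0, "raise_b_midlip": 45.0}
print(apply_faps(neutral, frame))
```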
[Degree-granting institution]: Southwest Jiaotong University (西南交通大學(xué))
[Degree level]: Master's
[Year conferred]: 2017
[Classification number]: TN912.3