Research on Several Key Technologies of Cross-Corpus Speech Emotion Recognition
Keywords of this thesis: research on key technologies of cross-corpus speech emotion recognition. Source: Southeast University, 2016 doctoral dissertation. Document type: degree thesis.
More related topics: speech emotion recognition; cross-corpus; Student's t-distribution; spectrogram features; selective attention mechanism; deep belief network; feature adaptation
【Abstract】: Speech Emotion Recognition (SER) is a popular research topic in affective computing, pattern recognition, signal processing, and human-computer interaction. Its main aim is to classify speech signals by emotion, e.g. "anger", "fear", "disgust", "happiness". Over the past few years many effective methods have been proposed for SER, but most studies have been conducted on a single speech database. In many practical applications, however, the training corpus and the test corpus differ greatly: for example, the training and test databases may come from two (or more) different languages, speakers, cultures, distributions, or data scales. This gives rise to an important research problem: cross-corpus speech emotion recognition. Since SER research involves feature extraction, feature selection, classifier improvement, feature fusion, and other technical components, this dissertation studies the key technologies of cross-corpus SER accordingly. The main contributions are as follows:

1. For feature selection and classification across corpora, a Student's t-distribution mixture model with an infinite number of components (iSMM) is proposed, which can directly and effectively recognize various emotional speech samples. Compared with the conventional Gaussian mixture model (GMM), the t-mixture emotion model effectively handles outliers in the sample feature space. First, the t-mixture model remains robust to the atypical emotional data used for testing. Second, to address the high data complexity and shortage of training samples caused by high-dimensional feature spaces, a global latent space is added to the emotion model; this allows the number of components partitioning the sample space to be infinite, yielding the iSMM emotion model. The model automatically determines the optimal number of components while keeping complexity low, and then classifies multiple kinds of emotional feature data. To verify recognition across different emotional feature distributions, simulations are run on three widely used emotional speech databases with high-dimensional feature samples and different spatial distributions: the acted corpora DES and EMO-DB and the spontaneous corpus FAU. Under these conditions, each model's handling of feature outliers and high-dimensional data, as well as its generalization, is assessed. The results show that iSMM maintains more stable recognition performance than the comparison models, indicating that the proposed infinite-t emotion model is robust when processing speech data from different sources and has good selection and recognition ability for high-dimensional emotional features containing outliers.

2. Combining K-nearest neighbors, kernel learning, the feature-line centroid method, and LDA, an LDA+kernel-KNNFLC method for emotion recognition is proposed. To curb the heavy computation caused by an excessive number of prior sample features, a centroid criterion is used to learn sample distances, improving the kernel K-nearest-neighbor method; LDA is then applied to optimize the emotional feature vectors, avoiding dimensional redundancy while better stabilizing inter-class emotion recognition. For the cross-corpus setting, attention is paid to the recognition-performance differences caused by overly tight boundary fitting between classes within individual databases; by relearning the feature space, the proposed classification method improves the inter-class separability of emotional feature vectors and is suited to classifying emotional features from different corpora. Simulation experiments on two speech emotion databases with high-dimensional global statistical features, comparing dimensionality-reduction schemes, emotion classifiers, and dimension parameters, show that LDA+kernel-KNNFLC significantly improves recognition performance under identical conditions and classifies emotion categories relatively stably.

3. To improve (extend) the categories of emotional features under cross-corpus conditions, a spectrogram feature extraction method based on an auditory attention model is proposed. The model imitates the characteristics of human hearing and effectively detects changing emotional features on the spectrogram; it is further improved with time-frequency atoms, exploiting their advantage in matching the frequency characteristics of the signal to extract emotional information in the time domain. In SER, noisy environments, speaking styles, speaker traits, and similar factors cause mismatched feature-space distributions; phonetic analysis shows this problem mostly arises in cross-corpus emotion recognition tasks, where the mismatch between the trained acoustic model and the test utterances sharply degrades recognition performance. Spectrogram features effectively complement existing emotional features from an image perspective, and the auditory attention mechanism lets the model extract salient features across speech databases, improving the system's emotion discrimination. In the simulations, features extracted by the proposed method from cross-corpus emotional samples are classified with typical classifiers; compared with internationally standard baseline methods, the spectrogram emotional features improve recognition by about 9 percentage points, confirming better robustness across databases.

4. Using deep belief models from deep learning, a feature-level fusion method based on deep belief networks (DBN) is proposed, treating the emotional information implicit in speech spectrograms as image features and fusing them with conventional acoustic emotional features. This addresses the technical difficulty of combining emotional features extracted at different scales in cross-corpus SER. The STB/Itti model is used to analyze the spectrogram and extract features from the three perspectives of color, brightness, and orientation; an improved DBN then fuses the conventional acoustic features with the spectrogram features at the feature level, enlarging the scale of the feature subset and strengthening emotion representation. Experiments on the ABC database and several Chinese databases show that the fused feature subset clearly improves cross-corpus recognition performance over conventional speech emotion features.

5. The feature-adaptation problem of SER models caused by the use of different languages and large numbers of non-specific speakers under cross-corpus conditions is studied. Building on the cross-corpus SER material of the earlier chapters, adaptation is investigated for feature-parameter distortion, spectrogram feature construction, modeling-algorithm comparison, and online optimization, with comparative analysis of experimental performance. Existing adaptation methods for SER are first discussed. For the cross-corpus case, adaptation to additive speaker feature distortion is then examined and a model scheme is given. Next, to study how the multi-speaker adaptation problem affects the SER system, the process is modeled, comparing the Gaussian mixture model and the Student's t-distribution model as two statistical approaches; the respective adaptation schemes are then used to obtain feature-function sets that include spectrogram features, and some online data are used to optimize the feature functions quickly. Finally, the adaptation schemes are validated on databases in four languages (German, English, Chinese, and Vietnamese). The results show that the improved adaptation scheme adapts well to speaker features, showing good model-parameter transfer especially when handling many unknown speakers. In addition, the influence of different languages on emotional characteristics across corpora is analyzed and discussed experimentally from the feature-adaptation perspective.
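The robustness argument for the t-mixture over the GMM rests on the t-distribution's heavy tails. As a minimal illustrative sketch (synthetic data and plain single-component SciPy fits, not the thesis's iSMM with its infinite component count and global latent space), one can compare how a Gaussian and a Student's t fit a sample containing one gross outlier:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Inlier feature values plus one gross outlier, as might arise from
# an atypical test utterance in a cross-corpus setting.
data = np.append(rng.normal(0.0, 1.0, 200), 15.0)

# Fit a Gaussian and a Student's t (degrees of freedom estimated) to
# the same one-dimensional sample.
mu_g, sd_g = stats.norm.fit(data)
df_t, mu_t, sd_t = stats.t.fit(data)

# The Gaussian inflates its scale to cover the outlier; the
# heavy-tailed t keeps a scale close to that of the inliers.
print(round(sd_g, 2), round(sd_t, 2))
```

Because the t density assigns non-negligible probability to extreme values, its estimated scale stays near the inlier spread while the Gaussian's variance is pulled up by the single outlier; the same mechanism is what keeps a t-mixture emotion model stable on atypical cross-corpus test samples.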
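The LDA stage of LDA+kernel-KNNFLC can be sketched with standard tools. The fragment below is a simplified stand-in on synthetic data (scikit-learn's plain LDA followed by Euclidean KNN; the thesis's kernel mapping and feature-line-centroid distance learning are omitted):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for high-dimensional utterance-level statistics
# with four emotion classes (hypothetical dimensions, not the
# thesis's actual feature set).
X, y = make_classification(n_samples=600, n_features=100,
                           n_informative=20, n_classes=4,
                           n_clusters_per_class=1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)

# LDA projects to at most n_classes - 1 dimensions, sharpening
# between-class separation before the distance-based KNN vote.
clf = make_pipeline(LinearDiscriminantAnalysis(n_components=3),
                    KNeighborsClassifier(n_neighbors=5))
clf.fit(X_tr, y_tr)
print(round(clf.score(X_te, y_te), 2))
```

The design point the sketch shows is the ordering: reducing to a discriminative subspace first both removes redundant dimensions and makes the subsequent nearest-neighbor distances reflect class structure rather than raw feature scale.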
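Spectrogram-based features start from a time-frequency image of the utterance. A minimal front-end sketch with SciPy (a synthetic chirp stands in for speech; the auditory attention/saliency stage and the time-frequency atoms are not modeled here):

```python
import numpy as np
from scipy.signal import spectrogram

fs = 16000  # a common speech sampling rate
t = np.arange(0, 1.0, 1 / fs)
# Synthetic chirp standing in for a voiced segment with rising pitch.
sig = np.sin(2 * np.pi * (200 * t + 300 * t ** 2))

# 25 ms windows with a 10 ms hop, typical for speech front-ends.
f, frames, S = spectrogram(sig, fs=fs, nperseg=400, noverlap=240)
log_S = np.log(S + 1e-10)  # log-compressed spectrogram "image"

# Crude per-band statistics as a placeholder for the image-style
# descriptors that a saliency model would extract from log_S.
band_energy = log_S.mean(axis=1)
print(log_S.shape, band_energy.shape)
```

Everything downstream in the thesis's method operates on such a two-dimensional array: saliency detection treats `log_S` as an image, so the quality of this front end (window length, hop, compression) directly shapes the extracted features.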
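Feature-layer fusion can be sketched as concatenating the two feature views and training one restricted Boltzmann machine, i.e. a single layer of a DBN (scikit-learn's `BernoulliRBM` on random stand-in features; the thesis's improved multi-layer DBN and its STB/Itti spectrogram descriptors are not reproduced here):

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
n = 300
acoustic = rng.normal(size=(n, 40))  # stand-in acoustic statistics
spectro = rng.normal(size=(n, 24))   # stand-in spectrogram descriptors

# Feature-level fusion: concatenate the two views, then let an RBM
# (one DBN layer) learn a joint hidden representation over both.
fused = np.hstack([acoustic, spectro])
fused01 = MinMaxScaler().fit_transform(fused)  # RBM expects [0, 1]

rbm = BernoulliRBM(n_components=32, learning_rate=0.05,
                   n_iter=10, random_state=0)
hidden = rbm.fit_transform(fused01)  # P(h=1 | v) per sample
print(fused.shape, hidden.shape)
```

The point of fusing before the RBM, rather than classifying each view separately, is that the hidden units can capture correlations between acoustic and spectrogram-image features, which is what enlarges the representational scale of the fused subset.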
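The additive-distortion view of speaker mismatch admits a very simple baseline compensation: per-speaker mean normalization, a standard trick that removes an additive bias. The sketch below uses synthetic features; the thesis's GMM and Student's-t adaptation schemes are more elaborate than this:

```python
import numpy as np

rng = np.random.default_rng(1)
# Feature frames for two "speakers": the same underlying pattern,
# but speaker B carries an additive channel/speaker bias.
base = rng.normal(0.0, 1.0, size=(500, 13))
speaker_a = base[:250]
speaker_b = base[250:] + 2.5  # additive distortion

# Per-speaker mean normalization removes the additive term, aligning
# the two feature distributions before emotion modeling.
norm_a = speaker_a - speaker_a.mean(axis=0)
norm_b = speaker_b - speaker_b.mean(axis=0)

gap_before = abs(speaker_a.mean() - speaker_b.mean())
gap_after = abs(norm_a.mean() - norm_b.mean())
print(round(gap_before, 2), round(gap_after, 10))
```

Normalization of this kind only cancels additive terms; handling the heavier-tailed, multi-modal mismatch of many unknown speakers is what motivates the statistical (GMM vs. Student's t) adaptation comparison described above.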
【Degree-granting institution】: Southeast University
【Degree level】: Doctorate
【Year conferred】: 2016
【CLC number】: TN912.34