Research on Audio Scene Classification Methods Based on an Ensemble of Multiple Deep Models
Published: 2018-03-14 12:15
Topic: audio scene classification  Focus: deep learning  Source: Harbin Institute of Technology, 2017 master's thesis  Document type: degree thesis
【Abstract】: Acoustic Scene Classification (ASC) is a specific task within Computational Auditory Scene Analysis (CASA): from the acoustic content of an audio stream, it identifies the semantic label of the corresponding scene, and thereby perceives and understands the surrounding environment. Unlike psychological studies, which aim to understand how humans perceive audio scenes, audio scene recognition relies mainly on signal processing and machine learning to recognize scenes automatically. Traditional ASC work concentrates on feature extraction and classifier selection for individual scenes. With the rapid development of audio capture devices, large volumes of diverse audio data are now being collected; traditional signal processing and recognition methods face serious challenges, and new techniques are urgently needed.

To make full use of this wealth of audio scene data, this thesis experiments with several deep learning methods, including the Multi-Layer Perceptron (MLP), the Convolutional Neural Network (CNN), and the Long Short-Term Memory network (LSTM). First, frame-level features are extracted, namely Mel-Frequency Cepstral Coefficients (MFCC) and the log-Mel spectrogram; the frames are then concatenated into segment-level features and fed to the deep models for classification. To improve the LSTM-based ASC system, the thesis proposes a segment-processing technique based on shuffled bootstrap sampling, which not only simulates complex temporal combinations but also enlarges the training set, giving the model stronger generalization. To improve the MLP-based ASC method, an attention mechanism is introduced into the model structure. Attention breaks through the limitations of a single global representation and focuses on the key parts of the data; it also handles decoupling well, describing different scenes with different feature spaces.

Different deep learning methods recognize different scenes with different skill: the MLP identifies beaches and residential areas well, while the CNN more easily distinguishes libraries and buses. Ensemble learning, which combines multiple learners, often achieves markedly better generalization than any single learner. To combine the classifiers' complementary strengths across scenes, the thesis therefore applies a variety of ensemble fusion methods; in particular, an ensemble selection method within the bagging (Bootstrap AGGregatING) framework clearly improves ASC classification performance.
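The pipeline described above concatenates frame-level features (MFCC or log-Mel frames) into fixed-length segments before feeding them to a deep model. A minimal numpy sketch of this segment stacking, assuming a precomputed frame-level feature matrix (the segment length and hop below are illustrative choices, not the thesis's settings):

```python
import numpy as np

def frames_to_segments(feats, seg_len=10, hop=5):
    """Stack consecutive frame-level feature vectors into fixed-length
    segments (seg_len frames each, advancing hop frames between segments).

    feats: (n_frames, n_dims) array of e.g. log-Mel or MFCC frames.
    Returns: (n_segments, seg_len, n_dims) array.
    """
    n_frames = feats.shape[0]
    starts = range(0, n_frames - seg_len + 1, hop)
    return np.stack([feats[s:s + seg_len] for s in starts])

# e.g. 100 frames of 40-dimensional log-Mel features
feats = np.random.randn(100, 40)
segs = frames_to_segments(feats)
print(segs.shape)  # (19, 10, 40)
```

Each segment is then a small time-frequency patch, suitable as a 2-D input to a CNN or as a short sequence for an LSTM.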
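The abstract names the LSTM segment-processing step "shuffled bootstrap sampling" but does not spell out the procedure. The sketch below is one plausible reading, in which segments are drawn with replacement and placed in random order to form new training sequences; the function and parameter names are assumptions, not the thesis's own:

```python
import numpy as np

def shuffled_bootstrap_sequences(segments, n_seq, seq_len, seed=None):
    """Draw segments with replacement, in random order, to build new
    training sequences: each output sequence is a fresh temporal
    combination of existing segments, which both simulates complex
    temporal orderings and enlarges the training set.

    segments: (n_segments, seg_len, n_dims) array.
    Returns: (n_seq, seq_len, seg_len, n_dims) array.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(segments), size=(n_seq, seq_len), replace=True)
    return segments[idx]

# 19 segments of 10 frames x 40 dims -> 50 resampled 4-segment sequences
segments = np.random.randn(19, 10, 40)
batch = shuffled_bootstrap_sequences(segments, n_seq=50, seq_len=4, seed=0)
print(batch.shape)  # (50, 4, 10, 40)
```

Because sampling is with replacement, the generated set can be made much larger than the original pool of segments, which is the data-augmentation effect the abstract claims.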
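The ensemble-selection step within the bagging framework is likewise only named, not detailed. A common greedy variant (Caruana-style forward selection on held-out posteriors) is sketched below purely as an illustration of the idea, not as the thesis's exact algorithm:

```python
import numpy as np

def ensemble_select(val_probs, val_labels, max_models=5):
    """Greedy ensemble selection: repeatedly add (with replacement) the
    model whose inclusion most improves validation accuracy of the
    averaged class posteriors.

    val_probs: list of (n_samples, n_classes) posterior arrays, one per model.
    val_labels: (n_samples,) integer labels.
    Returns: list of chosen model indices (may repeat).
    """
    chosen = []
    pool = list(range(len(val_probs)))
    for _ in range(max_models):
        best, best_acc = None, -1.0
        for m in pool:
            avg = np.mean([val_probs[i] for i in chosen + [m]], axis=0)
            acc = np.mean(avg.argmax(axis=1) == val_labels)
            if acc > best_acc:
                best, best_acc = m, acc
        chosen.append(best)
    return chosen
```

At test time, the selected models' posteriors are averaged in the same way; because selection is driven by held-out performance, classifiers that excel on different scenes (e.g. an MLP strong on beaches, a CNN strong on buses) can complement one another.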
【Degree-granting institution】: Harbin Institute of Technology
【Degree level】: Master's
【Year awarded】: 2017
【CLC number】: TP18; TN912.3
Article ID: 1611162
Link: http://sikaile.net/kejilunwen/xinxigongchenglunwen/1611162.html