Applications of Deep Learning to Automatic Music Tagging
Topic: deep learning · Focus: music · Source: Beijing Jiaotong University, 2017 master's thesis · Type: degree thesis
[Abstract]: In music tagging, traditional annotation models follow a fixed recipe: starting from a set of annotated songs, each represented by an audio feature vector, a separate model is learned for each tag and then used for prediction. This approach carries substantial redundancy; at the same time, the emergence of large-scale datasets has opened up new directions for model design. This thesis therefore turns to deep learning, which has risen to prominence in recent years, and combines it with large-scale training data to explore simpler and more accurate tagging methods.

Specifically, on the Magnatagatune dataset, a standard benchmark for automatic music tagging research, we design three convolutional neural network architectures matched to different input features (mel spectrogram, linear spectrogram, mel-frequency cepstral coefficients, and raw audio) and compare them on the same dataset. We find that mel spectrograms and raw audio have a clear advantage over linear spectrograms and MFCCs for automatic audio tagging. We then design a visualization pipeline to discover, for the convolutional kernels in each layer of a trained model, which inputs elicit the strongest responses, and we visualize those responses. We also train networks of varying depth on the tagged (last.fm) subset of the larger Million Song Dataset (MSD) and find that, on the larger dataset, deeper models clearly outperform shallower ones, in line with recent results in computer vision. Likewise, by comparing the same model across datasets, we can see clearly how much dataset size affects the performance of models of different depths.

The main contributions of this thesis are: (1) Deep learning models of several architectures for automatic music tagging, compared on the Magnatagatune dataset with different low- and mid-level audio features as input; the mel-spectrogram and raw-audio models clearly outperform the spectrogram and MFCC models, and our raw-audio model achieves a higher AUC (Area Under Curve) on this dataset than prior work. (2) A comparison of models of different depths on the larger MSD, showing that deeper models have a clear advantage on larger datasets and highlighting how strongly dataset size influences the practical performance and potential of deep learning models. (3) Visualization of the trained models, showing that in the mel-spectrogram model the frequency responses of kernels in the higher convolutional layers match, to some extent, the distribution of pitch responses in the human auditory system.
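The two computational ingredients the abstract leans on most heavily — the mel-spectrogram input representation and the AUC evaluation metric — can be sketched in a few dozen lines of numpy. This is an illustrative sketch, not code from the thesis: all parameter values (sample rate, FFT size, number of mel bands) are assumptions chosen for the example.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale: perceptually motivated warping of frequency
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular mel filters mapping |FFT| bins to mel bands."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def roc_auc(labels, scores):
    """AUC via the rank (Mann-Whitney) statistic: the probability that a
    random positive example is scored above a random negative one."""
    labels, scores = np.asarray(labels), np.asarray(scores)
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum() \
         + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

# Toy demo: log-mel spectrogram of a 1-second 440 Hz tone.
sr, n_fft, hop, n_mels = 16000, 512, 256, 40
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440.0 * t)
frames = np.lib.stride_tricks.sliding_window_view(x, n_fft)[::hop]
spec = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1))  # |STFT|
mel_spec = np.log1p(spec @ mel_filterbank(sr, n_fft, n_mels).T)
print(mel_spec.shape)  # (frames, mel bands)
```

In the models the abstract describes, an array like `mel_spec` would be the input to the convolutional network, and per-tag prediction scores would be evaluated with AUC as above; the raw-audio models instead consume the waveform `x` directly.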
[Degree-granting institution]: Beijing Jiaotong University
[Degree level]: Master's
[Year conferred]: 2017
[CLC number]: TN912.3