

An Image Description Generation Model Fusing Scene and Object Prior Knowledge

Published: 2019-04-10 19:13
[Abstract]: Objective: Current image captioning methods built on deep convolutional neural networks (CNN) and long short-term memory (LSTM) networks generally use only object-category information as prior knowledge when extracting the CNN features of an image. They ignore the scene prior knowledge contained in the image, so the generated sentences lack an accurate description of the scene and are prone to misjudging, for example, the spatial relationships between objects. To address this problem, we design an image description generation model that fuses scene and object-category prior information (F-SOCPK). The model incorporates both priors and uses them jointly to generate the description sentence, improving sentence quality.
Methods: First, the parameters of the CNN-S model are trained on the large-scale scene-category dataset Places205 so that CNN-S encodes rich scene priors; these parameters are then carried over to CNNd-S by transfer learning and used to capture the scene information of the image to be described. In parallel, the parameters of the CNN-O model are trained on the large-scale object-category dataset ImageNet and transferred to CNNd-O to capture the object information in the image. The extracted scene and object information is fed into the language models LM-S and LM-O, respectively. The outputs of LM-S and LM-O are passed through a Softmax function to obtain a probability score for every word in the vocabulary. Finally, a weighted fusion computes the final score of each word; the word with the highest probability is emitted at the current time step, and the description sentence is generated step by step.
Results: Experiments were conducted on three public datasets: MSCOCO, Flickr30k, and Flickr8k. The proposed model outperforms the model that uses object-category information alone on several metrics, including BLEU (sentence coherence and precision), METEOR (word-level precision and recall), and CIDEr (semantic richness). On Flickr8k in particular, it improves CIDEr by 9% over the object-only Object-based model and by nearly 11% over the scene-only Scene-based model.
Conclusion: The proposed method is effective: it improves markedly on the baseline models and compares favorably with other mainstream methods. Its advantage is most pronounced on larger datasets such as MSCOCO, while on smaller datasets such as Flickr8k its performance still has room for improvement. In future work we will incorporate more visual prior information into the model, such as action categories and object-object relations, to further improve caption quality, and will combine additional vision techniques, such as deeper CNN models, object detection, and scene understanding, to further improve sentence accuracy.
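To make the fusion step concrete, the sketch below (Python/NumPy) illustrates the decoding rule described above: at each time step the vocabulary scores of LM-S and LM-O are converted to probabilities with a Softmax, combined by a weighted sum, and the word with the highest fused probability is emitted. The function names and the fusion weight alpha are illustrative assumptions, not taken from the paper.

import numpy as np

def softmax(z):
    # Numerically stable softmax over a vocabulary-sized score vector.
    z = z - np.max(z)
    e = np.exp(z)
    return e / np.sum(e)

def fuse_step(scores_lm_s, scores_lm_o, alpha=0.5):
    # One decoding step: turn each stream's vocabulary scores into
    # probabilities, fuse them with a weighted sum, and emit the
    # highest-probability word (greedy choice at this time step).
    # alpha is a hypothetical fusion weight; the abstract does not
    # specify how the weights are chosen.
    p_s = softmax(scores_lm_s)              # word probabilities from LM-S
    p_o = softmax(scores_lm_o)              # word probabilities from LM-O
    p = alpha * p_s + (1.0 - alpha) * p_o   # weighted fusion
    return int(np.argmax(p)), p

# Toy usage with a 5-word vocabulary:
rng = np.random.default_rng(0)
word_id, dist = fuse_step(rng.normal(size=5), rng.normal(size=5), alpha=0.6)
print(word_id, dist.round(3))

Repeating this step until an end-of-sentence token is produced yields the full caption; the two streams thus vote on every word rather than being merged at the feature level.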
[Author affiliations]: School of Mathematics and Physics, Jinggangshan University; Key Laboratory of Watershed Ecology and Geographical Environment Monitoring, National Administration of Surveying, Mapping and Geoinformation, Jinggangshan University; Department of Computer Science and Technology, Tongji University; School of Electronics and Information Engineering, Jinggangshan University
[Funding]: Fund of the Key Laboratory of Watershed Ecology and Geographical Environment Monitoring, National Administration of Surveying, Mapping and Geoinformation (WE2016015); Science and Technology Research Project of the Jiangxi Provincial Department of Education (GJJ160750, GJJ150788); Jinggangshan University Research Fund (JZ14012)
[CLC number]: TP391.41


Article No.: 2456050


Link to this article: http://sikaile.net/kejilunwen/ruanjiangongchenglunwen/2456050.html

