Monocular Image Depth Estimation Based on Deep Learning
Published: 2019-06-08 12:08
[Abstract]: 3D scene parsing is an important research topic in computer vision, and depth estimation is an important means of understanding the 3D geometry of a scene. In many computer vision tasks, such as semantic segmentation, pose estimation, and object detection, incorporating relatively accurate and reliable depth information substantially improves performance compared with using RGB images alone. Traditional monocular depth estimation methods all rely on optical geometric constraints or environmental assumptions, such as structure from motion, focus, or illumination changes. Without such constraints or assumptions, however, building a computer vision system that accurately estimates depth from a single monocular image alone is an extremely challenging task. The task has two main difficulties. First, a typical computer vision system struggles to extract from a monocular image, as the human brain does, enough information to infer 3D structure. Second, the task is inherently ill-posed: a single 2D image corresponds to infinitely many real 3D scenes. This inherent ambiguity in mapping a single image to a depth map means that a vision model cannot estimate exact depth values from a single image alone.
To address these two difficulties, this thesis proposes the following methods. First, it proposes a computer vision model that unifies a convolutional neural network (CNN) and a conditional random field (CRF) within a single deep learning framework. The CNN extracts rich relevant features, and the CRF refines the CNN output according to the position and color information of pixels. Second, to address the ill-posedness, it proposes a vision model that fuses sparse known labels: using some relatively accurate depth values already obtained as references, the model greatly narrows the search range of plausible depth values at the other pixels, and thereby reduces, to some extent, the uncertainty of the mapping from RGB image to depth map.
In summary, this thesis presents recent research progress on depth estimation from monocular images, including the relevant datasets, methods, and their performance, and analyzes and discusses the open problems and future directions. It also proposes a computer vision model that learns depth feature representations from monocular images and, given the ill-posedness of the problem, a vision model that fuses sparse known labels to reduce the uncertainty of the mapping between monocular images and depth maps. The effectiveness and superiority of both models are verified on the NYU Depth v2 dataset.
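The two ideas in the abstract (refining a coarse depth prediction using pixel position and color, and anchoring the result on a few known depth values) can be illustrated with a small NumPy sketch. This is not the thesis's model: it replaces the learned CNN and CRF with a hand-written Jacobi-style update on a quadratic smoothness energy, and the function name refine_depth and the parameters sigma_pos, sigma_col, radius, unary_weight, and iters are all hypothetical choices made for illustration.

```python
import numpy as np


def refine_depth(depth, rgb, known_mask=None, known_depth=None,
                 sigma_pos=3.0, sigma_col=0.1, radius=2,
                 unary_weight=1.0, iters=5):
    """Refine a coarse depth map with position- and color-based affinities.

    depth:       (H, W) coarse depth prediction (stand-in for a CNN output).
    rgb:         (H, W, 3) image with values in [0, 1].
    known_mask:  optional (H, W) bool array marking pixels with known depth.
    known_depth: optional (H, W) array holding those known depth values.
    All parameter names are illustrative assumptions, not from the thesis.
    """
    h, w = depth.shape
    coarse = depth.astype(np.float64)
    out = coarse.copy()
    if known_mask is not None:
        out[known_mask] = known_depth[known_mask]

    # Color of each neighbor, read from an edge-padded copy of the image.
    rgb_pad = np.pad(rgb, ((radius, radius), (radius, radius), (0, 0)),
                     mode="edge")

    for _ in range(iters):
        out_pad = np.pad(out, radius, mode="edge")
        num = unary_weight * coarse          # pull toward the coarse prediction
        den = np.full((h, w), unary_weight)
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                if dy == 0 and dx == 0:
                    continue
                ys = slice(radius + dy, radius + dy + h)
                xs = slice(radius + dx, radius + dx + w)
                col_diff = np.sum((rgb - rgb_pad[ys, xs]) ** 2, axis=-1)
                # Pairwise affinity: nearby, similarly colored pixels
                # are encouraged to have similar depth.
                weight = np.exp(-(dy * dy + dx * dx) / (2 * sigma_pos ** 2)
                                - col_diff / (2 * sigma_col ** 2))
                num = num + weight * out_pad[ys, xs]
                den = den + weight
        out = num / den
        # Sparse known depths act as hard constraints on the solution.
        if known_mask is not None:
            out[known_mask] = known_depth[known_mask]
    return out
```

Each update is a weighted average of the coarse prediction and the neighboring depths, so a known depth value spreads to nearby pixels of similar color. That mirrors, in a very simplified form, how sparse reference depths can narrow the range of plausible values elsewhere in the image.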
[Degree-granting institution]: Harbin University of Science and Technology
[Degree level]: Master's
[Year degree awarded]: 2017
[Classification number]: TP391.41; TP18
Article ID: 2495273
Article link: http://sikaile.net/kejilunwen/zidonghuakongzhilunwen/2495273.html