Monocular Depth Estimation with Self-Supervised Representation Learning
Date: 2022
Author: Şentürk, Ufuk Umut
Open access
Abstract
Many representations and modalities, such as images, videos, and point clouds, have been developed for better scene understanding. In this thesis, we intentionally characterize the scene representation as depth maps in order to leverage rich 3D information and to develop strong priors over the scene. Gathering ground truth for the depth estimation task is burdensome. To alleviate this need for supervision, novel view synthesis is employed as a proxy task for depth estimation within the structure-from-motion (SfM) framework. Moreover, self-supervised representation learning for depth estimation has not been studied extensively, and the current state of self-supervised representation learning suggests that training may eventually require no dependence on ground truth annotations at all. Combining the two paradigms is a way of improving representations for better scene understanding, which in turn leads to better practical developments. Specifically, we propose TripleDNet (Disentangled Distilled Depth Network), a multi-objective, distillation-based framework for purely self-supervised depth estimation. Structure-from-motion-based depth prediction models obtain self-supervision by processing consecutive frames in a monocular depth estimation setting. However, the static-world and illumination-constancy assumptions do not always hold, which injects erroneous signals into the training procedure and leads to poor performance; masking out the offending regions, in turn, hurts the integrity of the image structure. To compensate for the side effects of previous approaches, we add further objectives to SfM-based estimation that constrain the solution space and allow feature-space disentanglement within an efficient and simple architecture. In addition, we propose a knowledge distillation objective that benefits depth estimation in terms of scene context and structure. Surprisingly, we also found that self-supervised image representation learning frameworks used for model initialization outperform their supervised counterparts. Experimental results show that the proposed models, trained purely in a self-supervised fashion without any ground truth knowledge or any prior derived from ground truth, outperform state-of-the-art models on the KITTI and Make3D datasets on many metrics, including models that utilize ground truth segmentation maps and feature-metric losses.
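For context, the view-synthesis proxy task referenced above is typically realized as a photometric reprojection loss: the target frame is reconstructed by warping an adjacent source frame using the predicted depth and relative camera pose, and the reconstruction error supervises the depth network. The sketch below is a minimal, generic PyTorch illustration of that idea, not the thesis implementation; the tensor names (depth, T, K, K_inv) and the plain L1 error are simplifying assumptions (the literature commonly mixes L1 with SSIM).

```python
# Minimal sketch of the SfM-style view-synthesis proxy objective (illustrative only).
import torch
import torch.nn.functional as F

def backproject(depth, K_inv):
    """Lift each target pixel to a 3D point using the predicted depth.
    depth: (B, 1, H, W); K_inv: (B, 3, 3) inverse camera intrinsics."""
    B, _, H, W = depth.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=depth.dtype, device=depth.device),
        torch.arange(W, dtype=depth.dtype, device=depth.device),
        indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=0).view(1, 3, -1)   # homogeneous pixels (1, 3, H*W)
    cam = K_inv @ pix                                          # camera rays (B, 3, H*W)
    return cam * depth.view(B, 1, -1)                          # scale rays by depth

def project(points, K, T):
    """Apply the relative pose T (B, 4, 4) and intrinsics K (B, 3, 3);
    return pixel coordinates in the source view, shape (B, 2, N)."""
    B, _, N = points.shape
    ones = torch.ones(B, 1, N, dtype=points.dtype, device=points.device)
    hom = torch.cat([points, ones], dim=1)                      # (B, 4, N)
    cam = (T @ hom)[:, :3]                                      # points in source camera frame
    pix = K @ cam
    return pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)             # perspective divide

def photometric_loss(target, source, depth, T, K, K_inv):
    """Warp `source` into the target view and compare photometrically (L1 here)."""
    B, _, H, W = target.shape
    pix = project(backproject(depth, K_inv), K, T).view(B, 2, H, W)
    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    grid_x = 2.0 * pix[:, 0] / (W - 1) - 1.0
    grid_y = 2.0 * pix[:, 1] / (H - 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=-1)                # (B, H, W, 2)
    warped = F.grid_sample(source, grid, padding_mode="border", align_corners=True)
    return (warped - target).abs().mean()
```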
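Likewise, the knowledge distillation objective mentioned above can be thought of as matching the depth network's intermediate features to those of a frozen teacher. The snippet below is only a generic feature-distillation sketch under that assumption; the thesis's actual objective and choice of teacher may differ.

```python
# Generic feature-level distillation term (illustrative assumption, not the thesis objective).
import torch
import torch.nn.functional as F

def distillation_loss(student_feats, teacher_feats):
    """L2 distance between channel-normalized student and frozen teacher features.
    Both tensors: (B, C, H, W); the teacher is run under torch.no_grad()."""
    s = F.normalize(student_feats, dim=1)
    t = F.normalize(teacher_feats.detach(), dim=1)
    return (s - t).pow(2).mean()
```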