Monocular Depth Estimation with Self-Supervised Representation Learning

Şentürk, Ufuk Umut

dc.contributor.advisor	İkizler Cinbiş, Nazlı
dc.contributor.author	Şentürk, Ufuk Umut
dc.date.accessioned	2023-07-03T09:01:25Z
dc.date.issued	2022
dc.date.submitted	2022-09-19
dc.identifier.uri	https://hdl.handle.net/11655/33514
dc.description.abstract	Many representations and modalities, such as images, videos, and point clouds, have been developed for better scene understanding. In this thesis, we deliberately represent scenes as depth maps in order to leverage rich 3D information and to develop strong priors over the scene. Gathering ground truth for the depth estimation task is burdensome. To reduce this need for supervision, novel view synthesis is employed as a proxy task for depth estimation within the structure-from-motion (SfM) framework. Moreover, self-supervised representation learning for depth estimation has not been studied extensively, and the current state of self-supervised representation learning suggests that training may eventually not depend on ground truth annotations at all. Combining the two paradigms is a way of improving representations for scene understanding, which in turn leads to better practical developments. Specifically, we propose TripleDNet (Disentangled Distilled Depth Network), a multi-objective, distillation-based framework for purely self-supervised depth estimation. SfM-based depth prediction models use self-supervision while processing consecutive frames in a monocular setting. However, the static-world and illumination-constancy assumptions behind this supervision do not always hold, which introduces incorrect signals into the training procedure and leads to poor performance. Masking out the regions that violate these assumptions harms the integrity of the image structure. To compensate for these side effects of previous approaches, we add further objectives to SfM-based estimation to constrain the solution space and to allow feature space disentanglement within an efficient and simple architecture. In addition, we propose a knowledge distillation objective that benefits depth estimation in terms of scene context and structure. Surprisingly, we also found that initializing the model with self-supervised image representation learning frameworks outperforms initialization with their supervised counterparts. 
Experimental results show that the proposed models, trained purely in a self-supervised fashion without any ground truth knowledge or any prior based on ground truth, outperform state-of-the-art models on the KITTI and Make3D datasets on many metrics, including models that utilize ground truth segmentation maps or a feature-metric loss.	tr_TR
dc.language.iso	en	tr_TR
dc.publisher	Fen Bilimleri Enstitüsü	tr_TR
dc.rights	info:eu-repo/semantics/openAccess	tr_TR
dc.subject	Self-supervised representation learning	tr_TR
dc.subject	Scene representation
dc.subject	Depth estimation
dc.subject	Deep learning
dc.subject	Computer vision
dc.title	Monocular Depth Estimation with Self-Supervised Representation Learning	tr_TR
dc.type	info:eu-repo/semantics/masterThesis	tr_TR
dc.description.ozet	Many representations and modalities, such as images and videos, have been developed to understand scene context. Extracting the scene representation as depth maps offers many practical advantages, since depth maps contain rich 3D information and carry strong priors about the scene. Collecting ground truth depth maps for the depth estimation task is burdensome. For this reason, novel view synthesis is used as a proxy task for solving depth estimation within the structure-from-motion (SfM) framework. Moreover, self-supervised representation learning for depth estimation has not been studied extensively, and the current state of self-supervised representation learning signals that ground truth will not be needed for training at all. Combining the two paradigms is a way of creating better representations for better scene understanding, which leads to better practical developments. In this work, we propose TripleDNet (Disentangled Distilled Depth Network), a multi-objective, distillation-based framework for purely self-supervised depth estimation. SfM-based depth estimation models supervise themselves while processing consecutive frames in a monocular depth estimation manner. However, because the static-world and illumination-constancy assumptions break in the real world, they allow incorrect signals to reach the training procedure, which leads to poor performance. In addition, masking out these parts harms the integrity of the image structure. To constrain the solution space and to allow disentanglement of the feature space within an efficient, simple architecture, we add further objectives to SfM-based estimation. In addition, we propose a knowledge distillation approach that benefits depth estimation in terms of scene context and structure. Surprisingly, we also discovered that self-supervised image representation learning frameworks used for model initialization perform better than their counterparts supervised with ground truth. 
Experimental results show that the proposed models, trained in a purely self-supervised fashion, perform better on the KITTI and Make3D datasets than state-of-the-art models and than models that use segmentation maps as ground truth.	tr_TR
dc.contributor.department	Bilgisayar Mühendisliği	tr_TR
dc.embargo.terms	Open access	tr_TR
dc.embargo.lift	2023-07-03T09:01:26Z
dc.funding	None	tr_TR

Files in this item

Name:: Hacettepe_Thesis (5).pdf
Size:: 7.449Mb
Format:: PDF
Description:: MSc Thesis

This item appears in the following Collection(s)

Bilgisayar Mühendisliği Bölümü Tez Koleksiyonu [212]
