Visual Representation Learning by Exploring Spatio-Temporal Consistency
Abstract
Video representation learning is a fundamental area of research in computer vision. It
focuses on developing methods and models to encode video data into definitive and
discriminative representations that can be effectively utilized in downstream tasks, such as
video classification, action recognition, video retrieval and video captioning. At its core,
video representation learning seeks to understand the scene, the actors, and the actors' relationships with
their surroundings, e.g. objects and scene elements. To learn how visual
content changes over time, a model needs to capture the temporal and spatial information inherently
encoded in video sequences and extract spatiotemporal representations that are useful
for downstream applications.
This thesis addresses the problem of self-supervised video representation learning focused
on motion features, aiming to capture features from foreground motion with reduced reliance
on background bias. Recent successful methods often employ instance discrimination
approaches, which entail heavy computation and may lead to inefficient and costly
pretraining. Although several works in the literature employ two-stream networks [1, 2] to
incorporate motion learning, this thesis seeks single-network solutions for learning
spatiotemporal features.
To this end, we utilize an augmentation technique called MAC (Mask-Augmentation teChnique).
MAC blends foreground motion using frame-difference-based masks and sets up a
pretext task to recognize the applied transformation. By having the model predict
the correct blending multiplier at the pretraining stage, we compel it to encode
motion-based features, which are then successfully transferred to downstream tasks such as
action recognition and video retrieval. Moreover, we extend our approach within a joint
contrastive framework and integrate additional tasks in the spatial and temporal domains to
further enhance representation capabilities. We also explore alternative methods for extracting motion
masks and implement optical-flow-based motion extraction to complement frame-difference-based
motion extraction.
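The blending step and its pretext label can be illustrated with a minimal NumPy sketch. This is not the thesis implementation: the abstract does not give the exact blending formula, so the multiplier set `MULTIPLIERS`, the `threshold` value, and the function names `frame_difference_mask` and `mask_blend` are all assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical set of blending multipliers; the index serves as the pretext label.
MULTIPLIERS = (0.25, 0.5, 0.75, 1.0)


def frame_difference_mask(clip, threshold=0.1):
    """Binary motion mask from absolute differences of consecutive frames.

    clip: float array of shape (T, H, W, C) with values in [0, 1], T >= 2.
    Returns an array of shape (T, H, W, 1). The last frame reuses the
    previous difference so the mask has the same length as the clip.
    The threshold is an assumed value, not taken from the thesis.
    """
    diff = np.abs(clip[1:] - clip[:-1]).mean(axis=-1, keepdims=True)
    diff = np.concatenate([diff, diff[-1:]], axis=0)
    return (diff > threshold).astype(clip.dtype)


def mask_blend(target, source, rng):
    """Blend the moving regions of `source` into `target`.

    A multiplier is sampled at random; its index is returned as the label
    that a network would be trained to predict in the pretext task.
    """
    mask = frame_difference_mask(source)
    label = int(rng.integers(len(MULTIPLIERS)))
    weight = MULTIPLIERS[label] * mask
    # Convex combination: pixels outside the motion mask keep the target value.
    blended = (1.0 - weight) * target + weight * source
    return blended, label


rng = np.random.default_rng(0)
target = rng.random((8, 16, 16, 3))
source = rng.random((8, 16, 16, 3))
blended, label = mask_blend(target, source, rng)
print(blended.shape, label)
```

In this sketch the pretext task reduces to a classification over `len(MULTIPLIERS)` classes, which is one simple way to realize "predicting the correct blending multiplier"; a regression target would be an equally plausible reading.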
We present experimental results on action recognition and video retrieval tasks to
demonstrate that our method achieves superior performance on the UCF-101, HMDB51
and Diving-48 datasets under low-resource settings and results competitive with instance
discrimination methods under high-compute settings. We carefully design ablation
experiments to analyze the learning behavior of the proposed methods and the contribution of each
component. Lastly, we present qualitative results to further illustrate the benefits of the presented
methods. We anticipate that our augmentation technique, along with the associated pretext
and contrastive learning objectives, will lay the groundwork for future advancements in
self-supervised video representation learning.
Link
https://hdl.handle.net/11655/36629