Understanding Actions in Instructional Videos
Yalçınkaya Şimşek, Özge
Despite promising improvements in human activity recognition for fundamental actions, understanding actions in instructional web videos remains a challenging research problem. Instructional web videos demonstrate everyday how-to tasks such as cooking or repairing. These tasks involve multiple fine-grained, complex actions whose order and appearance vary with the person demonstrating them. In addition to the challenge of building a robust action localization model for fine-grained actions, a vast amount of labeled data is needed. Because labeling such data is difficult, studies propose self-supervised video representation learning models that leverage the narrations in the videos. Using automatically generated transcriptions of the narrators' speech, joint embedding spaces are trained to learn video-text similarities, so that pre-defined instructional action steps can be localized from the video-text similarity information. However, learning such models has a drawback: background video clips containing non-action scenes can receive high scores for specific action steps because of the misleading transcriptions paired with them during training, which degrades action localization results. Detecting such harmful backgrounds and distinguishing them from action clips is therefore essential. In this thesis, we investigate improving action localization on instructional videos by describing actions and backgrounds with a novel representation that enforces their discrimination in the joint feature space. We show that using the similarity score cues of a baseline video-text model to describe each video clip reinforces the discrimination of action clips from backgrounds. The idea rests on the hypothesis that background video clips tend to obtain uniformly distributed low similarity scores over all action step labels, whereas action clips receive high scores for specific action steps.
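The score-distribution hypothesis above can be sketched as follows. This is a minimal illustration, not the thesis's actual model: the helper name `actionness_cue`, the softmax normalization, and the entropy/max summary are assumptions chosen to make the intuition concrete — a background clip's near-uniform scores yield high entropy and a low peak, while an action clip's peaked scores yield low entropy and a high peak.

```python
import math

def actionness_cue(similarity_scores):
    """Summarize a clip's video-text similarity scores over all action-step
    labels (illustrative helper, not the thesis's exact representation)."""
    # Softmax-normalize the raw scores into a distribution over steps.
    m = max(similarity_scores)
    exps = [math.exp(s - m) for s in similarity_scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Entropy of the distribution: a near-uniform (background-like) clip
    # approaches log(num_steps); a peaked (action-like) clip is much lower.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return {"max_score": max(similarity_scores), "entropy": entropy}

# A clip peaking on one specific step vs. a clip with flat, low scores.
action_clip = actionness_cue([0.1, 0.2, 3.5, 0.1])
background_clip = actionness_cue([0.1, 0.15, 0.1, 0.12])
```

Under this sketch, the pair (max score, entropy) separates the two cases: the action-like clip has the higher peak and lower entropy.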
We jointly train a binary model that encodes the actionness of each video clip from this discriminative representation together with visual features; here, the actionness score is the probability that a given video clip contains an action. We then use the actionness scores to post-process the action localization and action segmentation scores of baseline video-text models. We present results on the CrossTask and COIN datasets and show that a small labeled set is sufficient for learning action/background discrimination, although data quality matters more than quantity. We therefore investigate augmenting the training data with action images collected from the web and utilizing an image-text model. We show promising results for future directions, along with challenges stemming from the bottlenecks of the model and dataset.
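The post-processing step can be illustrated with a small sketch. The function name, the `step_scores`/`actionness` layout, and the multiplicative re-weighting are illustrative assumptions, not the thesis's exact formulation: the point is only that clips the binary model deems background have their per-step scores suppressed before localization.

```python
def rescore_with_actionness(step_scores, actionness):
    """Re-weight baseline per-clip step scores by actionness (a sketch).

    step_scores[t][k]: baseline video-text similarity of clip t to step k.
    actionness[t]: learned probability that clip t contains any action.
    The multiplicative form is an assumption for illustration.
    """
    return [
        [s * actionness[t] for s in clip_scores]
        for t, clip_scores in enumerate(step_scores)
    ]

rescored = rescore_with_actionness(
    [[0.9, 0.2], [0.8, 0.7]],  # two clips, two action steps
    [1.0, 0.1],                # second clip is likely background
)
```

After re-weighting, the likely-background clip no longer outscores the action clip on any step, so a localization rule that picks the best clip per step is less often fooled by misleading transcriptions.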
Showing items related by title, author, creator and subject.
Demiralp, B; Chen, HL; Koh, AJ; Keller, ET; McCauley, LK (Endocrine Soc, 2002) PTH has anabolic and catabolic actions in bone that are not clearly understood. The protooncogene c-fos and other activating protein 1 family members are critical transcriptional mediators in bone, and c-fos is up-regulated ...
Cells Of The Osteoclast Lineage As Mediators Of The Anabolic Actions Of Parathyroid Hormone In Bone Koh, A.J; Demiralp, B; Neiva, K.G; Hooten, J; Nohutcu, R.M; Shim, H; Datta, N.S; Taichman, R.S; McCauley, L.K. (2005)