Extreme Low Resolution Video Action Recognition
Abstract
The rapid evolution of machine learning and deep learning has enabled robust solutions to complex and computationally intensive problems, particularly in describing, identifying, and segmenting video content. The success of these solutions rests on large amounts of high-resolution video data. However, such high-quality video scenes contain private information about individuals and environments; people have limited rights over this data, and the storage media are vulnerable to cyber attacks. Moreover, storing and processing high-resolution videos is increasingly costly. Extremely low-resolution video samples (12 × 16 pixels) offer a far lower storage cost while revealing no private information, but they also carry very limited temporal and spatial information. This thesis focuses on recognizing actions in extremely low-resolution videos and proposes novel deep learning-based approaches.
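To make the 12 × 16 setting concrete, the sketch below downsamples a single video frame by block averaging; this is only a minimal illustration with assumed frame sizes (e.g. 240 × 320), not the exact downsampling procedure used in the thesis, which may rely on standard bicubic resizing instead.

```python
import numpy as np

def downsample_frame(frame: np.ndarray, out_h: int = 12, out_w: int = 16) -> np.ndarray:
    """Average-pool an (H, W, C) frame down to (out_h, out_w, C).

    Assumes H and W are integer multiples of out_h and out_w
    (e.g. 240x320 -> 12x16); real pipelines may use bicubic resizing.
    """
    h, w, c = frame.shape
    fh, fw = h // out_h, w // out_w
    # Split the frame into fh x fw blocks and average each block.
    return frame.reshape(out_h, fh, out_w, fw, c).mean(axis=(1, 3))

# A 240x320 RGB frame reduces to an extremely low-resolution 12x16 frame.
frame = np.random.rand(240, 320, 3)
low_res = downsample_frame(frame)
```

At this resolution a frame holds only 192 pixels, which is why the temporal and spatial cues available to a recognition model are so limited.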
In this context, we create extremely low-resolution versions of the high-resolution video action recognition datasets commonly used in the literature, namely UCF-101 and HMDB-51, yielding a scenario directly comparable to prior work. To enrich the limited temporal and spatial information in these low-resolution data, we apply a scene-based super-resolution algorithm. We then develop new deep learning models based on knowledge distillation to recognize actions in extremely low-resolution videos. The teacher networks are pre-trained on the high-resolution counterparts of the relevant datasets, and we propose novel deep architectures for the student network, which learns the temporal and spatial information available at low resolution. In addition, we define new feature-based distillation loss functions for training these student networks, and we propose cross-resolution attention modules that make the transfer of information from teacher to student more effective during training, demonstrating their potential uses experimentally.
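The basic form of a feature-based distillation term can be sketched as follows. This is a minimal numpy illustration under assumed feature dimensions; the actual loss functions proposed in the thesis are defined over intermediate network features and are more elaborate, and the projection `proj` is a hypothetical learned alignment from student to teacher feature dimensions.

```python
import numpy as np

def feature_distillation_loss(student_feat: np.ndarray,
                              teacher_feat: np.ndarray,
                              proj: np.ndarray) -> float:
    """Mean-squared error between projected student features and
    frozen teacher features: the generic form of a feature-based
    distillation term. `proj` maps student dims to teacher dims."""
    aligned = student_feat @ proj  # (batch, d_teacher)
    return float(np.mean((aligned - teacher_feat) ** 2))

rng = np.random.default_rng(0)
student = rng.normal(size=(8, 64))    # features from the low-resolution student
teacher = rng.normal(size=(8, 128))   # features from the high-resolution teacher
proj = rng.normal(size=(64, 128))     # learned projection (random here)
loss = feature_distillation_loss(student, teacher, proj)
```

In training, such a term is minimized jointly with the classification loss, pulling the student's low-resolution representations toward the teacher's high-resolution ones.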
To further address this challenging problem, we also enlarge the information spaces exploited by our deep learning architectures. We develop a new teacher model that efficiently transfers information from the optical flow space, which models motion in video scenes, to the extremely low-resolution space, and we show the effect of this new teacher structure with detailed experiments. Then, building on the fact that high-frequency information in high-resolution scenes represents spatial detail, we use frequency-space features as transfer targets for the student network, for the first time in the extremely low-resolution video action recognition literature. Our experiments show that our approach achieves state-of-the-art (SoA) accuracy on the UCF-101 dataset and competitive recognition accuracy on the HMDB-51 dataset.
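The idea of isolating high-frequency spatial detail can be sketched with a simple FFT-based high-pass filter on a grayscale frame. This is only an assumed illustration of the general principle; the frequency-space features actually distilled in the thesis are defined within the network, and the `cutoff` radius here is an arbitrary choice.

```python
import numpy as np

def high_frequency_component(frame: np.ndarray, cutoff: int = 4) -> np.ndarray:
    """Suppress low frequencies of a grayscale (H, W) frame, keeping
    the high-frequency content that encodes fine spatial detail."""
    spectrum = np.fft.fftshift(np.fft.fft2(frame))
    h, w = frame.shape
    cy, cx = h // 2, w // 2
    # Zero out a (2*cutoff) x (2*cutoff) low-frequency window at the center.
    spectrum[cy - cutoff:cy + cutoff, cx - cutoff:cx + cutoff] = 0
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum)))

gray = np.random.rand(24, 32)      # a single grayscale frame
detail = high_frequency_component(gray)
```

A perfectly flat frame contains only the DC component, so its high-frequency part vanishes; edges and textures, by contrast, survive the filter, which is exactly the spatial detail lost in extreme downsampling.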