Deep Learning Architectures for Collective Activity Recognition
Date
2019-09-30
Author
Zalluhoğlu, Cemil
Embargo
Open
Abstract
Collective activity recognition, which analyses the behavior of groups of people in videos, is an essential goal of video surveillance systems. In this thesis, we propose three new methods and one novel dataset for the collective activity recognition task. In the first method, we propose a new multi-stream convolutional neural network architecture that utilizes information extracted from multiple regions. The proposed method is the first to use a multi-stream network and multiple regions for this problem. We explore various strategies for fusing multiple spatial and temporal streams. We evaluate the proposed method on two benchmark datasets, the Collective Activity Dataset and the Volleyball Dataset. Our experimental results show that the proposed method improves collective activity recognition performance compared to state-of-the-art approaches.
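One common way to fuse streams is late fusion, where each stream produces class scores and the scores are averaged. The sketch below illustrates that idea only. It is not the thesis's architecture, and the function names, logits, and weights are hypothetical values chosen for illustration.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def late_fusion(stream_logits, weights=None):
    """Weighted average of per-stream class probabilities.

    stream_logits: one list of logits per stream, all of equal length.
    weights: optional per-stream weights; defaults to a uniform average.
    """
    n = len(stream_logits)
    if weights is None:
        weights = [1.0 / n] * n
    probs = [softmax(logits) for logits in stream_logits]
    num_classes = len(probs[0])
    return [sum(w * p[c] for w, p in zip(weights, probs))
            for c in range(num_classes)]

# Hypothetical logits from one spatial stream and one temporal stream.
spatial = [2.0, 0.5, 0.1]
temporal = [0.3, 1.8, 0.2]
fused = late_fusion([spatial, temporal], weights=[0.4, 0.6])
predicted = max(range(len(fused)), key=fused.__getitem__)
```

Because the weights sum to one, the fused vector is still a probability distribution. Other fusion strategies, such as concatenating features before the classifier, would change where in the network the streams are combined.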
While working on this problem, we found that the existing datasets have many limitations and are insufficient for deep learning methods. We therefore introduce the "Collective Sports (C-Sports)" dataset, a novel benchmark dataset for multi-task recognition of both collective activity and sports categories. We evaluate various state-of-the-art techniques on this dataset, together with a multi-task variant, which demonstrates increased performance. The experimental results indicate that while sports category recognition is a relatively easy task, there is still room for improvement in collective activity recognition, especially in distant-view situations. We believe that the C-Sports dataset will stir further interest in this research direction.
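A multi-task variant of this kind is often trained with a weighted sum of one classification loss per task. The following is a minimal sketch under that assumption. The abstract does not state the loss that was used, and `alpha` and all function names here are hypothetical.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of the target class under a softmax."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

def multi_task_loss(activity_logits, sport_logits,
                    activity_label, sport_label, alpha=0.5):
    """Weighted sum of the two per-task losses.

    alpha is a hypothetical balancing weight in [0, 1] that trades off
    the collective activity task against the sports category task.
    """
    activity_loss = cross_entropy(activity_logits, activity_label)
    sport_loss = cross_entropy(sport_logits, sport_label)
    return alpha * activity_loss + (1.0 - alpha) * sport_loss

# Hypothetical logits for one clip: 3 activity classes, 4 sport classes.
loss = multi_task_loss([1.2, 0.1, -0.4], [0.3, 2.0, 0.0, -1.0],
                       activity_label=0, sport_label=1)
```

Both heads would share the same video features, so gradients from the sports task can also shape the representation used for collective activity recognition.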
Our second proposed method involves an attention mechanism. We utilize a soft attention mechanism for action recognition and collective activity recognition tasks. We use attention maps that have high response values in the regions of a video that need attention. We describe a method that uses this attention mechanism with two distinct 3D-ConvNet architectures: standard 3D-ConvNets (C3D) and inflated 3D-ConvNets (I3D). We evaluate our method on four benchmark datasets. Two of them, UCF101 and HMDB51, target action recognition; the other two, the Collective Activity Dataset and the Collective Sports Dataset, target collective activity recognition. Experimental results show that the 3D attention-based ConvNets improve performance on all datasets compared to the baselines, which are the same 3D-ConvNet architectures without an attention mechanism.
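Soft attention, in its generic form, normalizes a score per location with a softmax and uses the result to weight the features at those locations. The sketch below shows that generic form on a flattened list of locations. How the thesis computes the scores inside C3D and I3D is not described in the abstract, so the scores here are hypothetical inputs.

```python
import math

def softmax(scores):
    """Normalize scores into attention weights that sum to one."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention_pool(features, scores):
    """Attention-weighted sum of feature vectors.

    features: one feature vector per spatio-temporal location.
    scores: one unnormalized attention score per location.
    Returns the pooled vector and the attention map (the weights).
    """
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features))
              for d in range(dim)]
    return pooled, weights

# Hypothetical features at three locations; the second gets a high score.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.1, 3.0, 0.2]
pooled, attention_map = soft_attention_pool(features, scores)
```

Because every weight stays strictly positive, the operation is differentiable end to end, which is what distinguishes soft attention from hard attention that selects a single location.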
The last method proposed in this thesis combines relation reasoning with a 3D attention mechanism. We propose a 3D spatio-temporal relation network. We build this architecture by extending the base Temporal Relation Network (TRN) step by step. First, a 2D attention mechanism is added to the TRN architecture. Then the 2D architecture is moved into 3D space. Finally, a 3D attention mechanism is added to the 3D TRN architecture. We evaluate these networks on one activity recognition dataset, Something-Something v1, and three collective activity recognition datasets: Collective Activity, Collective Sports, and Volleyball. Our results show that the methods with an attention mechanism improve recognition performance. In addition, the 3D networks obtain better accuracy than the 2D networks.
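As background for the base network, a two-frame temporal relation can be written as h applied to the sum of g(f_i, f_j) over all frame pairs with i < j, where g and h are small learned networks. The sketch below assumes that formulation and replaces g and h with toy functions. It does not include the attention or 3D extensions described above, and all names are hypothetical.

```python
from itertools import combinations

def two_frame_relation(frame_features, g, h):
    """Compute h(sum over i < j of g(concat(f_i, f_j))).

    frame_features: per-frame feature vectors in temporal order.
    g: maps a concatenated pair of frame features to a vector.
    h: maps the summed pair responses to the final output.
    """
    total = None
    # combinations() keeps input order, so every pair satisfies i < j.
    for f_i, f_j in combinations(frame_features, 2):
        response = g(f_i + f_j)  # list concatenation of the two frames
        if total is None:
            total = list(response)
        else:
            total = [t + r for t, r in zip(total, response)]
    return h(total)

# Toy stand-ins for the learned networks: g measures the change between
# the earlier and the later frame, h is the identity.
def toy_g(pair):
    return [pair[1] - pair[0]]

def toy_h(summed):
    return summed

frames = [[1.0], [2.0], [3.0]]
relation = two_frame_relation(frames, toy_g, toy_h)
```

Keeping the pairs in temporal order is what lets the relation distinguish a sequence from its reverse: with the toy g above, reversing `frames` flips the sign of the result.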