Learning Sign Languages with Limited Supervision and Semantic Representations
Abstract
Sign languages evolve and change over time, much like spoken languages. Annotating the signs of a sign language at scale for supervised learning is impractical. Scalable recognition approaches are therefore needed for sign language recognition (SLR), especially for signs that have very few or even no annotated examples. In this thesis, we tackle three novel problems that involve limited supervision for sign language recognition: zero-shot sign language recognition (ZSSLR), generalized zero-shot sign language recognition (GZSSLR), and few-shot sign language recognition.
The idea in ZSSLR is to use models learned over the seen sign classes to recognize instances of unseen sign classes. In GZSSLR, the learned model is evaluated not only on unseen sign classes but also on seen ones. In this context, freely available textual definitions and attribute descriptions of signs are used as semantic class representations for knowledge transfer. We have collected and processed this auxiliary textual and attribute information for sign language signs. In this thesis, we provide three benchmark datasets, together with their supporting text and attribute descriptions, to study these two novel problem scenarios in depth.
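As an illustration of how such semantic class representations can be formed, the following minimal Python sketch averages pretrained word vectors over a sign's textual description and appends its binary attributes. The word-vector table, dimensionality, and function names are hypothetical and stand in for whatever text encoder the thesis actually uses.

```python
# Minimal sketch of building a semantic class representation from a sign's
# textual description and binary attributes. `word_vectors` is a hypothetical
# pretrained embedding table; the thesis's actual text encoder may differ.
import numpy as np

def class_embedding(description, attributes, word_vectors, dim=300):
    """Average the word vectors of a textual sign description and append
    the sign's binary attribute flags (e.g. one-handed, repeated movement)."""
    tokens = description.lower().split()
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    text_part = np.mean(vecs, axis=0) if vecs else np.zeros(dim)
    return np.concatenate([text_part, np.asarray(attributes, dtype=float)])
```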
To address the (G)ZSSLR task, we propose two methods that construct spatio-temporal models of body and hand regions. The feature construction process consists of two steps: (i) a pre-trained three-dimensional convolutional neural network extracts features from short video snippets of a sign video to capture short-term dynamics, and (ii) recurrent neural networks operate over these snippet features to capture longer-term dynamics. We show that text- and attribute-based class definitions, together with the spatio-temporal models of body and hands, provide effective information for recognizing previously unseen sign classes within a zero-shot learning framework.
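A minimal PyTorch sketch of this two-step pipeline is given below. The frozen 3D CNN backbone is abstracted away as precomputed snippet features, and an LSTM stands in for the recurrent aggregation step; layer sizes and names are illustrative, not those of the thesis implementation.

```python
# Illustrative sketch of the two-step spatio-temporal feature pipeline:
# a pretrained 3D CNN (not shown) yields one feature vector per short video
# snippet, and a recurrent network aggregates them over the whole sign video.
import torch
import torch.nn as nn

class SignVideoEncoder(nn.Module):
    def __init__(self, snippet_dim=1024, hidden_dim=512):
        super().__init__()
        self.lstm = nn.LSTM(snippet_dim, hidden_dim, batch_first=True)

    def forward(self, snippet_features):
        # snippet_features: (batch, num_snippets, snippet_dim), one row per
        # short snippet, as produced by the frozen 3D CNN.
        _, (h_n, _) = self.lstm(snippet_features)
        return h_n[-1]  # (batch, hidden_dim): long-term video representation

# Example: 8 snippets of 1024-d 3D CNN features for a batch of 2 videos.
video_repr = SignVideoEncoder()(torch.randn(2, 8, 1024))
```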
We additionally propose two techniques to investigate the impact of binary attributes on correct and incorrect zero-shot predictions. In particular, a flip difference operator is defined to estimate the impact of each attribute on classification.
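The following sketch conveys the flip difference idea under the assumption that classification relies on a compatibility score between a video embedding and a class's binary attribute vector; the exact scoring function and operator definition in the thesis may differ.

```python
# Hedged sketch of a flip difference operator: flip each binary attribute of
# the predicted class in turn and measure how much the compatibility score
# drops, as an estimate of that attribute's impact on the prediction.
import numpy as np

def flip_differences(score_fn, video_emb, attributes):
    """score_fn  : callable(video_emb, attribute_vector) -> float
    video_emb : np.ndarray video representation
    attributes: np.ndarray of 0/1 attribute values for the predicted class"""
    base = score_fn(video_emb, attributes)
    impacts = np.empty(len(attributes))
    for i in range(len(attributes)):
        flipped = attributes.copy()
        flipped[i] = 1 - flipped[i]            # flip a single attribute
        impacts[i] = base - score_fn(video_emb, flipped)
    return impacts  # large positive value -> attribute supports the prediction
```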
In addition, we approach the data scarcity problem in sign language recognition from a different perspective: few-shot meta-learning, where the goal is to recognize novel sign classes, each of which has only a few labeled samples. This setting arises because some sign classes have many more annotated samples than others due to their widespread use in daily life. Our approach trains a model over many sub-tasks, each constructed from task-specific data in a supervised manner, so that the model generalizes to novel sign classes with only a few ground-truth annotated examples.
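To make the sub-task (episode) construction concrete, the sketch below samples N-way K-shot episodes from a labeled dataset; the model is then trained on the support set and evaluated on the query set of each episode. Function names and episode sizes are illustrative assumptions, not the thesis's exact protocol.

```python
# Minimal sketch of episodic few-shot sampling: each episode draws n_way
# classes with k_shot labeled "support" examples and n_query held-out
# "query" examples per class. Classes must have at least k_shot + n_query
# videos to be sampled.
import random
from collections import defaultdict

def sample_episode(dataset, n_way=5, k_shot=1, n_query=5):
    """dataset: list of (video, label) pairs; returns support and query sets
    with labels re-indexed to the episode's 0..n_way-1 range."""
    by_class = defaultdict(list)
    for video, label in dataset:
        by_class[label].append(video)
    classes = random.sample(list(by_class), n_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        videos = random.sample(by_class[cls], k_shot + n_query)
        support += [(v, episode_label) for v in videos[:k_shot]]
        query += [(v, episode_label) for v in videos[k_shot:]]
    return support, query  # train on support, evaluate loss on query
```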
Our experimental results on all three tasks, together with detailed analyses, show that the proposed methods are effective in recognizing both seen and unseen sign class examples. We anticipate that the presented methods and datasets will serve as a foundation for further research in scalable sign language recognition.