Generating Action Description Text From Skeleton Key Points Sequence
Abstract
Although numerous sign language datasets are available, they generally cover only a small fraction of the thousands of signs used worldwide. Moreover, creating diverse sign language datasets is both costly and difficult, largely because of the expense of assembling a diverse group of signers. Motivated by these challenges, we set out to devise a solution that overcomes these constraints. To this end, we framed our approach as a cyclic system in which one component generates text from skeletons and another generates skeletons from text, each feeding the other. Our motivation stemmed from the question of whether we could generate skeletons for thousands of signs in an unsupervised and efficient manner. Within this framework, we concentrated on generating textual descriptions of body movements from sequences of skeleton keypoints, which resulted in a new dataset. This dataset is based on AUTSL, a comprehensive dataset of isolated Turkish Sign Language. In addition, we developed a baseline model, SkelCap, that generates textual descriptions of body movements. The model treats skeleton keypoint data as vectors, applies a fully connected layer for embedding, and uses a transformer network for sequence-to-sequence modeling. We evaluated the model extensively in both signer-agnostic and sign-agnostic settings. It delivered promising results, achieving a ROUGE-L score of 0.98 and a BLEU-4 score of 0.94 in the signer-agnostic evaluation. Following these results, we turned to producing skeletons from text; experiments including adversarial training were carried out, but successful results were not achieved within the duration of this thesis. The dataset we developed, AUTSL-SkelCap, will be made publicly available soon.
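The architecture described above (per-frame keypoint vectors, a fully connected embedding layer, and a transformer for sequence-to-sequence modeling) can be sketched as follows. This is a minimal illustrative sketch in PyTorch, not the thesis implementation: the class name `SkelCapSketch`, the keypoint count, vocabulary size, and all layer dimensions are assumptions chosen for readability.

```python
import torch
import torch.nn as nn

class SkelCapSketch(nn.Module):
    """Hypothetical SkelCap-style model: skeleton keypoint sequence -> text tokens.

    All sizes below are illustrative assumptions, not the actual configuration.
    """

    def __init__(self, num_keypoints=27, coords=2, d_model=64,
                 vocab_size=100, nhead=4, num_layers=2):
        super().__init__()
        # Fully connected embedding of each flattened per-frame keypoint vector.
        self.embed = nn.Linear(num_keypoints * coords, d_model)
        # Token embedding for the (shifted) target text sequence.
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        # Transformer encoder-decoder for sequence-to-sequence modeling.
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        # Project decoder states back to vocabulary logits.
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, skel, tokens):
        # skel:   (batch, frames, num_keypoints * coords) keypoint vectors
        # tokens: (batch, text_len) target token ids
        src = self.embed(skel)
        tgt = self.tok_embed(tokens)
        h = self.transformer(src, tgt)
        return self.out(h)  # (batch, text_len, vocab_size)

model = SkelCapSketch()
skel = torch.randn(1, 30, 27 * 2)           # 30 frames of 27 2-D keypoints
tokens = torch.zeros(1, 5, dtype=torch.long)  # 5 target tokens
logits = model(skel, tokens)
print(logits.shape)  # torch.Size([1, 5, 100])
```

At generation time such a model would be run autoregressively, feeding each predicted token back into the decoder; the sketch shows only a single teacher-forced forward pass.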