Data Augmentation For Natural Language Processing
Date
2024
Author
Çataltaş, Mustafa
Embargo Period
Open access
Abstract
Advanced deep learning models have greatly improved performance on a wide range of natural language processing tasks. Although they perform best with abundant data, acquiring large datasets for every task is not always feasible. Data augmentation techniques address this by creating synthetic samples from existing data, yielding more comprehensive datasets. This thesis examines the efficacy of autoencoders as a textual data augmentation technique aimed at improving the performance of classification models in text classification tasks. The analysis compares four distinct autoencoder types: the traditional Autoencoder (AE), the Adversarial Autoencoder (AAE), the Denoising Adversarial Autoencoder (DAAE), and the Variational Autoencoder (VAE). The study also investigates how different word embedding types, preprocessing methods, label-based filtering, and the number of training epochs affect autoencoder performance. Experimental evaluations are conducted on the SST-2 sentiment classification dataset, which contains 7791 training instances; for the data augmentation experiments, subsets of 100, 200, 400, and 1000 randomly selected instances from this dataset were employed, with data augmented at ratios of 1:1, 1:2, 1:4, and 1:8. Comparative analysis with baseline models demonstrates the superiority of AE-based data augmentation methods at a 1:1 augmentation ratio. These findings underscore the effectiveness of autoencoders as data augmentation methods for optimizing text classification performance in NLP applications.
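As a rough illustration of the experimental design described in the abstract (subsets of 100 to 1000 SST-2 instances, augmentation ratios of 1:1 through 1:8), the following sketch computes the synthetic-sample budget for each configuration. The helper functions are hypothetical and only mirror the stated setup; they are not taken from the thesis itself.

```python
# Illustrative sketch of the augmentation budgets implied by the abstract.
# Subset sizes and ratios come from the text; the functions are hypothetical.

SUBSET_SIZES = [100, 200, 400, 1000]  # randomly sampled SST-2 training subsets
RATIOS = [1, 2, 4, 8]                 # augmentation ratios 1:1, 1:2, 1:4, 1:8


def synthetic_budget(n_real: int, ratio: int) -> int:
    """Synthetic samples an autoencoder must generate for an n_real-sized
    subset at a 1:ratio (real:synthetic) augmentation ratio."""
    return n_real * ratio


def augmented_size(n_real: int, ratio: int) -> int:
    """Total training-set size after adding the synthetic samples."""
    return n_real + synthetic_budget(n_real, ratio)


if __name__ == "__main__":
    for n in SUBSET_SIZES:
        for r in RATIOS:
            print(f"subset={n:4d}  ratio=1:{r}  "
                  f"synthetic={synthetic_budget(n, r):4d}  "
                  f"total={augmented_size(n, r):4d}")
```

For example, the largest configuration (1000 real instances at 1:8) requires 8000 synthetic samples, giving a 9000-instance training set; label-based filtering, as investigated in the thesis, would discard generated samples before this budget is met, so in practice the generator may need to oversample.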