Offensive Language Detection in Turkish Twitter Data with BERT Models

Özberk, Anıl

Date

2022

Embargo Period

Open access

Abstract

As insulting statements become more frequent on online platforms, they provoke reactions and disturb the peace of society. Identifying such expressions as early as possible is important for protecting victims. Research on offensive language detection has been increasing in recent years, and the Offensive Language Identification Dataset (OLID) was introduced to facilitate it. The examples in OLID were retrieved from Twitter and annotated manually. The Offensive Language Identification Task comprises three subtasks. In Subtask A, the goal is to classify each example as offensive or non-offensive, where an example is offensive if it contains insults, threats, or profanity. Datasets in five languages, including Turkish, were offered for this subtask. The other two subtasks categorize offense types (Subtask B) and offense targets (Subtask C), and they focus mainly on English. This study explores how Bidirectional Encoder Representations from Transformers (BERT) models and fine-tuning methods affect offensive language detection on Turkish Twitter data. The BERT models we use are pre-trained on Turkish corpora, and our fine-tuning methods are designed with the Turkish language and Twitter data in mind. The study emphasizes the importance of the pre-trained BERT model for a downstream task. In addition, experiments are conducted with classical models such as logistic regression, decision tree, random forest, and support vector machine (SVM).
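
The abstract does not describe the Turkish- and Twitter-specific processing in detail, so the following is only a minimal sketch of the kind of text normalization such a pipeline might apply before tokenization. The placeholder tokens `@user` and `<url>` and the function names `turkish_lower` and `normalize_tweet` are assumptions made for illustration and are not taken from the thesis.

```python
import re

# Hypothetical placeholder tokens; not taken from the thesis.
USER_TOKEN = "@user"
URL_TOKEN = "<url>"

_URL_RE = re.compile(r"https?://\S+|www\.\S+")
_MENTION_RE = re.compile(r"@\w+")
_SPACE_RE = re.compile(r"\s+")


def turkish_lower(text: str) -> str:
    """Lowercase with Turkish casing rules: 'I' -> 'ı' and 'İ' -> 'i'."""
    # str.lower() alone maps 'I' to 'i' and 'İ' to 'i' plus a combining dot,
    # so the two Turkish-specific letters are replaced first.
    return text.replace("I", "ı").replace("İ", "i").lower()


def normalize_tweet(text: str) -> str:
    """Replace URLs and mentions with placeholders, then lowercase."""
    text = _URL_RE.sub(URL_TOKEN, text)
    text = _MENTION_RE.sub(USER_TOKEN, text)
    text = turkish_lower(text)
    # Collapse runs of whitespace left over from the substitutions.
    return _SPACE_RE.sub(" ", text).strip()


if __name__ == "__main__":
    print(normalize_tweet("@Ahmet IŞIK çok İYİ   https://t.co/abc"))
```

Whether lowercasing helps at all depends on whether the chosen pre-trained model is cased or uncased, which the abstract does not specify.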

Link

http://hdl.handle.net/11655/26131

Collections

Computer Engineering Department Thesis Collection