Retweet Prediction on Earthquake Tweets
Date
2024-09-14
Author
İnce, Sevginur
Open access
Abstract
On February 6, 2023, an earthquake centered in Kahramanmaraş killed or injured many people. In the aftermath of such devastating earthquakes, the efficiency of communication channels in the crisis zone is vitally important. Social media did not exist a few decades ago, but today its platforms have become people's main communication channels. Twitter, one of these platforms, is widely used in Turkey. Social media makes it possible to reach millions of people with a single shared post, and the more interaction a post receives, the more likely it is to be noticed by other users. In this thesis, tweets posted during and after the earthquake centered in Kahramanmaraş on February 6, 2023 were divided into two classes according to the amount of retweet interaction they received: 'non-low' and 'moderate-high'. The data were collected with Python's snscrape library and cover 38 days, from February 6, 2023 to March 15, 2023. The following operations were then performed in order: the tweet text was cleaned; spelling mistakes were corrected with the Python Zemberek module; words were reduced to their roots with the Zeyrek module; and stop words were removed. The dataset was simplified, and the IDF values of the unique words in the first week's tweets were calculated. Unique words were grouped according to their IDF value ranges. By adding 400 unique words from different IDF ranges to the dataset, 7 dataset versions consisting of different unique word groups were obtained. Among these sets, the word set that best represents the tweet text was investigated. The XGBoost model was used in this analysis. The interaction type and class threshold that would yield the best class label were also investigated. The best class label was 'Retweet', and the best class threshold was found to be 2. The words that best represent the dataset were found to be the 400 words with the lowest IDF values.
These words were added to the dataset as a binary bag of words. Classification was then performed with several machine learning and deep learning models: Random Forest, XGBoost, LSTM and DistilBERTurk. The XGBoost model gave the best performance, with the following results: non-low class precision 0.75, recall 0.70, F1 score 0.73; moderate-high class precision 0.72, recall 0.77, F1 score 0.74. Average accuracy was 0.7340 and the ROC-AUC score was 0.81.
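
The feature construction described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis code. The function names are hypothetical, the IDF formula log(N / df) is an assumed unsmoothed variant, and placing a retweet count of exactly 2 in the 'moderate-high' class is an assumption, since the abstract does not state either detail.

```python
import math
from collections import Counter


def label_by_retweets(retweet_count, threshold=2):
    """Map a retweet count to one of the two classes.

    Assumption: counts at or above the threshold are 'moderate-high'.
    """
    return "moderate-high" if retweet_count >= threshold else "non-low"


def idf_values(tokenized_tweets):
    """Compute IDF per unique word as log(N / document frequency).

    Assumption: plain, unsmoothed IDF; the thesis may use another variant.
    """
    n_docs = len(tokenized_tweets)
    doc_freq = Counter()
    for tokens in tokenized_tweets:
        doc_freq.update(set(tokens))  # count each word once per tweet
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def lowest_idf_vocabulary(idf, size=400):
    """Return the `size` words with the lowest IDF (ties broken alphabetically)."""
    ranked = sorted(idf.items(), key=lambda item: (item[1], item[0]))
    return [word for word, _ in ranked[:size]]


def binary_bag_of_words(tokens, vocabulary):
    """Encode a tweet as 0/1 flags: 1 if the vocabulary word occurs in it."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


# Toy example with already cleaned and stemmed tokens (illustrative data only).
tweets = [
    ["deprem", "yardım"],
    ["deprem", "enkaz"],
    ["deprem", "yardım", "çadır"],
]
idf = idf_values(tweets)
vocabulary = lowest_idf_vocabulary(idf, size=2)
features = [binary_bag_of_words(tokens, vocabulary) for tokens in tweets]
```

In this sketch the most frequent words receive the lowest IDF values, so selecting the lowest-IDF words keeps the terms that appear in the largest share of tweets. The resulting 0/1 vectors could then be passed, together with the class labels, to any of the classifiers named in the abstract.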