Building a Fake News Detection Model for Turkish
Abstract
The transformation of print media into online media, along with the increasing use of the internet and social media as news sources, has drastically altered the concept of media literacy and people's news consumption habits. Both individuals and organizations have therefore started to make extensive use of online news sites and social media platforms to receive news. This easier, faster, and comparatively cheaper channel offers convenience in terms of access to information, but it also creates serious problems when the potential effects of news are considered. News now spreads at an unprecedented pace, volume, and variety, creating a verification and filtering challenge that cannot be handled by human effort alone. The tendency of people to spread ambiguous or fake news more readily than valid news makes the problem even more difficult.
As the negative consequences of fake news increase, social media platforms, commercial organizations, institutions, and even states have started to take their own measures against this asymmetric threat. This situation has motivated researchers to analyze this type of news, most of which is composed of text, and accordingly to build scientific infrastructures and develop intelligent systems for detecting fake news. Detection of textual deception lies at the intersection of the Text Analysis/Mining and Natural Language Processing disciplines. The problem has recently become a field of research that requires many methodologies to be used in combination, such as big data analysis, artificial intelligence, and machine learning.
News verification and labeling systems are designed to detect or exploit complex and interrelated components such as textual deception, fabricated content, news source credibility, truth verification, sarcasm, and crowd-sourcing. In particular, performance in detecting textual deception is directly related to the maturity of language-specific resources and libraries. Many studies in the field of Fake News Detection are largely specific to the English language. Although some analyses and models can be generalized, a characteristic analysis of many languages, including Turkish, is essential to achieve correct results.
Although the developments of the last five years have constituted a body of academic knowledge for the field, the literature still lacks scholarly studies on Turkish Fake News Detection. This makes it necessary to address the issue by emphasizing language-specific characteristics for agglutinative languages such as Turkish. The main framework and basic hypothesis of our study is the notion that potential fake news can be detected by hybrid methodologies using novel linguistics-oriented approaches. To this end, the aim is to create an extensible framework model for Turkish Fake News Detection. Within the scope of this thesis, studies were carried out to reach three main goals. The first of these objectives is to release the data set compiled for the thesis as an open resource for all disciplines. The second main goal is to develop the Turkish Fake News Lexicon, built from this data set, thereby contributing a different perspective to the problem. The third goal is to develop a sustainable model with a hybrid approach to Turkish Fake News Detection and to present it through experimental studies.
The thesis is constructed in three phases. The first phase is the introductory phase, covering the literature review, the examination of existing analyses of the problem, data set collection, and all preliminary preparations for the later phases. In this phase, a large collection was created by labeling and verifying all of the data gathered from mainstream news outlets, a news verification platform, and manual methods. This phase covers all the preliminary work required for the Turkish Fake News Lexicon to be developed in the second phase.
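To make the collection concrete, a labeled record might look like the following minimal sketch; the field names (text, source, verified_by, label) and the label values are illustrative assumptions rather than the actual schema used in the thesis.

```python
# Hypothetical record layout for the labeled news collection; field names and
# label values are illustrative assumptions, not the thesis's actual schema.
record = {
    "text": "...",                          # full Turkish news text
    "source": "mainstream_outlet",          # mainstream outlet, verification platform, or manual entry
    "verified_by": "verification_platform",
    "label": "fake",                        # e.g. "fake" or "valid"
}
```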
In the second phase, the data set obtained in the first phase was subjected to statistical analysis, with the aim of developing a four-category Turkish Fake News Language Model/Lexicon, which we named FanLexTR. For each category, term frequencies and relative occurrence frequencies were calculated, and tone values (in other words, score values) were assigned to the terms. Assigning polarity values to the terms was also attempted, but because polarity values can obscure the fluctuations between fake and valid, we used tone values, which gave more stable results. Lexicon-based textual deception detection was then performed with FanLexTR. With respect to the method used, this is the first lexicon-based Fake News Detection study in the literature, and it is quite successful in terms of its results.
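The following minimal sketch illustrates the general idea of lexicon-based scoring under simplified assumptions: the lexicon is reduced to two of its categories (fake vs. valid), the example entries and tone values are invented, and the actual FanLexTR format and scoring procedure in the thesis may differ.

```python
from collections import Counter

# Hypothetical FanLexTR fragment: each term maps to per-category tone scores.
# Entries and values are invented for illustration only.
fanlex_tr = {
    "iddia": {"fake": 0.8, "valid": 0.2},
    "resmi": {"fake": 0.1, "valid": 0.9},
}

def tone_totals(tokens, lexicon):
    """Aggregate the fake/valid tone values of the tokens found in the lexicon."""
    totals = {"fake": 0.0, "valid": 0.0}
    for term, freq in Counter(tokens).items():
        if term in lexicon:
            totals["fake"] += freq * lexicon[term]["fake"]
            totals["valid"] += freq * lexicon[term]["valid"]
    return totals

def classify(tokens, lexicon):
    """Label a document by comparing its aggregate fake vs. valid tone."""
    totals = tone_totals(tokens, lexicon)
    return "fake" if totals["fake"] > totals["valid"] else "valid"

print(classify("iddia edilen resmi açıklama".split(), fanlex_tr))
```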
In the third and last phase, the specific characteristics of Turkish were taken into account, the problem was approached from different perspectives, and feature selection and feature extraction procedures for fake news detection were carried out. The lexicon-based approach developed in the second phase was added as a feature to the machine learning and deep learning approaches, improving on the results obtained in the second phase. Finally, the performance of the resulting model was evaluated.
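As one way to picture how a lexicon score can be combined with other features, the sketch below concatenates a single aggregated tone column with TF-IDF features and trains a simple classifier; the lexicon fragment, the logistic regression stand-in for the thesis's machine learning and deep learning models, and the toy data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical lexicon lookup; in practice this would come from FanLexTR.
fanlex_tr = {"iddia": 0.8, "resmi": -0.7}

def lexicon_feature(docs):
    """Reduce each document to a single aggregated tone-score column."""
    scores = [sum(fanlex_tr.get(t, 0.0) for t in doc.lower().split()) for doc in docs]
    return np.array(scores).reshape(-1, 1)

model = Pipeline([
    ("features", FeatureUnion([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),        # word and bigram features
        ("lexicon", FunctionTransformer(lexicon_feature, validate=False)),
    ])),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Toy training data, invented purely to show the pipeline running end to end.
texts = ["iddia edilen haber yalanlandı", "resmi kaynak açıklama yaptı"]
labels = [1, 0]  # 1 = fake, 0 = valid
model.fit(texts, labels)
print(model.predict(["iddia var"]))
```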
Additionally, the thesis proposes an innovative solution for putting the approach into practice and introduces a framework incorporating digital librarianship technology.