İçerik Tabanlı Atıf Analizi Modeli Tasarımı: Türkçe Atıflar için Metin Kategorizasyonuna Dayalı Bir Uygulama

Date
2017
Author
Taşkın, Zehra
Abstract
Counting the citations a work receives is one of the main components of measuring research and researcher performance. Academics receive incentives, promotions, or rewards contingent on these citation counts. Although citations were originally recorded to identify publications related to one another, their use has changed as a result of these incentives. This sometimes leads to unethical practices such as the manipulation of citation counts. Consequently, it has become necessary to analyze the content of citations in addition to evaluating them quantitatively.
The main aim of this study is to design a content-based citation analysis model for Turkish citations. To this end, 423 refereed articles published in the library and information science literature in Turkey are thoroughly examined. First, all metadata, references, and full texts of the articles are stored in a database to create the model; the database holds a total of 12,881 references and 101,019 sentences. Then the main taxonomic categories are determined, and citation sentences are classified into these categories by tagging them through an inter-annotator agreement process. In the last stage, the performance of the classification is tested using the Weka software, and a content-based citation analysis model is presented in light of these performance ratios.
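The abstract does not say which statistic was used in the inter-annotator agreement process. As an illustration only, the sketch below computes Cohen's kappa, a common agreement measure for two annotators; the choice of kappa, the function name `cohen_kappa`, and the toy labels are all assumptions and are not taken from the study.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same sentences.

    Assumption: kappa is used here only as a familiar example of an
    agreement statistic; the source does not name the measure it used.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set")
    n = len(labels_a)
    # Observed agreement: share of sentences given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for six citation sentences (invented for illustration).
annotator_1 = ["positive", "positive", "negative", "neutral", "neutral", "neutral"]
annotator_2 = ["positive", "neutral", "negative", "neutral", "neutral", "positive"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In a workflow like this, sentences on which the annotators disagree would typically be discussed and relabelled before being used as training data.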
In this study, citations are divided into four main categories: citations in terms of meanings, citations in terms of purposes, citations in terms of shapes, and citations in terms of arrays. Each category is then divided into sub-categories. The sub-categories are positive, negative, and neutral citations for meaning; literature, definition, method, data and data validation for purpose; and mentioning the author's name, multiple citations in a single sentence, and citation using direct quotation for shape. In evaluating citations in terms of arrays, the sections in which citations appear (introduction, method, etc.), the number of times each reference is used, and the number of citations in different sections of the text are considered.
In the machine categorization of citations, a word tokenizer producing 1-grams and 2-grams is chosen as the preprocessing method, and the application is run with stop words preserved, because stop words are important in determining citation classes. Following preprocessing, classification performance is tested with the Weka software, and over 90% performance is achieved for each of the three machine-classified main categories (meaning, purpose, and shape).
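The preprocessing step can be pictured with a minimal sketch of a word-level 1-2 gram tokenizer that keeps stop words. This is not the Weka tokenizer used in the study; the function name `word_ngrams`, the regular-expression splitting, and the example sentence are assumptions made for illustration.

```python
import re


def word_ngrams(sentence, n_min=1, n_max=2):
    """Return word n-grams (n_min..n_max) of a sentence, keeping stop words.

    Assumption: plain lowercasing and splitting on non-word characters.
    Python's str.lower() is not Turkish-aware (dotted/dotless I), so a real
    pipeline for Turkish text would need locale-specific case folding.
    """
    words = re.findall(r"\w+", sentence.lower())
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n words over the sentence.
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams


# Hypothetical citation sentence: "however this method is insufficient".
tokens = word_ngrams("ancak bu yöntem yetersizdir")
```

Because no stop-word list is applied, function words such as "ancak" (however) and "bu" (this) survive both as unigrams and inside bigrams, which is the property the abstract describes as important for separating citation classes.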
The Naive Bayes Multinomial algorithm is used to classify citations in terms of meaning (performance ratio 96.5%) and purpose (performance ratio 90.4%). In classifying citations by meaning, the lowest performance is found for negative citations. This finding is consistent with the argument that authors express negative citations in more allusive language. Studies in the literature show that performance can be improved when other natural language processing techniques, such as sentiment dictionaries or parsing, are added to the analysis, so success rates may be increased in future studies by adding such analyses. In the classification of citation purposes, the best performance is obtained for data validation citations and the lowest for method and definition citations. The main reason is thought to be that definitions are often given while the method is being explained. The Random Forests algorithm is used to classify citations in terms of shape and achieves a success rate of 92%. The highest performance is obtained for citations that mention author names and the lowest for citations given in quotation marks.
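To make the meaning classification concrete, here is a minimal pure-Python sketch of a multinomial Naive Bayes classifier over word 1-2 gram counts with Laplace smoothing. It is not Weka's implementation and does not reproduce the reported figures; the class `MultinomialNB`, the helper `features`, and the six English training sentences are invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def features(sentence):
    """Word unigrams and bigrams, stop words kept (assumed preprocessing)."""
    words = re.findall(r"\w+", sentence.lower())
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams


class MultinomialNB:
    """Multinomial Naive Bayes with add-alpha (Laplace) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, sentences, labels):
        self.class_counts = Counter(labels)          # documents per class
        self.feature_counts = defaultdict(Counter)   # n-gram counts per class
        self.vocab = set()
        for sentence, label in zip(sentences, labels):
            feats = features(sentence)
            self.feature_counts[label].update(feats)
            self.vocab.update(feats)
        return self

    def log_scores(self, sentence):
        total_docs = sum(self.class_counts.values())
        feats = [f for f in features(sentence) if f in self.vocab]  # skip unseen
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)    # log prior
            for feat in feats:
                score += math.log((counts[feat] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, sentence):
        scores = self.log_scores(sentence)
        return max(scores, key=scores.get)


# Hypothetical toy training data (not from the study's corpus).
train = [
    ("this approach is successful and useful", "positive"),
    ("the results are successful and consistent", "positive"),
    ("however this approach is insufficient", "negative"),
    ("however the results are not consistent", "negative"),
    ("the study examined library users", "neutral"),
    ("the article was published in a journal", "neutral"),
]
model = MultinomialNB().fit([s for s, _ in train], [y for _, y in train])
predicted = model.predict("however the method is insufficient")
```

On this toy data the stop word "however" and the bigram "however the" carry most of the evidence for the negative class, which illustrates why removing stop words could hurt this kind of classifier.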
The results show that, in the Turkish library and information science literature, citation sentences are generally placed in the introduction and literature review sections (85%), while negative and data validation citations appear in the findings and conclusions sections. Additionally, citations that mention the names of the cited authors are generally found in conclusion sections. It is determined that 67% of the references are cited only once in the texts, and 6% are not cited in the texts at all. In addition, 1% of the citations in the texts are not found in the reference lists. This suggests that writers and editors should be more careful when citing and editing papers.
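The three shares reported above (references cited once, references never cited, and in-text citations missing from the reference list) can be computed from matched citation data along the following lines. This is a sketch under the assumption that every in-text citation has already been resolved to a reference identifier; the function name `citation_usage` and the identifiers are hypothetical.

```python
from collections import Counter


def citation_usage(reference_ids, in_text_citations):
    """Summarise how the reference list is used in the text.

    reference_ids: identifiers of the entries in the reference list.
    in_text_citations: one identifier per in-text citation (assumed to be
    already matched to reference identifiers, which is the hard part in
    practice and is not shown here).
    """
    if not reference_ids or not in_text_citations:
        raise ValueError("both inputs must be non-empty")
    counts = Counter(in_text_citations)
    listed = set(reference_ids)
    cited_once = [r for r in reference_ids if counts[r] == 1]
    never_cited = [r for r in reference_ids if counts[r] == 0]
    unlisted = [c for c in in_text_citations if c not in listed]
    return {
        "share_cited_once": len(cited_once) / len(reference_ids),
        "share_never_cited": len(never_cited) / len(reference_ids),
        "share_citations_without_reference": len(unlisted) / len(in_text_citations),
    }


# Hypothetical article: four references, five in-text citations.
summary = citation_usage(
    reference_ids=["R1", "R2", "R3", "R4"],
    in_text_citations=["R1", "R2", "R2", "R3", "R9"],
)
```

Here "R4" is listed but never cited and "R9" is cited but not listed, the two kinds of inconsistency the abstract asks writers and editors to watch for.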
Through the content-based citation analysis model, this study presents the fundamental points that researchers, editors, and managers or decision makers should take into consideration when evaluating citations. The model also defines the tasks ideally assigned to each role in the scholarly communication process. The most important issue at this point is for all parties involved to realize that the meaning of a citation is not the same in every case. Once such awareness is in place, it may be possible to minimize manipulation carried out through citations.
URI
http://www.bby.hacettepe.edu.tr/akademik/zehrataskin/file/ZT_PhD_web.pdf
http://hdl.handle.net/11655/3486