A Plagiarism Detection System Based on POS Tag N-Grams
Abstract
Finding similar passages in two different documents or texts is a common problem. In particular, a text suspected of plagiarism is likely to share characteristics with its source text. Plagiarism is defined as presenting some or all of another person's writing as one's own, or expressing the ideas of others in different words without citing the source. Plagiarism cases have been observed to increase as technology has developed. To prevent plagiarism, universities have therefore adopted various plagiarism detection programs and have added principles on plagiarism and scientific ethics to their education regulations.
In this thesis, a novel method for detecting external plagiarism is proposed. Both syntactic and semantic similarity features are used to identify the plagiarized parts of a text. Part-of-speech (POS) tags are used to identify the plagiarized sections of suspicious texts and the corresponding original sections in the source texts. Each source sentence is indexed by a search engine according to its POS tag n-grams so that candidate plagiarized sentences can be retrieved rapidly. Suspicious sentences are converted to their POS tag n-grams and used as queries to retrieve source sentences. The results returned by the search engine make it possible to detect the plagiarized parts of the suspicious document. The semantic relationship between two given words is calculated with Word2Vec, a word embedding method. At the sentence level, semantic similarity is calculated with the longest common subsequence (LCS) algorithm.
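The retrieval and sentence-comparison steps described above can be illustrated with a minimal sketch. It is not the thesis implementation: a plain in-memory inverted index stands in for the search engine, the sentences are assumed to be POS-tagged already (the Penn Treebank style tags below are made-up sample data), and the LCS compares tokens by exact match rather than by Word2Vec similarity. All function and variable names are hypothetical.

```python
from collections import defaultdict


def pos_ngrams(tags, n=3):
    """Return the POS tag n-grams of one sentence as a list of tuples."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def build_index(source_tags, n=3):
    """Map each POS n-gram to the ids of the source sentences containing it."""
    index = defaultdict(set)
    for sent_id, tags in source_tags.items():
        for gram in pos_ngrams(tags, n):
            index[gram].add(sent_id)
    return index


def candidates(index, query_tags, n=3):
    """Rank source sentences by how many distinct query n-grams they share."""
    hits = defaultdict(int)
    for gram in set(pos_ngrams(query_tags, n)):
        for sent_id in index.get(gram, ()):
            hits[sent_id] += 1
    # Highest overlap first; ties broken by sentence id for determinism.
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:  # exact match; an embedding-based test could go here
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


# Assumed sample data: two pre-tagged source sentences and one suspicious one.
source_tags = {
    "s1": ["DT", "NN", "VBZ", "DT", "JJ", "NN"],
    "s2": ["PRP", "VBD", "IN", "DT", "NN"],
}
query_tags = ["DT", "NN", "VBZ", "DT", "NN"]

index = build_index(source_tags)
ranked = candidates(index, query_tags)

source_tokens = ["the", "cat", "sits", "on", "the", "mat"]
suspicious_tokens = ["the", "cat", "is", "on", "a", "mat"]
overlap = lcs_length(source_tokens, suspicious_tokens)
```

In this sketch, `ranked` lists the source sentences that share at least one POS trigram with the suspicious sentence, and `overlap` could be normalized by sentence length and compared against a threshold to accept or reject a candidate pair. The choice of n, the normalization, and the threshold are left open here because the abstract does not specify them.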
The PAN-PC-11 dataset, which was created to evaluate automated plagiarism detection algorithms, is used in this thesis. Tests are carried out with different parameters and threshold values to evaluate how the results vary. According to the experimental results on this dataset, the proposed method achieved the best performance in low and high obfuscation plagiarism cases compared with the plagiarism detection systems that participated in the 3rd International Plagiarism Detection Competition (PAN11).