A Plagiarism Detection System Based on POS Tag N-Grams

Yalçın, Kadir

dc.contributor.advisor	Çiçekli, İlyas
dc.contributor.author	Yalçın, Kadir
dc.date.accessioned	2022-10-20T07:57:51Z
dc.date.issued	2022
dc.date.submitted	2022-05-13
dc.identifier.citation	K. Yalcin, "A Plagiarism Detection System Based on POS Tag N-Grams", Doctoral Dissertation, Hacettepe University, Ankara, 2022.	tr_TR
dc.identifier.uri	http://hdl.handle.net/11655/26924
dc.description.abstract	Finding similar passages in two different documents or texts is a common problem. In particular, a text suspected of plagiarism is likely to share characteristics with its source text. Plagiarism is defined as taking some or all of another person's writing and presenting it as one's own, or expressing the ideas of others in different ways without citing the source. Plagiarism cases have been observed to increase as technology has developed. Therefore, to prevent plagiarism, universities have adopted various plagiarism detection programs and added principles on plagiarism and scientific ethics to their education regulations. This thesis proposes a novel method for detecting external plagiarism. Both syntactic and semantic similarity features are used to identify the plagiarized parts of a text. Part-of-speech (POS) tags are used to identify the plagiarized sections of suspicious texts and the corresponding original sections in the source texts. Each source sentence is indexed by a search engine according to its POS tag n-grams so that candidate plagiarism sentences can be retrieved rapidly. Suspicious sentences, converted to their POS tag n-grams, are used as queries to retrieve source sentences. The search engine results returned for these queries make it possible to detect the plagiarized parts of the suspicious document. The semantic relationship between two given words is calculated with Word2Vec, a word embedding method. The longest common subsequence (LCS) algorithm is applied to calculate semantic similarity at the sentence level. The PAN-PC-11 dataset, which was created to evaluate automated plagiarism detection algorithms, is used in this thesis. The tests are carried out with different parameters and threshold values to evaluate the diversity of the results. According to the experimental results on this dataset, the proposed method achieved the best performance in low and high obfuscation plagiarism cases compared to the plagiarism detection systems in the 3rd International Plagiarism Detection Competition (PAN11).	tr_TR
dc.language.iso	en	tr_TR
dc.publisher	Fen Bilimleri Enstitüsü	tr_TR
dc.rights	info:eu-repo/semantics/openAccess	tr_TR
dc.subject	Plagiarism detection	tr_TR
dc.subject	Natural language processing	tr_TR
dc.subject	POS tagging	tr_TR
dc.subject	Semantic similarity	tr_TR
dc.subject.lcsh	Computer engineering	tr_TR
dc.title	A Plagiarism Detection System Based on POS Tag N-Grams	tr_TR
dc.type	info:eu-repo/semantics/doctoralThesis	tr_TR
dc.description.ozet	Finding similar elements in two different documents or texts is a frequently encountered problem. In particular, a text suspected of plagiarism is likely to share characteristics with the source text it was plagiarized from. Plagiarism is the act of taking some or all of the writings of other people and presenting them as one's own, or expressing the ideas of others in different ways without citing the source. With the development of technology, a steady increase in plagiarism cases is being observed. For this reason, in order to prevent plagiarism, universities have begun to use various plagiarism detection programs, and principles on plagiarism and scientific ethics have been added to education and training regulations. This thesis proposes an original method for external plagiarism detection. Both syntactic and semantic similarity features are used to identify the plagiarized parts of a text. Part-of-speech (POS) tag n-grams are used to detect the plagiarized sections of suspicious texts and the original sections corresponding to them in the source texts. Each source sentence is indexed by a search engine according to its POS tag n-grams so that candidate plagiarism sentences can be accessed quickly. Suspicious sentences converted to POS tag n-grams are used as queries to access source sentences. The search engine results returned for these queries make it possible to detect the plagiarized parts of the suspicious document. The semantic relationship between two given words is calculated with Word2Vec, a technique that uses word embeddings. The longest common subsequence (LCS) algorithm is applied to calculate semantic similarity at the sentence level. The PAN-PC-11 dataset, which was created to evaluate automated plagiarism detection algorithms, is used in this thesis. The tests are carried out with different parameters and threshold values to evaluate the diversity of the results. According to the test results on this dataset, the proposed method achieved the best performance in low and high obfuscation plagiarism cases compared to the plagiarism detection systems in the 3rd International Plagiarism Detection Competition (PAN11).	tr_TR
dc.contributor.department	Computer Engineering	tr_TR
dc.embargo.terms	Open access	tr_TR
dc.embargo.lift	2022-10-20T07:57:51Z
dc.funding	None	tr_TR
dc.subtype	software	tr_TR
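
The retrieval and scoring steps described in the abstract can be illustrated with a small sketch. This is not the thesis implementation: the in-memory inverted index stands in for the search engine, `toy_word_sim` stands in for Word2Vec similarity, and the function names, the trigram size, the `min_shared` and `threshold` values, and the normalization by the shorter sentence length are all assumptions made for illustration. Sentences are given as lists of (word, POS tag) pairs that are assumed to come from some external tagger.

```python
from collections import defaultdict


def pos_ngrams(tags, n=3):
    """Return the POS tag n-grams of one sentence, each joined with '_'."""
    return ["_".join(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def build_index(source_sentences, n=3):
    """Map each POS tag n-gram to the ids of the source sentences containing it."""
    index = defaultdict(set)
    for sent_id, tagged in enumerate(source_sentences):
        tags = [tag for _, tag in tagged]
        for gram in pos_ngrams(tags, n):
            index[gram].add(sent_id)
    return index


def candidates(index, suspicious, n=3, min_shared=2):
    """Return source sentence ids sharing at least min_shared distinct n-grams
    with the suspicious sentence, best match first."""
    tags = [tag for _, tag in suspicious]
    counts = defaultdict(int)
    for gram in set(pos_ngrams(tags, n)):
        for sent_id in index.get(gram, ()):
            counts[sent_id] += 1
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [sent_id for sent_id, shared in ranked if shared >= min_shared]


def lcs_similarity(words_a, words_b, word_sim, threshold=0.8):
    """Length of the longest common subsequence, where two words count as a
    match when word_sim(a, b) >= threshold, divided by the shorter length
    (the normalization is an assumption of this sketch)."""
    if not words_a or not words_b:
        return 0.0
    rows, cols = len(words_a), len(words_b)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if word_sim(words_a[i - 1], words_b[j - 1]) >= threshold:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[rows][cols] / min(rows, cols)


# Hypothetical stand-in for a Word2Vec similarity lookup.
SYNONYMS = {frozenset(("cat", "kitten")), frozenset(("mat", "rug"))}


def toy_word_sim(a, b):
    """Return 1.0 for identical words or listed synonym pairs, else 0.0."""
    return 1.0 if a == b or frozenset((a, b)) in SYNONYMS else 0.0


# Made-up example data: two source sentences and one suspicious sentence.
source = [
    [("the", "DT"), ("cat", "NN"), ("sat", "VBD"),
     ("on", "IN"), ("the", "DT"), ("mat", "NN")],
    [("dogs", "NNS"), ("bark", "VBP"), ("loudly", "RB")],
]
suspicious = [("a", "DT"), ("kitten", "NN"), ("sat", "VBD"),
              ("on", "IN"), ("a", "DT"), ("rug", "NN")]

index = build_index(source)
hits = candidates(index, suspicious)
score = lcs_similarity([w for w, _ in source[hits[0]]],
                       [w for w, _ in suspicious], toy_word_sim)
```

In this toy example the suspicious sentence shares none of its content nouns with the first source sentence, yet it has the same POS tag sequence, so the n-gram lookup still retrieves it as a candidate before the word-level similarity is computed. In practice a trained embedding model would replace `toy_word_sim`, and a full-text search engine would replace the dictionary index.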

Files in this item:

Name: Thesis_A_Plagiarism_Detection_ ...
Size: 2.031Mb
Format: PDF
Description: Thesis File

This item appears in the following collection(s):

Computer Engineering Department Thesis Collection [267]
