UNSUPERVISED JOINT PART-OF-SPEECH TAGGING AND STEMMING FOR AGGLUTINATIVE LANGUAGES

Bölücü, Necva

dc.contributor.advisor	Can Buğlalılar, Burcu
dc.contributor.author	Bölücü, Necva
dc.date.accessioned	2017-07-31T13:33:16Z
dc.date.available	2017-07-31T13:33:16Z
dc.date.issued	2017
dc.date.submitted	2017-06-23
dc.identifier.uri	http://hdl.handle.net/11655/3814
dc.description.abstract	Part-of-speech (PoS) tagging is the task of assigning each word in a sentence a tag that reflects its syntactic role, such as verb, noun, or adjective. Various approaches have been proposed for this task. However, in morphologically rich and productive agglutinative languages, the number of word forms is theoretically infinite, and this variety causes a sparsity problem in tagging. In this thesis, we address the problem by performing PoS tagging and stemming simultaneously. Stemming is the process of finding the stem of a word by removing its suffixes. Joint PoS tagging and stemming reduces sparsity by using stems and suffixes instead of full words. Furthermore, we incorporate semantic features, in the form of neural word embeddings, to capture the similarity between stems and their derived forms. We present a fully unsupervised Bayesian model based on a Hidden Markov Model (HMM) for joint PoS tagging and stemming in agglutinative languages. The results indicate that using stems and suffixes rather than full words outperforms a simple word-based Bayesian HMM, especially for agglutinative languages. Incorporating semantic features yields a significant improvement in stemming.	tr_TR
dc.description.sponsorship	TÜBİTAK EEEAG-115E464.	tr_TR
dc.description.tableofcontents	ABSTRACT ÖZET ACKNOWLEDGMENTS CONTENTS FIGURES TABLES ABBREVIATIONS 1. INTRODUCTION 1.1. Overview 1.2. Motivation 1.3. Research Questions 2. BACKGROUND 2.1. Linguistic Background 2.2. Machine Learning Background 2.3. Inference 2.4. Conclusion 3. RELATED WORK 3.1. Introduction 3.2. Literature Review on Unsupervised Part of Speech Tagging 3.3. Literature Review of Cooperative Learning of Part of Speech Tagging 3.4. Literature Review on Stemming 3.5. Conclusion 4. MODEL 4.1. Introduction 4.2. Baseline Bayesian HMM Model 4.3. Joint Models for PoS Tagging and Stemming 5. EXPERIMENTS AND RESULTS 5.1. Datasets 5.2. Evaluation Metrics 5.3. Experiments 5.4. Conclusion 6. CONCLUSION 6.1. Conclusion 6.2. Future Research Directions A APPENDIX : PoS TAGSET REDUCTION B APPENDIX : Word2vec DATA C APPENDIX : RESULTS FOR 12K DATASETS REFERENCES	tr_TR
dc.language.iso	en	tr_TR
dc.publisher	Fen Bilimleri Enstitüsü	tr_TR
dc.rights	info:eu-repo/semantics/openAccess	tr_TR
dc.subject	NLP, natural language processing, HMM, PoS tagging, stemming, joint learning	tr_TR
dc.title	UNSUPERVISED JOINT PART-OF-SPEECH TAGGING AND STEMMING FOR AGGLUTINATIVE LANGUAGES	tr_TR
dc.title.alternative	SONDAN EKLEMELİ DİLLERDE GÖZETİMSİZ EŞZAMANLI SÖZCÜK TÜRÜ İŞARETLEME VE GÖVDELEME	tr_TR
dc.type	info:eu-repo/semantics/masterThesis	tr_TR
dc.description.ozet	Part-of-speech tagging is the assignment of an appropriate tag to each word in a sentence according to its syntactic role, such as verb, noun, or adjective. Various methods have been proposed for this task. In morphologically rich and productive agglutinative languages, the number of word forms is theoretically infinite. This variety of word forms creates a sparsity problem for tagging in agglutinative languages. In this thesis, we aim to overcome this problem by performing part-of-speech tagging and stemming simultaneously. Stemming is the process of finding the stem of a word by separating it from its suffixes. Joint part-of-speech tagging and stemming mitigates the sparsity problem by using stems and suffixes instead of words. In addition, we make use of semantic features to capture the similarity between a stem and the words derived from it. This thesis presents a fully unsupervised Bayesian Hidden Markov Model for joint part-of-speech tagging and stemming in agglutinative languages. The results show that, especially for agglutinative languages, using stems and their suffixes instead of words performs better than a word-based Bayesian HMM. Adding semantic features yields a marked improvement in stemming.	tr_TR
dc.contributor.department	Bilgisayar Mühendisliği	tr_TR
dc.contributor.authorID	230022	tr_TR

Files in this item:

Name: Necva-Bölücü-tez.pdf
Size: 4.492Mb
Format: PDF
Description: Master's Thesis

This item appears in the following collection(s):

Computer Engineering Department Thesis Collection
