UNSUPERVISED JOINT PART-OF-SPEECH TAGGING AND STEMMING FOR AGGLUTINATIVE LANGUAGES
Abstract
Part-of-speech (PoS) tagging is the task of assigning each word in a given sentence a tag that reflects its syntactic role, such as verb, noun, or adjective. Various approaches have already been proposed for this task. However, in morphologically rich and productive agglutinative languages, the number of possible word forms is theoretically infinite. This variety of word forms causes a sparsity problem in tagging for agglutinative languages. In this thesis, we address this problem by performing PoS tagging and stemming simultaneously. Stemming is the process of finding the stem of a word by removing its suffixes. Joint PoS tagging and stemming reduces sparsity by using stems and suffixes instead of full words. Furthermore, we incorporate semantic features based on neural word embeddings to capture the similarity between stems and their derived forms.
We present a fully unsupervised Bayesian model based on a Hidden Markov Model (HMM) for joint PoS tagging and stemming in agglutinative languages. The results indicate that using stems and suffixes rather than full words outperforms a simple word-based Bayesian HMM, especially for agglutinative languages. Incorporating semantic features yields a significant improvement in stemming.
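As a rough illustration of the sparsity argument only, and not of the Bayesian HMM itself, the following sketch counts distinct types in a toy corpus before and after splitting words into stems and suffixes. The segmentations are hand-picked Turkish examples assumed for illustration, and the function name `vocabulary_sizes` is hypothetical.

```python
from collections import Counter

# Hand-segmented toy data (stem, suffix); an assumption for illustration,
# not output of the model described in this thesis.
SEGMENTED = [
    ("ev", "ler"), ("ev", "de"), ("ev", "den"),
    ("el", "ler"), ("el", "de"), ("el", "den"),
    ("dil", "ler"), ("dil", "de"), ("dil", "den"),
]


def vocabulary_sizes(segmented):
    """Return the number of distinct (word forms, stems, suffixes)."""
    words = Counter(stem + suffix for stem, suffix in segmented)
    stems = Counter(stem for stem, _ in segmented)
    suffixes = Counter(suffix for _, suffix in segmented)
    return len(words), len(stems), len(suffixes)


print(vocabulary_sizes(SEGMENTED))  # (9, 3, 3)
```

In this toy corpus every full word form occurs once, while each stem and each suffix occurs three times, so statistics collected over stems and suffixes rest on more observations per type than statistics collected over full words.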