Neural Text Normalization for Turkish Social Media
Abstract
With its worldwide use, social media has become a rich data source for natural language processing (NLP) tasks; however, social media data are hard to process directly in language studies because of their noisy, non-canonical nature. Text normalization is the task of transforming noisy text into its canonical form. It generally serves as a preprocessing step for other NLP tasks applied to noisy text, and these tasks achieve higher success rates when performed on canonical text.
In this study, two neural approaches are applied to the Turkish text normalization task: a Contextual Normalization approach using distributed representations of words, and a Sequence-to-Sequence Normalization approach using encoder-decoder neural networks. Since the conventional approaches applied to Turkish, as well as to other languages, are mostly domain-specific, rule-based, or cascaded, they are becoming less efficient and less successful as language use in social media changes. The proposed methods therefore provide a more comprehensive solution that is not sensitive to such language change in social media.
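To make the second approach concrete, the sketch below shows a minimal character-level encoder-decoder normalizer in PyTorch. It is only an illustration of the general technique, not the paper's actual implementation: the character inventory, model sizes, and the example pair ("slm" as a noisy shorthand for "selam") are assumptions chosen for brevity.

```python
# Minimal character-level encoder-decoder sketch for text normalization.
# Illustrative only: vocabulary, hyperparameters, and the training pair
# are assumptions, not the configuration used in the paper.
import torch
import torch.nn as nn

PAD, SOS, EOS = 0, 1, 2
chars = "abcdefghijklmnopqrstuvwxyz "          # toy character inventory (assumption)
stoi = {c: i + 3 for i, c in enumerate(chars)}
VOCAB = len(stoi) + 3

def encode(text):
    """Map a string to character ids, appending an end-of-sequence marker."""
    return torch.tensor([stoi[c] for c in text] + [EOS])

class Seq2SeqNormalizer(nn.Module):
    def __init__(self, vocab=VOCAB, emb=32, hid=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb, padding_idx=PAD)
        self.encoder = nn.GRU(emb, hid, batch_first=True)
        self.decoder = nn.GRU(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab)

    def forward(self, src, tgt_in):
        # Encode the noisy token; the final hidden state seeds the decoder.
        _, h = self.encoder(self.emb(src))
        dec_out, _ = self.decoder(self.emb(tgt_in), h)
        return self.out(dec_out)                # per-step logits over characters

# One toy teacher-forced training step on a single (noisy, canonical) pair.
model = Seq2SeqNormalizer()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD)

src = encode("slm").unsqueeze(0)                # noisy input
tgt = encode("selam")                           # canonical target
tgt_in = torch.cat([torch.tensor([SOS]), tgt[:-1]]).unsqueeze(0)
logits = model(src, tgt_in)
loss = loss_fn(logits.reshape(-1, VOCAB), tgt.unsqueeze(0).reshape(-1))
loss.backward()
opt.step()
```

At inference time, decoding would start from the SOS symbol and emit characters greedily (or with beam search) until EOS, producing the canonical form one character at a time.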