Automatic Generation of Scientific Terminology with Deep Learning
Karaman, İpek Nur
xmlui.mirage2.itemSummaryView.MetaDataShow full item record
Automatic term extraction is an essential task in natural language processing. In this thesis, we work on terminology extraction for two purposes. The first aim is to measure inconsistency of scientific terminology for different scientific disciplines. Terminology consistency in scientific writing is important for the dissemination of scientific information among researchers. In this thesis, we propose a metric that measures terminology inconsistency and we measure terminology inconsistency for different scientific disciplines by using automatic term extraction and statistical machine translation. Our results showed that the order of scientific groups by inconsistency in terminology is: PHY (Physical Sciences and Engineering) > SOC (Social and Behavioral Sciences) > LIF (Life Sciences). We also survey for verification of the results and survey results support our study. The second aim of this thesis is to leverage multilinguality with joint multilingual learning based on deep learning methods with sequence labeling and improve terminology extraction performance in Turkish by utilizing English data. Automatic term extraction using deep learning achieves promising results if sufficient training data exists. Unfortunately, some languages may lack these resources in some scientific domains causing poor performance due to under-fitting. In this thesis, we propose a joint multilingual deep learning model with sequence labeling to extract terms, trained on multilingual data and aligned word embeddings to tackle this problem. Our evaluation results demonstrate that the multilingual model provides an improvement for automatic term extraction task when it is compared with a monolingual model trained with limited training data. Although the improvement rate varies according to domain and the size of the data, our evaluation shows that the highest improvement in F1-score is 10.1 % in the domain of Computer Science, the least improvement is 7.6 % in the domain of Electronic Engineering. Our multilingual model also achieves competitive results when it is compared with a monolingual model trained with sufficient training data.