Large-Scale Arabic Sentiment Corpus And Lexicon Building For Concept-Based Sentiment Analysis Systems
Nasser, Ahmed
In computer-based technologies, both the volume of collected data and its use are continuously rising. The growing processing and computational requirements of big data introduce new challenges, especially for Natural Language Processing (NLP) applications. One of these challenges is maintaining massive, information-rich linguistic resources, such as large-scale text corpora, that meet the requirements of big data handling, processing, and analysis for NLP applications. In this work, a large-scale sentiment corpus for the Arabic language, called GLASC, is presented; it was built using online news articles and metadata shared by the big data resource GDELT. The GLASC corpus consists of 620,082 news articles organized into three categories (Positive, Negative, and Neutral), and each news article has a sentiment rating score between -1 and 1. Several types of experiments were also carried out on the generated corpus, using a variety of machine learning algorithms, to build a document-level Arabic sentiment analysis system. To train the sentiment analysis models, different datasets were generated from the GLASC corpus using different feature extraction and feature weighting methods. A comparative study was performed, testing a wide range of classifiers and regression methods that are commonly used for the sentiment analysis task; in addition, several types of ensemble learning methods were investigated through comprehensive empirical experiments to verify their effect on the classification performance of sentiment analysis. This work also presents a concept-based sentiment analysis system for Arabic at the sentence level, using machine learning approaches and a concept-based sentiment lexicon.
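As a rough illustration of how a rating score between -1 and 1 can correspond to the three corpus categories, the following Python sketch maps a score to a label. The abstract does not state how GLASC draws the category boundaries, so the function name `label_from_score` and the `neutral_band` threshold of 0.05 are purely hypothetical assumptions.

```python
def label_from_score(score: float, neutral_band: float = 0.05) -> str:
    """Map a sentiment rating in [-1, 1] to a category label.

    neutral_band is an assumed, illustrative threshold; it is not
    taken from the GLASC corpus description.
    """
    if not -1.0 <= score <= 1.0:
        raise ValueError("score must lie between -1 and 1")
    if score > neutral_band:
        return "Positive"
    if score < -neutral_band:
        return "Negative"
    return "Neutral"


if __name__ == "__main__":
    for s in (0.62, -0.31, 0.01):
        print(s, label_from_score(s))
```

A symmetric band around zero is only one possible design; a corpus builder could equally use asymmetric thresholds or labels supplied directly by the source metadata.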
An approach for generating an Arabic concept-based sentiment lexicon is proposed: the recently released English SenticNet_v4 is translated into Arabic, producing Ar-SenticNet, which contains a total of 48k Arabic concepts. To extract concepts from an Arabic sentence, a rule-based concept extraction algorithm called the semantic parser is proposed and implemented; it generates the candidate concept list for an Arabic sentence. Different types of feature extraction and representation techniques are also presented and used to build the concept-based sentence-level Arabic sentiment analysis system. To build the decision model of this system, comprehensive comparative experiments were carried out using a variety of classification methods and classifier fusion models, together with different combinations of the proposed feature sets. The experimental results show that, for the proposed machine learning based document-level Arabic sentiment analysis system, the best performance is achieved by the SVM-HMM classifier fusion model, with an F-score of 92.35%, and by the SVR regression model, with an RMSE of 0.183. For the proposed concept-based sentence-level Arabic sentiment analysis system, the best performance is achieved by the SVM-LR classifier fusion model, with an F-score of 93.92%, and by the SVM regression model, with an RMSE of 0.078.
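For readers unfamiliar with the two reported metrics, the sketch below computes RMSE for predicted rating scores and an F-score for one class using the standard definitions. The abstract does not say which F-score variant or averaging scheme was used, so the single-class F1 shown here is an assumption, and the function names and sample values are illustrative only.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between true and predicted scores."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def f1_score(y_true: Sequence[str], y_pred: Sequence[str], positive: str) -> float:
    """F1 (harmonic mean of precision and recall) for one class."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must be of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up values, not results from the thesis.
    print(rmse([1.0, -1.0], [0.5, -0.5]))
    print(f1_score(["pos", "pos", "neg", "neg"],
                   ["pos", "neg", "pos", "neg"], positive="pos"))
```

Lower RMSE indicates predicted scores closer to the reference ratings, while a higher F-score indicates a better balance of precision and recall for the classified categories.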