Investigation of Imbalance Problem Effects on Text Categorization
Abstract
Text classification is the task of assigning a document to one or more predefined categories based on an inductive model. In general, machine learning algorithms assume that datasets have a nearly uniform class distribution. When trained on imbalanced datasets, however, learning methods tend to perform poorly on the minor categories. In multi-class classification, the major categories are the classes with the largest number of documents, and the minor categories are the classes with the smallest number of documents. Text classification can therefore be strongly affected by the class imbalance problem. In this study, we tackle this problem using a category-based term weighting approach in combination with an adaptive framework and machine learning algorithms.
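To make the notions of major and minor categories concrete, the following minimal sketch counts documents per class in a labeled corpus and reports the largest and smallest classes. The function names, the toy labels, and the imbalance ratio (largest class size divided by smallest) are illustrative assumptions and are not taken from the study itself.

```python
from collections import Counter


def major_and_minor_categories(labels):
    """Return (major, minor) category lists by document count.

    Hypothetical helper: 'major' holds the classes with the most documents,
    'minor' the classes with the fewest. Ties are returned together.
    """
    counts = Counter(labels)
    highest = max(counts.values())
    lowest = min(counts.values())
    major = sorted(c for c, n in counts.items() if n == highest)
    minor = sorted(c for c, n in counts.items() if n == lowest)
    return major, minor


def imbalance_ratio(labels):
    """Size of the largest class divided by the size of the smallest class."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Toy label list (assumed for illustration): one label per document.
labels = ["sports"] * 6 + ["politics"] * 3 + ["science"] * 1

major, minor = major_and_minor_categories(labels)
ratio = imbalance_ratio(labels)
print(major, minor, ratio)
```

A ratio close to 1 indicates a nearly uniform class distribution, while larger values indicate the kind of imbalance discussed above.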