A Novel Multivariate Discretization Algorithm Using Dynamic Programming
Erdoğan, Ali Burak
Discretization is the task of converting quantitative (continuous) numerical data into qualitative (categorical) data by assigning values to non-overlapping intervals. It is an important step for reducing data complexity in data mining and exploratory data analysis. Many methods provide discretization schemes for continuous attributes, such as equal-width, equal-frequency, and the minimum description length principle (MDLP). However, these methods ignore the multivariate nature of the dataset and discretize each feature in isolation, which loses information about the correlations between attributes. Moreover, unlabeled data cannot be discretized with supervised methods (e.g., MDLP) that rely on class labels. We propose a new technique for unsupervised, multivariate, global, and static discretization: a discretizer based on information entropy that employs a constrained shortest-path algorithm. We test our technique on manually crafted, randomized synthetic datasets as well as on well-known real datasets. We show that our approach provides a more meaningful discretization in these test cases, which may allow otherwise hidden intervals to be retrieved in exploratory data analysis tasks. In addition, classification accuracy on real datasets generally improves with our method, unlike with the univariate benchmark methods. Hence, our method may help achieve better accuracy on classification tasks.
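For readers unfamiliar with the univariate baselines named in the abstract, the following is a minimal Python sketch of equal-width and equal-frequency discretization of a single attribute. It is an illustration only, not the method proposed in this work. The function names (`equal_width_edges`, `equal_frequency_edges`, `assign_bins`), the sample data, and the convention that each bin includes its lower edge are assumptions made for this example.

```python
from bisect import bisect_right


def equal_width_edges(values, k):
    """Return the k-1 interior cut points that split [min, max] into k equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]


def equal_frequency_edges(values, k):
    """Return k-1 interior cut points so that each bin holds roughly len(values)/k points."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]


def assign_bins(values, edges):
    """Map each value to a bin index; a value equal to a cut point goes to the upper bin."""
    return [bisect_right(edges, v) for v in values]


# Hypothetical single attribute with one outlier.
data = [1, 2, 3, 4, 5, 6, 7, 8, 100]

ew = assign_bins(data, equal_width_edges(data, 3))
ef = assign_bins(data, equal_frequency_edges(data, 3))

print(ew)  # [0, 0, 0, 0, 0, 0, 0, 0, 2]
print(ef)  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

In this toy example the outlier stretches the equal-width bins so that almost all points fall into the first interval, while equal-frequency spreads the points evenly. Both schemes look at one attribute at a time, which is the limitation the abstract points out.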