A Computationally Efficient Heuristic Algorithm For Piecewise Linear Regression
Özet
Regression analysis is a method used to model the relationship between the dependent variables and the independent variable in the data, and to predict the dependent variable from the independent variables in the future data. Piecewise linear regression (PLR) is a powerful approach used in regression analysis. PLR models the data with multiple linear regression functions. Thus, it can model non-linear data as well as retain the interpretability of linear regression. Interpretability has recently become a hot topic for machine learning. In application areas such as finance and health, it is important for a model not only provide a good prediction, but also be interpretable and/or verifiable by domain experts. In this respect, we think PLR is a promising approach.
In this study, we define PLR problem as in which data is partitioned into intervals on a predetermined dimension and each interval is represented by a unique multivariate linear regression. We offer a method by adopting heuristic approaches and aim for a solution with practically acceptable computational efficiency even in large data sets. The proposed method is compared with Decision Tree, Random Forest, XGBoost, XGBoost with Random Forest learners, Support Vector Machines, K-Nearest Neighbors, Artifical Neural Network and Multivariate Adaptive Regression Splines algorithms. Several synthetic and real-world datasets containing moderate and large number of observations are used in the experiments. Synthetic data are particularly targeted by the proposed approach because they are generated in a way to include structural shifts. The results reveal that our method remains competitive with the well-known machine learning algorithms and outperforms especially in the synthetic dataset instances. Our method is also compared with a mathematical programming-based heuristic method and it is clearly observed that the proposed method provides better scores. When we examine at the computational efficiency results of the proposed method, we observe that even the largest datasets (containing 100000 observations) have computation times expressed in milliseconds. Overall, the results show that PLR can be an effective method, especially when considered the interpretability property it holds.