Machine Learning Algorithms Used in the Analysis of Longitudinal Data
Abstract
Data obtained by following the same units over time and measuring them repeatedly are called "longitudinal data". Longitudinal data collected in fields such as medicine, psychology, sociology and environmental science are analysed with dedicated statistical methods such as Mixed Effects Models rather than with time series or classical regression analyses.
Machine learning algorithms, which have grown increasingly popular in recent years, are now available for longitudinal datasets as packages in software such as R and Python. In the analysis of longitudinal data, these algorithms are used to estimate the fixed-effects part of Mixed Effects Models, and they can handle different response types such as categorical outcomes, quantitative outcomes and survival times. Furthermore, they can work with variables of various scales or distributions without distributional assumptions, and they are suitable for high-dimensional datasets in which the number of explanatory variables exceeds the number of observations.
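As a brief illustration of what estimating the fixed-effects part with a machine learning algorithm can look like, the general form used by tree- and forest-based mixed effects methods may be written as follows. The notation (f, X_i, Z_i, b_i, D, R_i) is assumed here for illustration and is not taken from the thesis itself.

```latex
% Assumed notation: y_i is the response vector of unit i, X_i its fixed-effects
% covariates, Z_i its random-effects design matrix.
\begin{align}
  y_i &= f(X_i) + Z_i b_i + \varepsilon_i, \qquad i = 1, \dots, n, \\
  b_i &\sim \mathcal{N}(0, D), \qquad \varepsilon_i \sim \mathcal{N}(0, R_i).
\end{align}
% In a linear mixed effects model, f(X_i) = X_i \beta; in the machine learning
% variants, f is instead estimated by a tree, a forest or a boosting ensemble.
```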
In this thesis, a balanced longitudinal dataset is analysed both with Mixed Effects Models as statistical models and with machine learning algorithms, a branch of artificial intelligence. The dataset was built from administrative records compiled at vehicle inspection stations for motor vehicles registered in Türkiye, and it contains measurements at 5 time points for 1569 vehicles that came for inspection regularly every two years between 2013 and 2023. The effects of the explanatory variables year, vehicle type, fuel type and purpose of use on the distances travelled by the vehicles over the years, a response with a right-skewed distribution, are examined by building Mixed Effects Models with both statistical and machine learning methods. Generalized Linear Mixed Effects Models (GLMM) were applied as the statistical method, and Mixed Effects Regression Tree/Random Forest, Random Effects Expectation Maximisation Tree/Forest, GLMM Tree and Gaussian Process Boosting were applied as the machine learning methods. The models were fitted to the longitudinal dataset under different link functions and covariance structures and compared using performance evaluation criteria. The study concludes that the Mixed Effects Random Forest algorithm with an AR(1) variance-covariance structure is the best of all the statistical and machine learning models according to the MSE, RMSE and MAE criteria.
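To make the AR(1) variance-covariance structure and the MSE, RMSE and MAE criteria concrete, the following is a minimal Python sketch. It is not the code used in the thesis: the function names, the values of `sigma2` and `rho`, and the sample distances are all hypothetical and chosen only for illustration.

```python
import math


def ar1_covariance(n_times, sigma2, rho):
    """Build an n_times x n_times AR(1) covariance matrix.

    Entry (j, k) equals sigma2 * rho ** |j - k|, so the correlation between
    two measurements decays with the number of time steps between them.
    """
    return [
        [sigma2 * rho ** abs(j - k) for k in range(n_times)]
        for j in range(n_times)
    ]


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical example: 5 time points, as in a balanced design with
# five measurements per unit; sigma2 and rho are made-up values.
cov = ar1_covariance(5, sigma2=1.0, rho=0.5)

# Hypothetical observed and predicted distances (arbitrary units).
y_true = [12.0, 15.0, 9.0, 20.0, 11.0]
y_pred = [10.0, 15.0, 12.0, 18.0, 11.0]

print("AR(1) first row:", cov[0])
print("MSE :", mse(y_true, y_pred))
print("RMSE:", rmse(y_true, y_pred))
print("MAE :", mae(y_true, y_pred))
```

Lower values of all three criteria indicate better predictive performance. RMSE and MAE are expressed in the units of the response, which makes them easier to interpret for a quantity such as distance travelled.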