Comparison of the Performance of Data Mining Classification Methods According to Dependent Variable Prevalence, Sample Size, and the Correlation Structure Among Independent Variables
Abstract
Decision Trees, Bayesian Networks, and Support Vector Machines are the most commonly used statistical and data mining classification methods in the literature and in practice. When these methods are applied, important factors that affect model success are often ignored: the measurement level of the independent variables (continuous, discrete, etc.), their distribution (symmetric, skewed, etc.), the degree of correlation among them (low, medium, or strong), and the sample size. This study therefore uses a simulation to compare how Decision Trees, Bayesian Networks, and Support Vector Machines perform under different structures of the dependent and independent variables. A total of 243 simulation scenarios were obtained by crossing five factors at three levels each: the degree of correlation between independent variables, the number of independent variables in the model, the sample size, the degree of correlation between the dependent and independent variables, and the prevalence of the dependent variable. Each scenario was repeated 1000 times; in each repetition, the three classification methods were applied and compared on their model success criteria. The thesis concludes with general suggestions for researchers on which classification method to use or avoid under different structures of dependent and independent variables.
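The scenario count follows from a full factorial design: five factors at three levels each give 3^5 = 243 combinations. A minimal sketch of that enumeration is shown below. The factor names paraphrase the abstract, and the level labels ("low", "medium", "high") are placeholders, since the abstract does not state the actual values used for any factor.

```python
from itertools import product

# Placeholder level labels: the abstract gives only the number of levels (three),
# not the concrete values chosen for each factor.
LEVELS = ("low", "medium", "high")

# Factor names paraphrase the abstract; they are illustrative identifiers.
FACTORS = (
    "correlation_between_independent_variables",
    "number_of_independent_variables",
    "sample_size",
    "correlation_between_dependent_and_independent_variables",
    "prevalence_of_dependent_variable",
)


def build_scenarios():
    """Return every combination of factor levels as a list of dicts."""
    return [
        dict(zip(FACTORS, combo))
        for combo in product(LEVELS, repeat=len(FACTORS))
    ]


scenarios = build_scenarios()
print(len(scenarios))  # 3 ** 5 = 243
```

Under this sketch, each dictionary would parameterize one scenario, which the study then repeats 1000 times with all three classifiers.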