Mikrodizilim Gen İfade Çalışmalarında Genelleştirme Yöntemlerinin Regresyon Modelleri Üzerine Etkisi.
Abstract
The presence of thousands of gene data belonging to a few number of patients in genetic researches leads to problems in the use of classical statistical methods (linear regression analysis etc.). However, analysis of large number of genes in microarray gene expression studies simultaneously has become possible recently by using data mining methods such as support vector machine, decision tree and boosted tree. In this study, prediction performances of these methods which don't require assumptions about the data structure and can model a large number of predictors were examined on gene data. One of the basic steps for analyses which were performed on gene expression data is generalization of models of analysis. If an independent test data is not available, the approaches such as resampling the original data should be used to estimate the accuracy of prediction. Another purpose of the study is to compare the effect of bootstrap and cross validation generalization methods on model performances of support vector regression and regression trees. Two different Monte Carlo simulations were carried out for performance comparison of regression models with generalization methods. Overall, bootstrap has given more optimistic performance than cross validation. The tools that are used in the development of prediction performance of a given model building technique are model aggregating (ensemble) methods. In this study, the performances of bagging and boosting methods were also compared on the examined regression methods. Bagging has provided the improvement of regression tree for datasets having at least a number of 25 observations, "i.e." n≥25. The application of real gene data has shown consistent results with the simulation study.