New Approach to Unsupervised Based Classification on Microarray Data
Abstract
Genetic studies have become an important part of medical research in recent years and are now essential for developing personalized treatment options and discovering new drugs. Most of these studies focus on obtaining gene expression data, and various methods have been developed to analyze such data. The main difficulty in the analysis is that the data are high dimensional: expression levels are measured for thousands of genes in only a small number of individuals. Classical statistical methods are not suitable for such data because the data do not satisfy the assumptions those methods require. For this reason, data mining methods are used for the analyses. In the classical data mining approach, the dimension of the high-dimensional data is first reduced using Principal Component Analysis, Independent Component Analysis, or Factor Analysis, and then a method for classification, estimation, or clustering is applied. One shortcoming of this approach is that the factors of the reduced data can be similar to one another, and this thesis proposes a solution to that problem. In this context, the dimension of the gene expression data was first reduced and factors were obtained, and these structures were then analyzed with Random Forest, one of the most widely used tree-based classification methods. The results were compared with those of the approach proposed in this thesis, in which the loadings obtained by dimension reduction are used first, the factors are then clustered with the Kohonen Self-Organizing Map method, and the clustered factors are passed to the Random Forest algorithm. A major advantage of the proposed approach is that 1000 sub-samples of the clustered factors, selected by bootstrap sampling (sampling with replacement), are sent to the Random Forest algorithm.
In this way, data that could not be factorized were made more homogeneous by cluster analysis, and the random selection criterion of the Random Forest method was further strengthened. The performance measures used to compare the approaches are the true classification rate, F-score, precision, and recall. Applications were carried out on two types of data: 15 publicly available data sets from the Gene Expression Omnibus database and 18 artificial data sets created for specific scenarios. The proposed method improved the true classification rate, the most essential comparison measure, by an average of 17.8% and 11.68% in data with 2 and 3 classes, respectively, and by an average of 14.5% in artificial data sets with 3 dimensions, 3 classes, and 50 individuals. Based on these findings, the proposed method increases classification performance especially in data with fewer subjects and classes. Software based on the R programming language that makes all of these analyses easier to perform was developed within this thesis, so that researchers can carry out their own analyses.
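The four performance measures named above can be illustrated with a minimal sketch. It is written in Python rather than R and is not the software developed in the thesis. The function name `classification_metrics`, the example labels, and the use of macro averaging (an unweighted mean over classes) for the multi-class precision, recall, and F-score are all assumptions made here for illustration; the abstract does not state how the per-class values were combined.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return the true classification rate (accuracy) and the
    macro-averaged precision, recall, and F-score.

    Macro averaging is an assumption of this sketch, not a detail
    taken from the thesis.
    """
    labels = sorted(set(y_true) | set(y_pred))
    # Count every (true label, predicted label) pair: a confusion matrix.
    pairs = Counter(zip(y_true, y_pred))

    precisions, recalls, f_scores = [], [], []
    for label in labels:
        tp = pairs[(label, label)]
        fp = sum(pairs[(t, label)] for t in labels if t != label)
        fn = sum(pairs[(label, p)] for p in labels if p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f_score = 2 * precision * recall / (precision + recall)
        else:
            f_score = 0.0
        precisions.append(precision)
        recalls.append(recall)
        f_scores.append(f_score)

    n_classes = len(labels)
    return {
        "accuracy": sum(pairs[(l, l)] for l in labels) / len(y_true),
        "precision": sum(precisions) / n_classes,
        "recall": sum(recalls) / n_classes,
        "f_score": sum(f_scores) / n_classes,
    }


# Hypothetical labels for six individuals in three classes.
y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "b", "c"]
metrics = classification_metrics(y_true, y_pred)
```

Here one of the six individuals is misclassified, so the true classification rate is 5/6. Precision and recall differ per class (class "a" loses recall, class "b" loses precision), which is why the abstract reports them alongside the overall rate.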