A New Feature Selection Method for the Classification of Gene Expression Data
Abstract
The emergence of DNA microarray datasets opened up an important research area for both bioinformatics and machine learning. This type of data, known as gene expression data, is obtained from tissue or cell samples and is used to gather information that may help in diagnosing disease or distinguishing specific tumor types. Its main difficulty is that it contains measurements for thousands of genes, whereas the number of samples is limited to a few dozen. This imbalance makes accurate classification difficult.
Applying classification methods effectively to gene expression data, with its thousands of genes and small sample sizes, plays a vital role in the diagnosis and treatment of illnesses. For datasets like these, it is helpful to use feature selection, a pre-processing step that improves classification performance by selecting the most relevant and informative features. The literature groups feature selection methods into three categories: filter, wrapper, and embedded methods. Filter methods are statistical feature selection methods that aim to select the best feature subsets according to a given evaluation measure, independently of the classification algorithm.
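The filter idea described above can be illustrated with a minimal sketch. It scores each gene on its own, without consulting any classifier, and keeps the top-ranked genes. The Fisher score used here is a generic stand-in criterion and is not claimed to be one of the methods compared in this thesis; the function names and the toy data are hypothetical.

```python
from statistics import mean, pvariance


def fisher_score(values, labels):
    """Between-class scatter divided by within-class scatter for one feature."""
    overall = mean(values)
    between = 0.0
    within = 0.0
    for c in sorted(set(labels)):
        group = [v for v, y in zip(values, labels) if y == c]
        between += len(group) * (mean(group) - overall) ** 2
        within += len(group) * pvariance(group)
    return between / within if within > 0 else float("inf")


def select_top_k(X, labels, k):
    """Rank the columns of X by score and return the indices of the best k."""
    n_features = len(X[0])
    scores = [fisher_score([row[j] for row in X], labels)
              for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


# Hypothetical toy data: 6 samples, 3 "genes".
# Gene 0 separates the two classes; genes 1 and 2 do not.
X = [
    [1.0, 5.0, 3.1],
    [1.2, 4.0, 2.9],
    [0.9, 6.0, 3.0],
    [3.0, 5.1, 3.0],
    [3.2, 4.2, 3.1],
    [2.9, 5.9, 2.9],
]
y = [0, 0, 0, 1, 1, 1]

selected = select_top_k(X, y, k=1)
print(selected)
```

Because the score depends only on the data and the class labels, the same ranking can be reused with any classifier, which is the property that distinguishes filter methods from wrapper and embedded methods.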
In this thesis, a new filter method for feature selection is proposed, namely “Feature Selection Algorithm based on Effective Ranges (FSAER)”. The proposed method aims to improve two existing methods in the literature, “Effective Range based Gene Selection (ERGS)” and “Improved Feature Selection based on Effective Range (IFSER)”. ERGS and IFSER assign equal weights to all discrete ranges. FSAER retains the advantages of ERGS and IFSER and, in addition, defines a new total area that takes the discrete ranges into account.
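The abstract does not give the formulas behind effective ranges, so the following is only a rough sketch of the general idea: build an interval around each class mean for a gene, then prefer genes whose class intervals overlap little. The interval definition (mean plus or minus a prior-weighted multiple of the standard deviation), the constant `GAMMA`, the two-class restriction, and all function names are assumptions made for illustration. This is not the FSAER algorithm, nor an exact implementation of ERGS or IFSER.

```python
from statistics import mean, pstdev

GAMMA = 1.732  # assumed scaling constant, used here only for illustration


def effective_range(values, prior, gamma=GAMMA):
    """Interval around the class mean; narrower for classes with a larger prior."""
    mu = mean(values)
    half_width = (1.0 - prior) * gamma * pstdev(values)
    return (mu - half_width, mu + half_width)


def overlap_score(feature_values, labels):
    """Return 1 - (overlap length / total span) for a two-class feature.

    A score near 1 means the two class ranges barely overlap.
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    n = len(labels)
    ranges = []
    for c in classes:
        group = [v for v, y in zip(feature_values, labels) if y == c]
        ranges.append(effective_range(group, prior=len(group) / n))
    (lo_a, hi_a), (lo_b, hi_b) = ranges
    overlap = max(0.0, min(hi_a, hi_b) - max(lo_a, lo_b))
    span = max(hi_a, hi_b) - min(lo_a, lo_b)
    return 1.0 - overlap / span if span > 0 else 0.0


# Hypothetical expression values for two genes over six samples.
labels = [0, 0, 0, 1, 1, 1]
separated_gene = [1.0, 1.2, 0.9, 3.0, 3.2, 2.9]
mixed_gene = [5.0, 4.0, 6.0, 5.1, 4.2, 5.9]

score_separated = overlap_score(separated_gene, labels)
score_mixed = overlap_score(mixed_gene, labels)
print(score_separated, score_mixed)
```

In this toy case the gene whose class ranges are disjoint receives a higher score than the gene whose ranges largely coincide, so a ranking by this score would place it first.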
To validate the effectiveness of the proposed algorithm, FSAER and five existing filter methods are applied to six open-access gene expression datasets. Several classification methods (support vector machine, Naive Bayes, and k-nearest neighbor) are then used to obtain the classification accuracies of the selected gene subsets. The results are examined, and FSAER is found to be highly effective in terms of classification accuracy compared with the other methods.