Development and Application of Novel Machine Learning Approaches for RNA-Seq Data Classification
Abstract
RNA-Seq is a recent and efficient technique that uses next-generation sequencing technology to characterize and quantify transcriptomes. It has revolutionized gene-expression profiling, with major advantages over microarrays: (i) it provides less noisy data, (ii) it detects novel transcripts and isoforms, and (iii) it does not require prearranged transcripts of interest. One important task with gene-expression data is to identify a small subset of genes and classify samples for diagnostic purposes, particularly for cancers. Microarray-based classifiers are not directly applicable because RNA-Seq data are discrete counts. Overdispersion is another problem, and it requires careful modeling of the mean-variance relationship of RNA-Seq data. The voom method estimates the mean-variance relationship of the log-counts and provides a precision weight for each observation to be used in further analysis. In this study, we developed the voomNSC method, which combines voom with nearest shrunken centroids (NSC), a powerful microarray classifier, for gene-expression based classification. VoomNSC is a sparse classifier that models the mean-variance relationship using voom and incorporates the voom outputs (i.e. log-cpm values and precision weights) into NSC through weighted statistics. We also provide two non-sparse classifiers, voomDLDA and voomDQDA, which extend diagonal linear and diagonal quadratic discriminant classifiers to RNA-Seq classification. A comprehensive simulation study was designed and four real datasets were used to assess the performance of the developed approaches. The results show that voomNSC is the sparsest classifier and, along with power-transformed Poisson linear discriminant analysis and rlog-transformed support vector machines and random forests, yields the most accurate results.
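To make the idea of feeding log-cpm values and precision weights into NSC through weighted statistics more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the implementation developed in this study: the function names are hypothetical, the precision weights are taken as a given input rather than estimated from a fitted mean-variance trend as voom does, and the standardization omits the class-size correction factor of the standard NSC formulation.

```python
import numpy as np

def log_cpm(counts):
    """voom-style log2 counts-per-million; counts is a samples x genes array."""
    counts = np.asarray(counts, dtype=float)
    lib_size = counts.sum(axis=1, keepdims=True)
    # Offsets of 0.5 and 1 keep the logarithm finite for zero counts.
    return np.log2((counts + 0.5) / (lib_size + 1.0) * 1e6)

def soft_threshold(d, delta):
    """Shrink each value toward zero by delta, setting small values to exactly 0."""
    return np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)

def weighted_shrunken_centroids(y, w, labels, delta):
    """Simplified weighted NSC (assumption: w holds positive precision weights).

    y, w   : samples x genes arrays (log-cpm values and their weights)
    labels : class label per sample
    delta  : shrinkage threshold; larger values give sparser models
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)

    # Weighted overall and per-class means for every gene.
    overall = (w * y).sum(axis=0) / w.sum(axis=0)
    centroids = np.vstack([
        (w[labels == k] * y[labels == k]).sum(axis=0) / w[labels == k].sum(axis=0)
        for k in classes
    ])

    # Weighted pooled within-class standard deviation per gene.
    resid2 = np.zeros(y.shape[1])
    for i, k in enumerate(classes):
        mask = labels == k
        resid2 += (w[mask] * (y[mask] - centroids[i]) ** 2).sum(axis=0)
    pooled_sd = np.sqrt(resid2 / w.sum(axis=0))
    s0 = np.median(pooled_sd)  # stabilizing constant, as in standard NSC

    # Standardized class differences, shrunk by soft thresholding.
    d = (centroids - overall) / (pooled_sd + s0)
    d_shrunk = soft_threshold(d, delta)
    shrunken = overall + (pooled_sd + s0) * d_shrunk

    # A gene is kept if at least one class retains a nonzero difference.
    selected = np.any(d_shrunk != 0, axis=0)
    return classes, shrunken, selected
```

In this sketch, raising `delta` pulls every class centroid toward the overall centroid, and genes whose shrunken differences are zero in all classes drop out of the model, which is where the sparsity comes from.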
In conclusion, voomNSC is a fast, accurate and sparse classifier that can be applied successfully to diagnostic biomarker discovery and classification problems in medicine. The algorithm can also be used in other transcriptomics studies, such as distinguishing developmental differences, cellular responses to stressors, or diverse phenotypes. An interactive web application is freely available at