End-to-End Speech Recognition Model: Experiments on Turkish
Abstract
For decades, the main components of Automatic Speech Recognition (ASR) systems have been the pronunciation dictionary and Hidden Markov Models (HMMs). HMMs assume conditional independence between their outputs, and building the pronunciation dictionary is a tedious and time-consuming process. Additionally, these components are trained independently of one another, so there is a disconnect between acoustic model accuracy and the word error rate (WER) of the overall recognizer. Connectionist Temporal Classification (CTC) character models address some of these issues by jointly learning the pronunciation and acoustic models as a single model. However, both HMM and CTC models suffer from the conditional independence assumption and rely heavily on a sufficiently large language model during decoding.
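To make the independence assumption concrete (the notation below is ours, not taken from the abstract): a CTC model scores a frame-level label sequence $\pi$ for acoustic input $x$ as

$$P(\pi \mid x) = \prod_{t=1}^{T} P(\pi_t \mid x),$$

so each output label depends on the acoustics alone, whereas an attention-based decoder conditions every character $y_u$ on all previously emitted characters,

$$P(y \mid x) = \prod_{u=1}^{U} P(y_u \mid y_{<u}, x),$$

which is what lets it absorb the language model instead of delegating it to decoding.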
In this thesis, we revisit the traditional ASR paradigm and focus on the limitations of HMM- and CTC-based speech recognition models. We propose an attention-based neural approach to ASR and directly optimize the transcription error rate on Turkish speech. The end-to-end recurrent neural network jointly learns all the main components of a speech recognition system: the pronunciation dictionary, the language model, and the acoustic model. We use transfer learning in our end-to-end architecture in order to train a sufficiently good acoustic model from a limited amount of transcribed speech data.
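The following is a minimal sketch of such an attention-based encoder-decoder; the layer sizes, single-layer GRUs, and dot-product attention are illustrative assumptions, not the thesis architecture.

```python
import torch
import torch.nn as nn

class AttentionASR(nn.Module):
    """Minimal attention-based encoder-decoder for character-level ASR.

    A sketch under assumed hyperparameters: 40-dim filterbank inputs,
    a character inventory of 35 symbols, and hidden size 256.
    """

    def __init__(self, n_mels=40, n_chars=35, hidden=256):
        super().__init__()
        # Acoustic encoder: maps filterbank frames to hidden states.
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True,
                              bidirectional=True)
        self.embed = nn.Embedding(n_chars, hidden)
        # The decoder jointly plays the role of the pronunciation
        # dictionary and the language model.
        self.decoder = nn.GRUCell(hidden + 2 * hidden, hidden)
        self.attn_proj = nn.Linear(hidden, 2 * hidden)
        self.out = nn.Linear(hidden, n_chars)

    def forward(self, feats, targets):
        # feats: (batch, frames, n_mels); targets: (batch, chars)
        enc, _ = self.encoder(feats)                   # (B, T, 2H)
        B = feats.size(0)
        state = feats.new_zeros(B, self.decoder.hidden_size)
        logits = []
        for u in range(targets.size(1)):
            # Dot-product attention over encoder states.
            query = self.attn_proj(state).unsqueeze(1)  # (B, 1, 2H)
            scores = (query * enc).sum(-1)              # (B, T)
            context = (scores.softmax(-1).unsqueeze(-1) * enc).sum(1)
            # Teacher forcing: condition on the previous gold character.
            inp = torch.cat([self.embed(targets[:, u]), context], dim=-1)
            state = self.decoder(inp, state)
            logits.append(self.out(state))
        return torch.stack(logits, dim=1)               # (B, U, n_chars)

# Usage: teacher-forced cross-entropy on character targets, so the model
# is trained directly on transcriptions with no separate lexicon.
model = AttentionASR()
feats = torch.randn(2, 100, 40)          # two utterances of 100 frames
targets = torch.randint(0, 35, (2, 12))  # character targets
logits = model(feats, targets)           # (2, 12, 35)
loss = nn.functional.cross_entropy(logits.reshape(-1, 35),
                                   targets.reshape(-1))
```

Under the transfer-learning setup the abstract mentions, one plausible reading is that the encoder is initialized from a model pretrained on a larger corpus and then fine-tuned on the limited Turkish data; this is our assumption, as the abstract does not spell out the mechanism.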