Automatic speech recognition without linguistic knowledge?
Description
Building a state-of-the-art speech recognizer requires, besides a large corpus of transcribed speech, several additional ingredients, such as a phoneme inventory, a lexicon, and a language model. These ingredients carry linguistic constraints that make training more feasible and more sample-efficient. Recently, there has been a push towards building speech recognizers end to end, i.e., using few or even none of the aforementioned ingredients. This raises several fundamental questions: Is it possible to train a speech recognizer without any linguistic constraints? How much data is needed to make that possible? Which linguistic constraints are necessary for building a speech recognizer? In this talk, I will review the inner workings of conventional and end-to-end speech recognizers and, to help answer some of these questions, present empirical results on training end-to-end speech recognizers without any linguistic constraints.