In this post I will summarize the first model for the speech synthesis task. Let me start with a high-level description of the task. We'll be working with the TIMIT dataset, which contains a set of utterances. Each utterance is described by a sequence of words and their corresponding phonemes, aligned with a sequence of acoustic samples (the speech waveform). The sequences are aligned in the sense that we know when each word/phoneme starts and ends in time, so we can associate each one with a sub-sequence of acoustic samples. We also have speaker information for each utterance (age, dialect, etc.). For now, I will work with only the phonemes and ignore the words; I might incorporate the words into the input in the future. In addition, we can think of the output sequence of acoustic samples as a sequence of frames, i.e. sub-sequences spanning 10-20 ms of acoustic samples. Frames can be represented either in raw form or in one of the representations described in this post.
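To make the frame view concrete, here is a minimal sketch of slicing a raw waveform into non-overlapping frames, assuming TIMIT's 16 kHz sampling rate (`frame_signal` is a hypothetical helper for illustration, not part of any dataset tooling):

```python
import numpy as np

def frame_signal(samples, sample_rate=16000, frame_ms=20):
    """Split a raw waveform into consecutive, non-overlapping frames.

    frame_ms=20 gives 320 samples per frame at TIMIT's 16 kHz rate.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len          # drop any trailing remainder
    return samples[:n_frames * frame_len].reshape(n_frames, frame_len)

# one second of audio -> 50 frames of 320 samples each
frames = frame_signal(np.zeros(16000))
print(frames.shape)  # (50, 320)
```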
Ok, so we want to map a sequence of phonemes, say p_1, ..., p_n, into a sequence of frames f_1, ..., f_m. Notice that the two sequences have different lengths; in fact m > n, so each phoneme produces multiple frames, and even this number varies for each phoneme depending on its context.
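A small illustration of why m > n, using a TIMIT-style alignment of phonemes to sample ranges (the labels and sample indices below are made up for illustration, not taken from a real utterance):

```python
# Hypothetical TIMIT-style alignment: (phoneme, start_sample, end_sample).
alignment = [("sh", 0, 2400), ("iy", 2400, 4000), ("hh", 4000, 5600)]
frame_len = 320  # 20 ms at 16 kHz

# Each phoneme spans several frames, and the count varies by duration,
# so n phonemes align with m > n frames.
frames_per_phoneme = {p: (end - start) // frame_len
                      for p, start, end in alignment}
print(frames_per_phoneme)  # {'sh': 7, 'iy': 5, 'hh': 5}
```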
We can think of solving this problem by building a model that does the acoustic frame prediction one frame at a time. That is, at each time step, we ask it to produce one frame based on the current phoneme, or on a window of phonemes that we think collaborated in producing this frame. However, when synthesizing speech, we don't know a priori how many frames we have to produce. This means we need to make our model learn that, too. I will describe here one way of doing that.
We want to learn two functions: f_t = F(x_t, f_{t-k}, ..., f_{t-1}) and a_t = G(x_t, f_{t-k}, ..., f_{t-1}). First, the input to the two functions is the same: x_t is the window of phonemes of length w we're using for producing one frame, and f_{t-k}, ..., f_{t-1} are the k previous output frames, which help us predict the next frame f_t. (I have ignored other inputs here, like speaker information, for clarity.) Now, the value a_t helps us decide whether we want to advance to the next window, i.e. shift our window by one phoneme. A simple approach is to make a_t a binary value that tells us whether to advance or not. At each time step, we look at the value of a_t and decide whether to move the window, then we produce the current frame f_t.
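The generation loop could be sketched roughly as below. `predict_frame` and `predict_advance` are hypothetical stand-ins for the two learned functions (the toy versions here just emit two frames per window before advancing); the exact ordering of the emit and advance steps is one of several reasonable choices:

```python
import numpy as np

def synthesize(phonemes, predict_frame, predict_advance,
               window=3, context=4, frame_len=320, max_frames=1000):
    """Emit one frame per step; shift the phoneme window by one whenever
    the advance predictor fires. A sketch of the scheme described above."""
    frames = [np.zeros(frame_len) for _ in range(context)]  # zero-padded history
    pos = 0
    while len(frames) - context < max_frames:
        x = phonemes[pos:pos + window]            # current phoneme window x_t
        y_prev = frames[-context:]                # k previous frames
        frames.append(predict_frame(x, y_prev))   # next frame f_t
        if predict_advance(x, y_prev):            # binary decision a_t
            pos += 1
            if pos + window > len(phonemes):      # no full window left: stop
                break
    return np.array(frames[context:])

# Toy stand-ins: emit exactly two frames per window, then advance.
state = {"since_advance": 0}
def predict_frame(x, y_prev):
    state["since_advance"] += 1
    return np.zeros(320)
def predict_advance(x, y_prev):
    if state["since_advance"] >= 2:
        state["since_advance"] = 0
        return True
    return False

out = synthesize(list("pater"), predict_frame, predict_advance)
print(out.shape)  # (6, 320): three windows over five phonemes, two frames each
```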
My plan was originally to describe a model for the full task in this post, but I decided that it's better to start with a simpler task. Actually, the simplest thing one can start with is just predicting the next acoustic sample given the previous samples. This model could also be helpful as a sub-model in other, more complicated architectures. I will report my current experiment and its results in the following post.
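As a baseline illustration of "predict the next sample given previous samples" (a simple linear autoregressive fit, not the experiment reported in the next post), one could do:

```python
import numpy as np

def fit_ar(samples, k=16):
    """Least-squares fit of coefficients w so that x[t] ~ w . x[t-k:t]."""
    X = np.array([samples[t - k:t] for t in range(k, len(samples))])
    y = samples[k:]
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def predict_next(samples, w):
    """Predict the sample following the given history."""
    return float(np.dot(samples[-len(w):], w))

# Toy check on a pure sine wave, which a linear AR model fits almost exactly.
t = np.arange(2000)
x = np.sin(2 * np.pi * 220 * t / 16000)
w = fit_ar(x, k=16)
true_next = np.sin(2 * np.pi * 220 * 2000 / 16000)
print(abs(predict_next(x, w) - true_next))  # tiny prediction error
```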