## Frame prediction given phoneme window

I will explain here my latests experiments with implementing the MLP model I talked about in this post. The main idea is to implement the function $Y_t$ using an MLP that predicts an acoustic frame (a sequence of acoustic samples), based on the previous frame(s) and a window of phonemes, and another model for implementing $S_t$, which takes the same input but tells us if we have to shift the window of phonemes one step ahead or not.

Training and validation sets

First I will start with explaining the dataset I used for training the model. You can find the code for dataset preparation script here. The input in each training example is composed of a fixed number of acoustic samples (e.g. 600 samples, which is equivalent to 2.5 frames of 240 samples) and one-hot representation of two phonemes corresponding to the current frame (current window of phonemes). The target is just the following sub-frame (e.g. 40 samples). In the experiment below I used data of 10 speakers, with 60 input samples, where each sample is a float in [-1,1], and 39*2 floats in {0,1} for one-hot representation for two phonemes. The target is 40 floats in [-1,1]. This dataset has 221,530 examples, and that’s again only for 10 speakers! It’s clear that we need an efficient (RAM-wise and CPU-wise) way to deal with data that what’s being done by Vincent. Validation set has the only 5 sentences from one speaker.

The MLP for $Y_t$

I used an MLP with 2 hidden layers each has 300 units with tanh activations. The activation function for the output is also tanh. I trained the network using learning rate 0.01 (I am annealing the learning rate linearly to 0 starting after 30 epochs), 100 samples per mini-batch, and L2 regularization term 0.0001.I haven’t done a hyper-parameter optimization on the grid, but I have tried several configurations for the size of the layers and even changed activation of hidden layers to rectifiers (I used GPU to run the experiments quickly), and this is the setting I found to perform best. The following figure shows training and validation error.

The errors are very close and the model is clearly underfitting. I am not sure though why validation error is very close to training error though. The lowest error is 0.000816 and if you convert that into the scale of the original samples it turns out to be ~517.

Generating speech

In order to generate speech using this frame prediction model we really need something to give us a good alignment between phonemes and acoustic samples, and that’s what we want $S_t$ to do. However, we can cheat a little and assume that our $S_t$ is perfect, by taking the true alignment from our dataset. At the end if our frame prediction model doesn’t do a good job using the true alignment, there’s no point of wasting time in learning the alignment. This is what I did. The generation algorithms goes as follows: I take the current frame (in our case 600 samples) with the correct phoneme window, and feed them into the network, which gives us 40 samples, I shift it by 20 samples to the left and assign the first 20 predicted samples to the last 20 places of the shifted frame and consider that the new current frame and so on… Since sentences start with phoneme ‘#h’, I start with a frame of standard normal noise with small standard deviation. The generated signal is shown in blue in the following figure (green is the true speech signal):

Although the result is not impressive, I like that the model is able to capture some variability. The main problem I see here is the generated signal has much higher frequency and much lower amplitude. You can listen to the generated sound here. It sounds more like music! but you can still hear the variability. The code for the speech synthesizer is found here.

What’s next?

There are many directions for improvement. First, I want to train the model with the full dataset, so I need to check out Vincent’s data wrapper. I would like also to add more features to the input, for example the position of the current frame in the phoneme window, also I want to add speaker information. Another very important aspect is looking into better representation of the frames, which might be more robust to errors in prediction.

## Just the next acoustic sample

In this post I will talk about my very first experiment, which is predicting the next acoustic sample given a fixed window of previous samples. The idea here is just to get started in the project and prepare for more serious experiments.

Similar to what Hubert has done, the data I used for this experiment is just a single speaker’s raw wave sequences for each of the 10 sentences. The data was normalized by dividing by the maximum absolute value of the acoustic samples. This makes the data in [-1,1] range. The training and validation examples are constructed by taking sequential frames of length 240 samples (15ms * 16 sample/ms) from wave sequences, where the first 239 are considered as input and the last sample is the target. I took 80% of the data for training, 10% for validation, and 10% for testing. This gave 378856 training examples and 47357 validation and testing examples. I also shuffled the data so we can assume it’s IID. I based my code for dataset preparation on Laurent’s wrapper for TIMIT.

The model is a one hidden layer MLP, with one hidden layer of tanh activations, and one output with also tanh activation (the output is also normalized to [-1,1]). The loss function is mean squared error with L2 regularization term. The code using Theano can be found here.

For this experiment, I used the following values for hyper-parameters:

Learning rate: 0.01, # hidden units: 500, L2 term coefficient: 0.0001, mini-batch size: 1000.

The following figure shows the training and validation errors.

To understand the error we need to convert it into the scale of original acoustic samples. The largest absolute value of samples found in the data is 18102. Since we’re using mean squared error, we have to take the square root of the error and multiply it by 18102. The result for the lowest validation error (0.000218) is 267.27, which means that the average error in the predicted sample is ~267. I would say this is large. We also didn’t check the variance, which might be also large. I wouldn’t expect anything meaningful from this model though as it’s impossible to train a model on a examples of only very short speech signals and expect it to generate any possible signal. We certainly need more features – at least phonemes.

My plan for the following days is to work on the frame level, i.e. predicting next frame from previous frames, and taking phonemes into account. This will be the core of the model I talked about in the previous post, which will also have another component which helps align input phonemes with output frames.

## Initial model for speech synthesis task

In this post I will summarize the first model for the speech synthesis task. I’ll start with giving a high-level description of the task: We’ll be working with the TIMIT dataset, which has a set of utterances, each one is described by a sequence of words and their corresponding phonemes which are aligned with a sequence of acoustic samples (the speech waveform). Those sequences are aligned in the sense that we know when each word/phoneme starts and ends in time, so we can associate them “somehow” with sub-sequences of acoustic samples. We also have for each utterance the info of the speaker (age, dialect, ..etc). For now, I will probably work with only the phonemes and ignore the words. I might incorporate them into the input in the future. In addition, we can think of the output sequence of acoustic samples as a sequence of frames, which are usually sub-sequences spanning 10-20ms time worth of acoustic samples. Those can be either represented in the raw format or in one of the representations described in this post.

Ok, so we want to map a sequence of phonemes, say $X_1, X_2, ... X_n$, into a sequence of frames $Y_1, Y_2, ...Y_m$. Notice that both sequences have different lengths, in fact m > n, so each phoneme produces multiple number of frames, and even this number varies for each phoneme depending on the context of this phoneme.

We can think of solving this problem by building a model that does the acoustic frame prediction one at a time. That is, each time step, we ask it to produce one frame based on the current phoneme, or a window of phonemes that we think collaborated in producing this frame. However, when we’re synthesizing speech, we don’t have a priori the number of frames we have to produce. This means we need to make our model learn that, too. I will describe here one way of doing that.

We want to learn two functions: $S_t = g(Y_{t-1},\dots,Y_{t-k},W_i$) and $Y_t = f(Y_{t-1},\dots,Y_{t-k},W_i$). First, the input to the two functions is the same, where $W_i=[X_i, X_{i+1},\dots,X_{i+w}]$ is a window of phonemes of length $w$ we’re using for producing one frame. $Y_{t-1},\dots,Y_{t-k}$ is previous $k$ output frames, which help us predict the next frame $Y_t$. I have ignored here other inputs, like speaker information for clarity. Now, the value $S_t$ helps us decide whether we want to advance to the next window, i.e. shift our window by one. A simple approach for it is to be a binary value that tells us whether to advance or not. At each time step, we look at the value of $S_t$ and decide whether to move the window or not, then we produce the current frame $Y_t$.

My plan was originally to describe in this post a model for the full task, but I decided that it’s better to start with a simpler task. Actually the simplest thing one can start with is just predicting the next acoustic sample given previous samples, this model could be also helpful as a sub-model for other more complicated architectures, and I will report my current experiment and results in the following post.