- 2 hidden layers, each with 300 units and tanh activations
- initial learning rate MDN: 0.001
- linear annealing of learning rate to 0 starting after 50 epochs
- 128 samples per mini-batch
- 3 components
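
The linear annealing in the list above can be written as a small function. This is a minimal sketch: the total number of epochs is not stated here, so `total_epochs=200` is purely an assumption, and the function name is hypothetical.

```python
def annealed_learning_rate(epoch, initial_lr=0.001, start_epoch=50, total_epochs=200):
    """Constant rate until start_epoch, then linear decay that reaches 0 at total_epochs.

    total_epochs is an assumed value; the post does not say how long training ran.
    """
    if epoch < start_epoch:
        return initial_lr
    remaining = max(total_epochs - epoch, 0)
    return initial_lr * remaining / (total_epochs - start_epoch)
```

With these defaults the rate stays at 0.001 for the first 50 epochs and is halved by epoch 125.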

We can see the Mean NLL decreasing, which means the model is learning. The validation Mean NLL stabilizes after almost 100 epochs. What I am mainly interested in, though, is comparing the same MLP architecture with the MDN. Therefore, I used pretty much the same hyper-parameters for both networks to see if we gain anything just by having a mixture of Gaussians at the output layer. The following plot shows results on the same validation set, using the hyper-parameters listed above.

I was expecting the MDN to perform better than the MLP. However, we can see that the MLP is better than the MDN in terms of both MSE and Mean NLL. The minimum MSE is 0.0222 for the MLP and 0.0324 for the MDN, and the minimum Mean NLL is -1.29 for the MLP and -0.77 for the MDN. This is actually the typical performance pattern in pretty much all the experiments I did on this dataset. To investigate further, I tried varying the number of components and found that performance improves only a little as we add components (for 10 components the minimum Mean NLL reaches -0.91).

With both models I was not able to generate something that sounds like \aa\. However, the following waveform generated by the MDN shows that it was able to capture the periodicity of the \aa\ sound, although it’s still more peaky than a natural signal.

We saw that the MDN doesn’t do better than the MLP on the \aa\ dataset, so it turns out we’re not benefiting from having a multi-modal predictive distribution. To verify this, I performed another set of experiments on a more complicated task, where I used full utterances of one speaker (FCLT0) with the phoneme information (the current and next phonemes, as in the previous experiment). I trained on 9 utterances and validated on 1. The dataset has 402,939 training examples and 70,621 validation examples.

Using the same hyper-parameter settings, I got the following results. Here we see that the MDN beats the MLP in terms of the Mean NLL, but still doesn’t perform better on MSE. This is kind of surprising, as you might think that the MDN has a better model of the data, but it’s probably the variance of sampling from the MDN that’s increasing the error. This is something worth investigating further in the future.
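
For reference, here is a minimal NumPy sketch of how the Mean NLL of a Gaussian mixture output could be computed. It assumes isotropic components (one standard deviation per component), which is a simplification I am choosing for the example and not necessarily what the experiments used. The function name and argument layout are hypothetical.

```python
import numpy as np

def mdn_mean_nll(y, log_pi, mu, log_sigma):
    """Mean negative log-likelihood of targets under a mixture of isotropic Gaussians.

    y:         (N, D) targets
    log_pi:    (N, K) log mixing weights, assumed already normalized per example
    mu:        (N, K, D) component means
    log_sigma: (N, K) log standard deviations (isotropic components: an assumption)
    """
    d = y.shape[1]
    sq_dist = np.sum((y[:, None, :] - mu) ** 2, axis=2)          # (N, K)
    log_norm = -0.5 * d * np.log(2.0 * np.pi) - d * log_sigma    # (N, K)
    log_comp = log_pi + log_norm - 0.5 * sq_dist / np.exp(2.0 * log_sigma)
    # log-sum-exp over components, shifted by the max for numerical stability
    m = log_comp.max(axis=1, keepdims=True)
    log_lik = m[:, 0] + np.log(np.sum(np.exp(log_comp - m), axis=1))
    return -np.mean(log_lik)
```

Because this is the log of a density and not of a probability, the value can go below zero once the predicted standard deviations are small, which is why negative Mean NLL values like the ones reported are possible.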

]]>**Training and validation sets**

First I will explain the dataset I used for training the model. You can find the dataset preparation script here. The input in each training example is composed of a fixed number of acoustic samples (e.g. 600 samples, which is equivalent to 2.5 frames of 240 samples) and a one-hot representation of the two phonemes corresponding to the current frame (the current window of phonemes). The target is just the following sub-frame (e.g. 40 samples). In the experiment below I used data from 10 speakers, with 60 input samples, where each sample is a float in [-1,1], and 39*2 floats in {0,1} for the one-hot representation of the two phonemes. The target is 40 floats in [-1,1]. This dataset has 221,530 examples, and that’s again only for 10 speakers! It’s clear that we need an efficient (RAM-wise and CPU-wise) way to deal with the data, which is what Vincent is working on. The validation set has only 5 sentences from one speaker.
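
To make the layout of one example concrete, here is a minimal sketch of how the input and target vectors could be assembled. It is not the actual preparation script: the function names are hypothetical, the defaults use the 600-sample example from the text, and I am assuming the two phonemes are given as integer indices into the 39 phoneme classes.

```python
import numpy as np

N_PHONEMES = 39  # size of each one-hot phoneme vector, as stated in the text

def one_hot(index, size=N_PHONEMES):
    """Return a vector of zeros with a single 1 at the given index."""
    v = np.zeros(size, dtype=np.float32)
    v[index] = 1.0
    return v

def make_example(wave, start, first_phone, second_phone, n_input=600, n_target=40):
    """Build one (input, target) pair from a waveform already scaled to [-1, 1].

    first_phone and second_phone are integer phoneme indices (an assumption).
    """
    acoustic = wave[start:start + n_input]
    target = wave[start + n_input:start + n_input + n_target]
    x = np.concatenate([acoustic, one_hot(first_phone), one_hot(second_phone)])
    return x.astype(np.float32), target.astype(np.float32)
```

With the defaults above, each input vector has 600 + 39*2 = 678 entries and each target has 40.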

**The MLP for **

I used an MLP with 2 hidden layers, each with 300 units and tanh activations. The activation function for the output is also tanh. I trained the network with a learning rate of 0.01 (annealed linearly to 0 starting after 30 epochs), 100 samples per mini-batch, and an L2 regularization coefficient of 0.0001. I haven’t done a grid search over hyper-parameters, but I have tried several configurations for the size of the layers and even changed the hidden activations to rectifiers (I used a GPU to run the experiments quickly), and this is the setting I found to perform best. The following figure shows training and validation error.

The errors are very close and the model is clearly underfitting. I am not sure, though, why the validation error is so close to the training error. The lowest error is 0.000816, and if you convert that into the scale of the original samples it turns out to be ~517.

**Generating speech**

In order to generate speech using this frame prediction model we really need something that gives us a good alignment between phonemes and acoustic samples, and that’s what we eventually want to learn. However, we can cheat a little and assume that our alignment is perfect, by taking the true alignment from our dataset. In the end, if our frame prediction model doesn’t do a good job using the true alignment, there’s no point in wasting time learning the alignment. This is what I did. The generation algorithm goes as follows: I take the current frame (in our case 600 samples) with the correct phoneme window and feed them into the network, which gives us 40 samples. I shift the frame by 20 samples to the left, assign the first 20 predicted samples to the last 20 places of the shifted frame, consider that the new current frame, and so on. Since sentences start with the phoneme ‘#h’, I start with a frame of zero-mean Gaussian noise with a small standard deviation. The generated signal is shown in blue in the following figure (green is the true speech signal):
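
The shift-and-append loop described above can be sketched as follows. This is not the synthesizer code itself: `predict` stands in for the trained network (any function mapping the concatenated input to 40 samples), and `noise_std=0.01` is an assumed value for the "small standard deviation".

```python
import numpy as np

def generate(predict, phoneme_windows, frame_len=600, shift=20, noise_std=0.01, seed=0):
    """Generate a waveform by repeatedly predicting and shifting the current frame.

    predict:         hypothetical stand-in for the network, input vector -> 40 samples
    phoneme_windows: one phoneme feature vector per generation step (true alignment)
    noise_std:       assumed scale of the initial noise frame
    """
    rng = np.random.default_rng(seed)
    frame = rng.normal(0.0, noise_std, size=frame_len)  # initial noise frame
    generated = []
    for phones in phoneme_windows:
        predicted = predict(np.concatenate([frame, phones]))
        new_samples = predicted[:shift]                  # keep the first 20 predictions
        frame = np.concatenate([frame[shift:], new_samples])  # shift left, fill the end
        generated.append(new_samples)
    return np.concatenate(generated)
```

Each step therefore contributes 20 new samples to the output, even though the network predicts 40.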

Although the result is not impressive, I like that the model is able to capture some variability. The main problem I see here is that the generated signal has a much higher frequency and a much lower amplitude than the true signal. You can listen to the generated sound here. It sounds more like music, but you can still hear the variability. The code for the speech synthesizer can be found here.

**What’s next?**

There are many directions for improvement. First, I want to train the model on the full dataset, so I need to check out Vincent’s data wrapper. I would also like to add more features to the input, for example the position of the current frame in the phoneme window, as well as speaker information. Another very important aspect is looking into a better representation of the frames, which might be more robust to errors in prediction.

]]>Similar to what Hubert has done, the data I used for this experiment is just a single speaker’s raw wave sequences for each of the 10 sentences. The data was normalized by dividing by the maximum absolute value of the acoustic samples, which puts the data in the [-1,1] range. The training and validation examples are constructed by taking sequential frames of 240 samples (15 ms * 16 samples/ms) from the wave sequences, where the first 239 samples are the input and the last sample is the target. I took 80% of the data for training, 10% for validation, and 10% for testing. This gave 378,856 training examples and 47,357 examples each for validation and testing. I also shuffled the data so we can assume it’s IID. I based my dataset preparation code on Laurent’s wrapper for TIMIT.
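
A minimal sketch of this preparation, for a single waveform, is shown below. It is not the code based on Laurent’s wrapper: the function names are hypothetical, I am assuming the frames overlap with a stride of one sample, and the sketch normalizes by the maximum of the array it is given.

```python
import numpy as np

def make_frames(wave, frame_len=240):
    """Normalize to [-1, 1] and cut overlapping frames (stride 1 is an assumption).

    Returns inputs of shape (n, frame_len - 1) and targets of shape (n,).
    """
    wave = np.asarray(wave, dtype=np.float64)
    wave = wave / np.max(np.abs(wave))
    n = len(wave) - frame_len + 1
    index = np.arange(n)[:, None] + np.arange(frame_len)[None, :]
    frames = wave[index]
    return frames[:, :-1], frames[:, -1]

def shuffled_split(x, y, seed=0):
    """Shuffle the examples and split them 80% / 10% / 10%."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    n_train = int(0.8 * len(x))
    n_valid = int(0.1 * len(x))
    train, valid, test = np.split(order, [n_train, n_train + n_valid])
    return (x[train], y[train]), (x[valid], y[valid]), (x[test], y[test])
```

Note that with overlapping frames, shuffling before splitting means validation frames share most of their samples with some training frames.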

The model is an MLP with one hidden layer of tanh units and a single output unit, also with tanh activation (the target is also normalized to [-1,1]). The loss function is mean squared error with an L2 regularization term. The code using Theano can be found here.

For this experiment, I used the following values for hyper-parameters:

Learning rate: 0.01, # hidden units: 500, L2 term coefficient: 0.0001, mini-batch size: 1000.
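
For readers who want to see the shape of the model without Theano, here is a minimal NumPy sketch of the forward pass and loss with these hyper-parameters. It is not the linked code: the initialization scheme and function names are my assumptions, and I am assuming the L2 penalty applies to the weight matrices only.

```python
import numpy as np

def init_params(n_in=239, n_hidden=500, seed=0):
    """Small random weights; this uniform initialization scheme is an assumption."""
    rng = np.random.default_rng(seed)
    b1 = np.sqrt(6.0 / (n_in + n_hidden))
    b2 = np.sqrt(6.0 / (n_hidden + 1))
    return {
        "W1": rng.uniform(-b1, b1, size=(n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.uniform(-b2, b2, size=(n_hidden, 1)),
        "b2": np.zeros(1),
    }

def forward(params, x):
    """One tanh hidden layer followed by a single tanh output unit."""
    hidden = np.tanh(x @ params["W1"] + params["b1"])
    return np.tanh(hidden @ params["W2"] + params["b2"])[:, 0]

def loss(params, x, y, l2=0.0001):
    """Mean squared error plus an L2 penalty on the weights (biases excluded: an assumption)."""
    mse = np.mean((forward(params, x) - y) ** 2)
    penalty = l2 * (np.sum(params["W1"] ** 2) + np.sum(params["W2"] ** 2))
    return mse + penalty
```

Training would then minimize `loss` with mini-batch gradient descent, using batches of 1000 examples and a learning rate of 0.01.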

The following figure shows the training and validation errors.

To understand the error we need to convert it into the scale of the original acoustic samples. The largest absolute value found in the data is 18102. Since we’re using mean squared error, we have to take the square root of the error and multiply it by 18102. The result for the lowest validation error (0.000218) is 267.27, which means that the root-mean-square error of the predicted sample is ~267. I would say this is large. We also didn’t check the variance, which might be large too. I wouldn’t expect anything meaningful from this model, though, as it’s impossible to train a model on examples of only very short speech signals and expect it to generate any possible signal. We certainly need more features, at least phonemes.
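
The conversion is a one-liner, written here as a small helper (the function name is hypothetical) so that the numbers can be checked:

```python
import math

def error_in_original_scale(mse, max_abs=18102):
    """Convert an MSE on data scaled to [-1, 1] back to the original sample scale."""
    return math.sqrt(mse) * max_abs

print(round(error_in_original_scale(0.000218), 2))  # lowest validation error above
```

The same formula applied to an error of 0.000816 gives roughly 517.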

My plan for the following days is to work on the frame level, i.e. predicting next frame from previous frames, and taking phonemes into account. This will be the core of the model I talked about in the previous post, which will also have another component which helps align input phonemes with output frames.

]]>Ok, so we want to map a sequence of n phonemes into a sequence of m frames. Notice that the two sequences have different lengths; in fact m > n, so each phoneme produces multiple frames, and the number of frames varies for each phoneme depending on its context.

We can think of solving this problem by building a model that does the acoustic frame prediction **one at a time**. That is, each time step, we ask it to produce one frame based on the current phoneme, or a window of phonemes that we think collaborated in producing this frame. However, when we’re synthesizing speech, we don’t have a priori the number of frames we have to produce. This means we need to make our model learn that, too. I will describe here one way of doing that.

We want to learn two functions: one that predicts the next frame, and one that decides whether to advance to the next window of phonemes. The input to the two functions is the same: the window of phonemes we’re using for producing one frame, together with the previous output frames, which help us predict the next frame. For clarity, I have ignored other inputs here, like speaker information. The value of the second function helps us decide whether we want to advance to the next window, i.e. shift our window by one. A simple approach is for it to be a binary value that tells us whether to advance or not. At each time step, we look at this value and decide whether to move the window or not, then we produce the current frame.
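
The control flow described above can be sketched as a simple loop. Everything here is illustrative: `predict_frame` and `should_advance` stand in for the two learned functions, and the stopping rule (stop when the model asks to advance past the last window) is my assumption, since the text does not say how generation ends.

```python
def synthesize(phonemes, predict_frame, should_advance, window=2, max_frames=10000):
    """Produce frames one at a time while sliding a window over the phonemes.

    predict_frame(context, frames)  -> next frame            (hypothetical stand-in)
    should_advance(context, frames) -> True to shift window  (hypothetical stand-in)
    Stops when asked to advance past the last window (an assumed stopping rule).
    """
    if len(phonemes) < window:
        raise ValueError("need at least `window` phonemes")
    frames = []
    position = 0
    last_position = len(phonemes) - window
    while len(frames) < max_frames:
        context = phonemes[position:position + window]
        if should_advance(context, frames):
            if position == last_position:
                break  # nothing left to advance to: the utterance is finished
            position += 1
            context = phonemes[position:position + window]
        frames.append(predict_frame(context, frames))
    return frames
```

For example, with toy stand-ins that advance after every second frame and simply emit the first phoneme of the window, three phonemes produce four frames:

```python
frames = synthesize(
    ["a", "b", "c"],
    predict_frame=lambda context, frames: context[0],
    should_advance=lambda context, frames: len(frames) > 0 and len(frames) % 2 == 0,
)
# frames == ["a", "a", "b", "b"]
```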

My plan was originally to describe in this post a model for the full task, but I decided that it’s better to start with a simpler task. Actually, the simplest thing one can start with is just predicting the next acoustic sample given the previous samples. This model could also be helpful as a sub-model for other, more complicated architectures. I will report my current experiment and results in the following post.

]]>The first question I asked myself is how can we model speech synthesis as a statistical machine learning problem? I found a very good talk by Keiichi Tokuda that answers my question. Here are the slides and recordings of the talk. We’re mainly interested in the statistical formulation of the problem, because we can use the same formulation (probably simpler?) in our deep learning algorithms. I will not restate what’s in the slides, but in summary, the speech synthesis problem can be defined as:

Given a bunch of speech waveforms, their text transcriptions, and a text to be synthesized, the output is a speech waveform. We can define a probability distribution over the output speech waveform, conditioned on these inputs, and either draw the output from that distribution or find the waveform that maximizes it.

In the talk the problem was decomposed into several sub-problems:

- Feature extraction from speech waveforms
- Feature extraction from text (labeling text)
- Acoustic modeling (a parametric model of both speech and text transcriptions, built using features of text and speech)
- Text analysis of the input text, or in other words extracting features of the input text
- Speech parameter generation, using both the acoustic model and the features of the input text. Those parameters are used in generating the output speech waveform. This is actually the main bulk of the system.
- Waveform reconstruction from speech parameters

In the case of this talk, each of these sub-problems is handled with a separate model, but in our case we can probably build one deep model that automatically learns a good representation (features) of speech and text. I am not sure, however, of two things. What kind of preprocessing is necessary for speech waveforms before we feed them into the model? And do we train the model to directly generate a speech waveform, or just speech parameters, and use a fine-tuned model for waveform reconstruction? I need to see how this is handled in some of the recent deep learning papers.

]]>