## Mixture Density Networks

Motivation

One of the coolest ideas I learned in this course is the probabilistic interpretation of neural networks. Instead of using the NN to predict the targets directly, we use it to predict the parameters of a conditional distribution over the targets (given the input). So for each input we learn a conditional distribution over the target; all of these distributions have the same form, but different parameters. For example, we can assume that the target has a Gaussian conditional distribution and let the network predict the mean of this distribution (we assume here that the variance is independent of the input, but this assumption can be relaxed). We saw in class that if we train the network with the mean squared error loss function, we get the same solution as when we train it with the negative log-likelihood of this Gaussian.

One might hope to learn a more complex (multi-modal) conditional distribution for the target. This is exactly the goal of mixture density networks: we want to model the conditional distribution as a mixture of Gaussians, where the parameters of each Gaussian component depend on the input, i.e.:

$$P(y_n \mid x_n) = \sum_{k=1}^K \pi_k(x_n) \mathcal{N}_k(y_n \mid \mu_k(x_n), \sigma_k^2(x_n))$$

The network here has 3 types of outputs: the mixing coefficients $\pi_k$, the means of the Gaussian components $\mu_k$, and their variances $\sigma^2_k$. One might think: we don’t have ground truths for those outputs, so how can we make the network learn them? The answer is that we don’t need ground truths, because the loss function we’re going to use is the negative log-likelihood of the data, so we just update the parameters of the model to minimize this loss. For more discussion of this model, see chapter 5 of Bishop’s book.

Implementation

I have written a Theano implementation of mixture density networks (MDN), which you can find here.
I wrote it so that it supports predicting multiple samples at once, so the Gaussian components are multivariate, and it also supports mini-batches of data. This made the implementation a little more interesting, since I have to deal with a 3D tensor for $\mu$. Instead of the single output matrix of a standard MLP, the output layer has a tensor for $\mu$ and two matrices for $\sigma^2$ and $\pi$. The activation function for $\mu$ is chosen to match the desired output, while $\sigma^2$ and $\pi$ use a softplus and a softmax respectively.

Similar to what David has observed, a straightforward implementation of the MDN produces a lot of NaNs. A very important issue when implementing an MDN is that the log-likelihood contains a log-sum-exp expression, which can be numerically unstable. This can be fixed using this trick. I also had to use a smaller initial learning rate than the one I used in my previous MLP; otherwise I would get NaNs in the likelihood. With these two tricks, I don’t get any more NaNs.

For the RNADE paper trick, I tried multiplying the mean with the variance in the cost function, but this changes the gradients of the variance and makes the performance worse, so I didn’t find it helpful at all. Multiplying the gradient of $\mu$ directly by $\sigma$ is a little tricky when you’re using Theano’s automatic differentiation, and that’s probably why, when I checked the RNADE code, I found that they compute the gradients without using Theano’s T.grad.

Experiments

We would like to compare the MDN model with a similar MLP, and we can compare them in terms of the mean negative log-likelihood (Mean NLL) and the MSE on the same validation set. Computing the log-likelihood of the MLP is easy: it’s just the log of a Gaussian whose mean is the output of the network and whose variance is the maximum likelihood estimate from the data, which turns out to be the MSE.
On the other hand, to compute the MSE for the MDN model, we need to sample from the conditional distribution over the target. For each input data point, we first sample a component from the multinomial distribution over the components (parametrized by the mixing coefficients), and then sample the prediction from the selected Gaussian component.

I ran the first set of experiments on the AA phone dataset. I took 100 sequences for training and 10 for validation, and trained two models. Both the MLP and the MDN take as input a frame of 240 samples and output one sample. The resulting dataset has 162,891 training examples and 14,739 validation examples. The following plot shows the training and validation mean NLL of the MDN for the following hyper-parameter configuration:

• 2 hidden layers each has 300 units with tanh activations
• initial learning rate MDN: 0.001
• linear annealing of learning rate to 0 starting after 50 epochs
• 128 samples per mini-batch
• 3 components
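
To make the Mean NLL in these experiments concrete, here is a minimal NumPy sketch of how it could be computed in a numerically stable way. It is not the Theano code from this post: the function name `mdn_mean_nll` is made up for illustration, the shapes follow the description above (a tensor for $\mu$, matrices for $\sigma^2$ and $\pi$), and I am assuming one variance per component shared across output dimensions. The stabilization shown is the usual "subtract the max" form of log-sum-exp, which may or may not be exactly the trick linked above.

```python
import numpy as np

def mdn_mean_nll(y, pi, mu, sigma2):
    """Mean negative log-likelihood of an isotropic Gaussian mixture.

    y:      (N, D) targets
    pi:     (N, K) mixing coefficients (each row sums to 1)
    mu:     (N, K, D) component means
    sigma2: (N, K) component variances (assumed shared across the D outputs)
    """
    d = y.shape[1]
    # Squared distance between each target and each component mean: (N, K)
    sq_dist = np.sum((y[:, None, :] - mu) ** 2, axis=2)
    # log pi_k + log N(y | mu_k, sigma2_k * I) for every example and component
    log_comp = (np.log(pi)
                - 0.5 * d * np.log(2.0 * np.pi * sigma2)
                - 0.5 * sq_dist / sigma2)
    # Stable log-sum-exp over components: subtract the row max before exp
    m = np.max(log_comp, axis=1, keepdims=True)
    log_lik = m[:, 0] + np.log(np.sum(np.exp(log_comp - m), axis=1))
    return -np.mean(log_lik)
```

Without the subtraction of `m`, a target that is far from every component makes all the `exp` terms underflow to zero, and the log of their sum becomes `-inf`, which is one way the NaNs mentioned above can show up during training.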

We can see the Mean NLL decreasing, which means the model is learning. The validation Mean NLL stabilizes after almost 100 epochs. What I am mainly interested in, though, is comparing the same MLP architecture with the MDN. Therefore, I used pretty much the same hyper-parameters for both networks, to see whether we gain anything just by having the mixture of Gaussians at the output layer. The following is a plot that shows results on the same validation set and using the following hyper-parameters:

I was expecting the MDN to perform better than the MLP. However, we can see that the MLP is better than the MDN in terms of both MSE and Mean NLL. The minimum MSE is 0.0222 for the MLP and 0.0324 for the MDN, and the minimum Mean NLL is -1.29 for the MLP and -0.77 for the MDN. This is actually the typical performance pattern in pretty much all the experiments I did on this dataset. To investigate further, I tried varying the number of components, and found that performance improves only a little as we add components (for 10 components the minimum Mean NLL reaches -0.91). With neither model was I able to generate something that sounds like \aa\. The following waveform generated from the MDN model shows that it was able to capture the periodicity of the \aa\ sound, but it’s still more peaky than a natural signal:

We saw that the MDN doesn’t do better than the MLP on the \aa\ dataset, so it turns out we’re not benefiting from having a multi-modal predictive distribution. To verify this further, I performed another set of experiments on a more complicated task, where I used full utterances of one speaker (FCLT0) together with the phoneme information (the current and next phonemes, as in the previous experiment). I trained on 9 utterances and validated on 1. The dataset has 402,939 training examples and 70,621 validation examples.
Using the same hyper-parameter settings, I got the following results:

Here we see that the MDN beats the MLP in terms of the Mean NLL, but still doesn’t perform better on MSE. This is kind of surprising, as you might think that the MDN has a better model of the data, but it’s probably the variance of sampling from the MDN that increases the error. This is something worth investigating further in the future.
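
To illustrate the sampling procedure used for the MDN’s MSE, and why it can add error, here is a small NumPy sketch. The names `sample_mdn` and `mixture_mean` are hypothetical, and the shapes are the same assumed ones as before (`pi` and `sigma2` are matrices, `mu` is a 3D tensor). The mixture mean is not something I evaluated in the experiments above; it is included only as a point of comparison, since a sampled prediction carries the spread of the mixture on top of the error of the mean.

```python
import numpy as np

def sample_mdn(pi, mu, sigma2, rng):
    """Draw one prediction per input from an isotropic Gaussian mixture.

    pi: (N, K) mixing coefficients, mu: (N, K, D) means, sigma2: (N, K) variances.
    """
    n, k, d = mu.shape
    # Step 1: pick one component per example according to its mixing coefficients
    comp = np.array([rng.choice(k, p=pi[i]) for i in range(n)])
    rows = np.arange(n)
    mean = mu[rows, comp]                        # (N, D)
    std = np.sqrt(sigma2[rows, comp])[:, None]   # (N, 1)
    # Step 2: sample from the selected Gaussian component
    return mean + std * rng.standard_normal((n, d))

def mixture_mean(pi, mu):
    """Deterministic alternative: the conditional mean sum_k pi_k * mu_k."""
    return np.sum(pi[:, :, None] * mu, axis=1)
```

Usage would look like `pred = sample_mdn(pi, mu, sigma2, np.random.default_rng(0))`, followed by the usual mean of squared differences against the targets.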

## Frame prediction given phoneme window

I will explain here my latest experiments with implementing the MLP model I talked about in this post. The main idea is to implement the function $Y_t$ using an MLP that predicts an acoustic frame (a sequence of acoustic samples) based on the previous frame(s) and a window of phonemes, and to use another model to implement $S_t$, which takes the same input but tells us whether we have to shift the window of phonemes one step ahead or not.

Training and validation sets

First I will explain the dataset I used for training the model. You can find the dataset preparation script here. The input in each training example is composed of a fixed number of acoustic samples (e.g. 600 samples, which is equivalent to 2.5 frames of 240 samples) and a one-hot representation of the two phonemes corresponding to the current frame (the current window of phonemes). The target is just the following sub-frame (e.g. 40 samples). In the experiment below I used data from 10 speakers, with 60 input samples, where each sample is a float in [-1,1], and 39*2 floats in {0,1} for the one-hot representation of the two phonemes. The target is 40 floats in [-1,1]. This dataset has 221,530 examples, and that’s again only for 10 speakers! It’s clear that we need an efficient (RAM-wise and CPU-wise) way to deal with the data, which is what Vincent is working on. The validation set has only 5 sentences from one speaker.
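
As a rough sketch of how one such training example could be assembled (this is not the preparation script itself): the helper names `one_hot` and `make_example` are made up for illustration, the 39 phoneme classes come from the 39*2 one-hot floats mentioned above, and I am assuming the two phonemes are the current and the next one.

```python
import numpy as np

N_PHONEMES = 39  # from the text: 39 * 2 one-hot floats for the two phonemes

def one_hot(index, size=N_PHONEMES):
    """One-hot vector with a 1 at position `index`."""
    v = np.zeros(size, dtype=np.float32)
    v[index] = 1.0
    return v

def make_example(wave, start, n_in, n_out, cur_phone, next_phone):
    """Build one (input, target) pair from a waveform normalized to [-1, 1].

    Input  = n_in acoustic samples + one-hot(current) + one-hot(next phoneme).
    Target = the n_out samples that immediately follow the input samples.
    """
    samples = wave[start:start + n_in]
    target = wave[start + n_in:start + n_in + n_out]
    x = np.concatenate([samples, one_hot(cur_phone), one_hot(next_phone)])
    return x.astype(np.float32), target.astype(np.float32)
```

With `n_in=600` and `n_out=40`, each input vector has 600 + 78 = 678 entries and each target has 40.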

The MLP for $Y_t$

I used an MLP with 2 hidden layers, each with 300 units and tanh activations. The activation function for the output is also tanh. I trained the network using a learning rate of 0.01 (annealed linearly to 0 starting after 30 epochs), 100 samples per mini-batch, and an L2 regularization term of 0.0001. I haven’t done a grid search over the hyper-parameters, but I have tried several configurations for the size of the layers and even changed the activation of the hidden layers to rectifiers (I used a GPU to run the experiments quickly), and this is the setting I found to perform best. The following figure shows training and validation error.

The errors are very close and the model is clearly underfitting, although I am not sure why the validation error is so close to the training error. The lowest error is 0.000816, which corresponds to ~517 when converted to the scale of the original samples.
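
For reference, the learning-rate schedule described for this MLP (constant at first, then annealed linearly to 0 starting after 30 epochs) could be written as in the sketch below. The function name `annealed_lr` and the total number of epochs (`total=100`) are assumptions made for illustration; the post does not state how many epochs the annealing spans.

```python
def annealed_lr(epoch, base_lr=0.01, start=30, total=100):
    """Constant base_lr up to epoch `start`, then linear decay to 0 at `total`.

    `total` is a hypothetical value; only `base_lr` and `start` come from the text.
    """
    if epoch <= start:
        return base_lr
    if epoch >= total:
        return 0.0
    return base_lr * (total - epoch) / (total - start)
```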

Generating speech

In order to generate speech using this frame prediction model, we really need something that gives us a good alignment between phonemes and acoustic samples, and that’s what we want $S_t$ to do. However, we can cheat a little and assume that our $S_t$ is perfect by taking the true alignment from our dataset. After all, if our frame prediction model doesn’t do a good job with the true alignment, there’s no point in spending time on learning the alignment. This is what I did. The generation algorithm goes as follows: I take the current frame (in our case 600 samples) with the correct phoneme window and feed them into the network, which gives us 40 samples. I then shift the frame by 20 samples to the left, assign the first 20 predicted samples to the last 20 places of the shifted frame, and treat the result as the new current frame, and so on. Since sentences start with the phoneme ‘#h’, I start with a frame of Gaussian noise with a small standard deviation. The generated signal is shown in blue in the following figure (green is the true speech signal):

Although the result is not impressive, I like that the model is able to capture some variability. The main problem I see here is that the generated signal has a much higher frequency and a much lower amplitude than the true one. You can listen to the generated sound here. It sounds more like music, but you can still hear the variability. The code for the speech synthesizer can be found here.
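
The generation loop described above can be sketched roughly as follows. This is not the synthesizer code itself: `generate` and its parameters are hypothetical names, `predict` stands in for the trained network, `phoneme_feats` stands in for the one-hot phoneme windows taken from the true alignment (one per step), and the noise level `noise_std=0.01` is an arbitrary choice.

```python
import numpy as np

def generate(predict, phoneme_feats, frame_len=600, shift=20,
             noise_std=0.01, seed=0):
    """Generate a waveform by repeatedly predicting and sliding the frame.

    predict(x) -> 1-D array of predicted samples (at least `shift` long).
    phoneme_feats: sequence of 1-D phoneme feature vectors, one per step.
    """
    rng = np.random.default_rng(seed)
    # Start from low-amplitude Gaussian noise (sentences begin with '#h')
    frame = noise_std * rng.standard_normal(frame_len)
    out = []
    for feats in phoneme_feats:
        pred = predict(np.concatenate([frame, feats]))
        new = pred[:shift]
        # Shift the frame left by `shift` samples and append the new ones
        frame = np.concatenate([frame[shift:], new])
        out.append(new)
    return np.concatenate(out) if out else np.zeros(0)
```

Only the first 20 of the 40 predicted samples are kept at each step, so consecutive predictions overlap by half.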

What’s next?

There are many directions for improvement. First, I want to train the model on the full dataset, so I need to check out Vincent’s data wrapper. I would also like to add more features to the input, for example the position of the current frame in the phoneme window and speaker information. Another very important aspect is looking into a better representation of the frames, which might be more robust to errors in prediction.

## Just the next acoustic sample

In this post I will talk about my very first experiment, which is predicting the next acoustic sample given a fixed window of previous samples. The idea here is just to get started in the project and prepare for more serious experiments.

Similar to what Hubert has done, the data I used for this experiment is just a single speaker’s raw wave sequences for each of the 10 sentences. The data was normalized by dividing by the maximum absolute value of the acoustic samples, which puts the data in the [-1,1] range. The training and validation examples are constructed by taking sequential frames of 240 samples (15 ms * 16 samples/ms) from the wave sequences, where the first 239 samples are the input and the last sample is the target. I took 80% of the data for training, 10% for validation, and 10% for testing. This gave 378856 training examples and 47357 examples each for validation and testing. I also shuffled the data so we can assume it’s IID. I based my dataset preparation code on Laurent’s wrapper for TIMIT.
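
Here is a minimal sketch of this preparation for a single waveform. It does not use Laurent’s wrapper: the function name `make_dataset` is made up, and I am assuming that consecutive frames overlap with a stride of one sample. Handling several sentences would mean building the frames per sentence and concatenating them before shuffling, which is left out here.

```python
import numpy as np

def make_dataset(wave, frame_len=240, seed=0):
    """Build shuffled (input, target) pairs from one raw waveform."""
    wave = np.asarray(wave, dtype=np.float64)
    wave = wave / np.max(np.abs(wave))           # normalize into [-1, 1]

    # Every window of frame_len consecutive samples (stride 1 is an assumption)
    n = len(wave) - frame_len + 1
    idx = np.arange(n)[:, None] + np.arange(frame_len)[None, :]
    frames = wave[idx]                           # shape (n, frame_len)

    np.random.default_rng(seed).shuffle(frames)  # shuffle the examples (rows)
    x, y = frames[:, :-1], frames[:, -1]         # first 239 -> input, last -> target

    # 80% / 10% / 10% split
    n_train, n_valid = int(0.8 * n), int(0.1 * n)
    return {
        "train": (x[:n_train], y[:n_train]),
        "valid": (x[n_train:n_train + n_valid], y[n_train:n_train + n_valid]),
        "test": (x[n_train + n_valid:], y[n_train + n_valid:]),
    }
```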

The model is an MLP with one hidden layer of tanh units and a single output unit, also with a tanh activation (the targets are normalized to [-1,1] as well). The loss function is the mean squared error with an L2 regularization term. The code using Theano can be found here.

For this experiment, I used the following values for hyper-parameters:

Learning rate: 0.01, # hidden units: 500, L2 term coefficient: 0.0001, mini-batch size: 1000.

The following figure shows the training and validation errors.

To understand the error we need to convert it into the scale of the original acoustic samples. The largest absolute value of the samples found in the data is 18102. Since we’re using the mean squared error, we have to take the square root of the error and multiply it by 18102. The result for the lowest validation error (0.000218) is 267.27, which means that the average error in the predicted sample is ~267. I would say this is large. We also didn’t check the variance, which might be large as well. I wouldn’t expect anything meaningful from this model, though, as it’s impossible to train a model on examples of only very short speech signals and expect it to generate any possible signal. We certainly need more features, at least phonemes.
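
The conversion above is just a square root followed by a rescaling, as in this short check (the function name `to_original_scale` is made up for illustration):

```python
import math

def to_original_scale(mse, max_abs):
    """Root of the normalized MSE, scaled back by the normalization constant."""
    return math.sqrt(mse) * max_abs

# Lowest validation error and largest absolute sample value from the text
rmse = to_original_scale(0.000218, 18102)
print(round(rmse, 2))
```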

My plan for the following days is to work at the frame level, i.e. predicting the next frame from previous frames while taking phonemes into account. This will be the core of the model I talked about in the previous post, which will also have another component that helps align input phonemes with output frames.

## Initial model for speech synthesis task

In this post I will summarize the first model for the speech synthesis task. I’ll start with a high-level description of the task. We’ll be working with the TIMIT dataset, which has a set of utterances, each described by a sequence of words and their corresponding phonemes, which are aligned with a sequence of acoustic samples (the speech waveform). Those sequences are aligned in the sense that we know when each word/phoneme starts and ends in time, so we can associate them “somehow” with sub-sequences of acoustic samples. For each utterance we also have information about the speaker (age, dialect, etc.). For now, I will probably work with only the phonemes and ignore the words. I might incorporate them into the input in the future. In addition, we can think of the output sequence of acoustic samples as a sequence of frames, which are usually sub-sequences spanning 10-20 ms worth of acoustic samples. Those can be represented either in the raw format or in one of the representations described in this post.

Ok, so we want to map a sequence of phonemes, say $X_1, X_2, \dots, X_n$, into a sequence of frames $Y_1, Y_2, \dots, Y_m$. Notice that the two sequences have different lengths, in fact $m > n$, so each phoneme produces multiple frames, and the number of frames varies for each phoneme depending on its context.

We can think of solving this problem by building a model that does the acoustic frame prediction one at a time. That is, each time step, we ask it to produce one frame based on the current phoneme, or a window of phonemes that we think collaborated in producing this frame. However, when we’re synthesizing speech, we don’t have a priori the number of frames we have to produce. This means we need to make our model learn that, too. I will describe here one way of doing that.

We want to learn two functions: $S_t = g(Y_{t-1},\dots,Y_{t-k},W_i)$ and $Y_t = f(Y_{t-1},\dots,Y_{t-k},W_i)$. The input to the two functions is the same, where $W_i=[X_i, X_{i+1},\dots,X_{i+w}]$ is the window of phonemes of length $w$ we’re using for producing one frame, and $Y_{t-1},\dots,Y_{t-k}$ are the previous $k$ output frames, which help us predict the next frame $Y_t$. For clarity, I have ignored other inputs here, like speaker information. Now, the value $S_t$ helps us decide whether we want to advance to the next window, i.e. shift our window by one. A simple approach is to make it a binary value that tells us whether to advance or not. At each time step, we look at the value of $S_t$ and decide whether to move the window or not, then we produce the current frame $Y_t$.
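
To make the interplay between the two functions concrete, here is a toy sketch of the synthesis loop. Everything in it is hypothetical: `f` and `g` are placeholders for the two learned functions, `synthesize` is a made-up name, the window is simply a slice of `w` phonemes, and the stopping rule (stop when $S_t$ asks to advance past the last window) is my own assumption, since the description above does not say how generation ends.

```python
def synthesize(phonemes, f, g, w=2, k=1, max_frames=1000):
    """Produce frames one at a time, shifting the phoneme window when g says so.

    f(prev_frames, window) -> next frame          (plays the role of Y_t)
    g(prev_frames, window) -> bool, advance or not (plays the role of S_t)
    """
    frames = []
    i = 0  # index of the first phoneme in the current window
    while len(frames) < max_frames:
        prev = frames[-k:]                 # previous k frames (fewer at the start)
        window = phonemes[i:i + w]
        if g(prev, window):                # S_t: shift the window by one
            if i + w >= len(phonemes):
                break                      # assumed stopping rule: nothing left
            i += 1
            window = phonemes[i:i + w]
        frames.append(f(prev, window))     # Y_t: produce the current frame
    return frames
```

The `max_frames` cap is only there so the loop terminates even if `g` never asks to advance.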

My plan was originally to describe a model for the full task in this post, but I decided that it’s better to start with a simpler task. Actually, the simplest thing one can start with is just predicting the next acoustic sample given the previous samples. This model could also be helpful as a sub-model for other, more complicated architectures. I will report my current experiment and results in the following post.

## Starting with IFT6266 project

The project for this course was posted yesterday. We’re going to work on the task of speech synthesis. My first impression is that it’s going to be a very challenging project, mainly because building a “good” speech synthesizer seems to me a complex task that requires a lot of engineering and signal processing expertise. However, I am pretty excited to see how we can use deep learning algorithms for this task.

The first question I asked myself is how can we model speech synthesis as a statistical machine learning problem? I found a very good talk by Keiichi Tokuda that answers my question. Here are the slides and recordings of the talk. We’re mainly interested in the statistical formulation of the problem, because we can use the same formulation (probably simpler?) in our deep learning algorithms. I will not restate what’s in the slides, but in summary, the speech synthesis problem can be defined as:

Given a bunch of speech waveforms ($X$), their text transcriptions ($W$), and a text to be synthesized ($w$), the output is a speech waveform ($x$). We can define a probability distribution over the output speech waveform, $p(x|w,X,W)$, and draw the output $x$ from that distribution (or find the $x$ that maximizes it).
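
Written out, the two options in the last sentence are (my own notation, with $\hat{x}$ denoting the most likely waveform; the slides may use different symbols):

$$x \sim p(x \mid w, X, W) \qquad \text{or} \qquad \hat{x} = \arg\max_{x} \; p(x \mid w, X, W)$$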

In the talk the problem was decomposed into several sub-problems:

• Feature extraction from speech waveforms
• Feature extraction from text (labeling text)
• Acoustic modeling (a parametric model of both speech and text transcriptions, built using features of text and speech)
• Text analysis of the input text ($w$), or in other words extracting features of the input text
• Speech parameters generation, using both the acoustic model and features of input text. Those parameters are used in generating the output speech waveform $x$. This is actually the main bulk of the system.
• Waveform reconstruction from speech parameters

In the case of this talk, each of these sub-problems is handled with a separate model, but in our case we can probably build one deep model that automatically learns a good representation (features) of speech and text. I am not sure, however, about two things: what kind of preprocessing is necessary for speech waveforms before we feed them into the model? And do we train the model to directly generate a speech waveform, or just speech parameters, and then use a fine-tuned model for waveform reconstruction? I need to see how this is handled in some of the recent deep learning papers.