In this post I will talk about my very first experiment, which is predicting the next acoustic sample given a fixed window of previous samples. The idea here is just to get started in the project and prepare for more serious experiments.
Similar to what Hubert has done, the data I used for this experiment is just a single speaker’s raw wave sequences for each of the 10 sentences. The data was normalized by dividing by the maximum absolute value of the acoustic samples. This makes the data in [-1,1] range. The training and validation examples are constructed by taking sequential frames of length 240 samples (15ms * 16 sample/ms) from wave sequences, where the first 239 are considered as input and the last sample is the target. I took 80% of the data for training, 10% for validation, and 10% for testing. This gave 378856 training examples and 47357 validation and testing examples. I also shuffled the data so we can assume it’s IID. I based my code for dataset preparation on Laurent’s wrapper for TIMIT.
The model is a one hidden layer MLP, with one hidden layer of tanh activations, and one output with also tanh activation (the output is also normalized to [-1,1]). The loss function is mean squared error with L2 regularization term. The code using Theano can be found here.
For this experiment, I used the following values for hyper-parameters:
Learning rate: 0.01, # hidden units: 500, L2 term coefficient: 0.0001, mini-batch size: 1000.
The following figure shows the training and validation errors.
To understand the error we need to convert it into the scale of original acoustic samples. The largest absolute value of samples found in the data is 18102. Since we’re using mean squared error, we have to take the square root of the error and multiply it by 18102. The result for the lowest validation error (0.000218) is 267.27, which means that the average error in the predicted sample is ~267. I would say this is large. We also didn’t check the variance, which might be also large. I wouldn’t expect anything meaningful from this model though as it’s impossible to train a model on a examples of only very short speech signals and expect it to generate any possible signal. We certainly need more features – at least phonemes.
My plan for the following days is to work on the frame level, i.e. predicting next frame from previous frames, and taking phonemes into account. This will be the core of the model I talked about in the previous post, which will also have another component which helps align input phonemes with output frames.