Just the next acoustic sample

In this post I describe my very first experiment: predicting the next acoustic sample given a fixed window of previous samples. The idea is just to get started on the project and prepare for more serious experiments.

Similar to what Hubert has done, the data I used for this experiment is just a single speaker’s raw waveforms for each of 10 sentences. The data was normalized by dividing by the maximum absolute value of the acoustic samples, which puts it in the [-1, 1] range. The training and validation examples are constructed by taking sequential frames of 240 samples (15 ms * 16 samples/ms) from the waveforms, where the first 239 samples are the input and the last sample is the target. I took 80% of the examples for training, 10% for validation, and 10% for testing. This gave 378856 training examples and 47357 examples each for validation and testing. I also shuffled the examples so that we can treat them as IID. I based my dataset preparation code on Laurent’s wrapper for TIMIT.
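The dataset construction described above can be sketched roughly as follows. This is not the actual code (which was based on Laurent’s TIMIT wrapper), just a minimal NumPy illustration. The function names `make_examples` and `build_dataset` are made up for this sketch, and it assumes a stride of one sample between frames and that shuffling happens before the split.

```python
import numpy as np

FRAME_LEN = 240  # 15 ms * 16 samples/ms: 239 inputs + 1 target


def make_examples(wave, frame_len=FRAME_LEN):
    """Slice one waveform into overlapping frames (stride 1 is an assumption)."""
    wave = np.asarray(wave, dtype=np.float64)
    n_frames = len(wave) - frame_len + 1
    # Row i of idx selects samples i .. i + frame_len - 1.
    idx = np.arange(n_frames)[:, None] + np.arange(frame_len)[None, :]
    frames = wave[idx]
    # First 239 samples are the input, the last one is the target.
    return frames[:, :-1], frames[:, -1]


def build_dataset(waves, frame_len=FRAME_LEN, seed=0):
    """Normalize, frame, shuffle, and split 80/10/10 (hypothetical helper)."""
    waves = [np.asarray(w, dtype=np.float64) for w in waves]
    # Divide by the largest absolute sample so everything lies in [-1, 1].
    peak = max(np.max(np.abs(w)) for w in waves)
    pairs = [make_examples(w / peak, frame_len) for w in waves]
    x = np.concatenate([p[0] for p in pairs])
    y = np.concatenate([p[1] for p in pairs])

    # Shuffle before splitting; the order of these two steps is an assumption.
    perm = np.random.default_rng(seed).permutation(len(x))
    x, y = x[perm], y[perm]

    n_train = int(0.8 * len(x))
    n_valid = int(0.1 * len(x))
    train = (x[:n_train], y[:n_train])
    valid = (x[n_train:n_train + n_valid], y[n_train:n_train + n_valid])
    test = (x[n_train + n_valid:], y[n_train + n_valid:])
    return train, valid, test
```

Calling `build_dataset` on a list of raw waveforms (one array per sentence) returns three `(inputs, targets)` pairs, with inputs of shape `(n_examples, 239)`.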

The model is an MLP with one hidden layer of tanh units and a single output unit, also with tanh activation (the targets are normalized to [-1, 1], so the output range matches). The loss function is mean squared error with an L2 regularization term. The code using Theano can be found here.
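For readers who don’t want to open the Theano code, here is a rough NumPy sketch of the forward pass and loss for a model of this shape. It is only an illustration, not the linked implementation: the names `init_params`, `forward`, and `loss` are invented here, and the uniform initialization and the choice to penalize only the weight matrices (not the biases) are assumptions.

```python
import numpy as np


def init_params(n_in=239, n_hidden=500, seed=0):
    """Random weights, zero biases (initialization scheme is an assumption)."""
    rng = np.random.default_rng(seed)
    b1 = np.sqrt(6.0 / (n_in + n_hidden))
    b2 = np.sqrt(6.0 / (n_hidden + 1))
    return {
        "W1": rng.uniform(-b1, b1, (n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.uniform(-b2, b2, (n_hidden, 1)),
        "b2": np.zeros(1),
    }


def forward(params, x):
    """One tanh hidden layer, then one tanh output unit per example."""
    h = np.tanh(x @ params["W1"] + params["b1"])
    out = np.tanh(h @ params["W2"] + params["b2"])
    return out[:, 0]  # shape (n_examples,), values in [-1, 1]


def loss(params, x, y, l2=1e-4):
    """Mean squared error plus an L2 penalty on the weight matrices."""
    mse = np.mean((forward(params, x) - y) ** 2)
    penalty = l2 * (np.sum(params["W1"] ** 2) + np.sum(params["W2"] ** 2))
    return mse + penalty
```

In Theano the gradients of this loss come from symbolic differentiation, so the sketch stops at the loss.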

For this experiment, I used the following values for hyper-parameters:

Learning rate: 0.01, # hidden units: 500, L2 term coefficient: 0.0001, mini-batch size: 1000.
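To make the role of these values concrete, here is a small sketch of one epoch of mini-batch gradient descent using them. I am assuming plain SGD with a fixed learning rate here. `sgd_epoch` and the `loss_and_grads` callback are hypothetical names; the callback stands in for whatever computes the loss and its gradients (Theano does this symbolically).

```python
import numpy as np

# Hyper-parameter values from this experiment.
HYPERPARAMS = {
    "learning_rate": 0.01,
    "n_hidden": 500,
    "l2": 1e-4,
    "batch_size": 1000,
}


def sgd_epoch(params, x, y, loss_and_grads,
              learning_rate=HYPERPARAMS["learning_rate"],
              batch_size=HYPERPARAMS["batch_size"]):
    """One pass over (x, y) in mini-batches; updates params in place.

    loss_and_grads(params, xb, yb) must return (loss, grads), where grads
    is a dict with the same keys as params. Returns the mean batch loss.
    """
    losses = []
    for start in range(0, len(x), batch_size):
        xb = x[start:start + batch_size]
        yb = y[start:start + batch_size]
        batch_loss, grads = loss_and_grads(params, xb, yb)
        for name in params:
            # Plain gradient step, no momentum or learning-rate decay.
            params[name] -= learning_rate * grads[name]
        losses.append(batch_loss)
    return float(np.mean(losses))
```

With 378856 training examples and a mini-batch size of 1000, one epoch is 379 updates (the last batch is smaller).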

The following figure shows the training and validation errors.

[Figure: first experiment's results, training and validation errors]

To understand the error, we need to convert it back to the scale of the original acoustic samples. The largest absolute sample value in the data is 18102. Since we’re using mean squared error, we take the square root of the error and multiply it by 18102. For the lowest validation error (0.000218) this gives 267.27, which means that the root-mean-square error of the predicted sample is ~267 on the original scale (about 1.5% of the peak amplitude). I would say this is large. We also didn’t check the variance of the error, which might be large as well. I wouldn’t expect anything meaningful from this model anyway, as it’s impossible to train a model on examples from only a few very short speech signals and expect it to generate any possible signal. We certainly need more features – at least phonemes.
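The conversion is simple enough to check in a couple of lines. The helper name `to_sample_scale` is just for illustration; the two numbers are the ones quoted above.

```python
import math


def to_sample_scale(mse, peak):
    """Convert an MSE on normalized samples to an RMS error on raw samples."""
    return math.sqrt(mse) * peak


rmse = to_sample_scale(0.000218, 18102)
print(round(rmse, 2))         # 267.27
print(round(rmse / 18102, 4)) # fraction of the peak amplitude
```

Note that this is a root-mean-square error, so it weights large mistakes more heavily than a plain mean absolute error would.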

My plan for the following days is to work at the frame level, i.e. predicting the next frame from previous frames, and to take phonemes into account. This will be the core of the model I talked about in the previous post, which will also have another component that helps align input phonemes with output frames.

2 thoughts on “Just the next acoustic sample”

  1. Laurent February 17, 2014 at 4:40 pm

    “The data was normalized by subtracting the mean and dividing by the standard deviation. This makes the data in [-1,1] range. ”
    Not really. However, you could divide by the maximum absolute value of acoustic samples of the centered waveform to obtain something in the [-1, 1] range.

  2. Amjad Almahairi February 17, 2014 at 9:29 pm

    Yes, indeed. Thanks for pointing this out for me!
