The project for this course was posted yesterday. We’re going to work on the task of speech synthesis. My first impression is that it’s going to be a very challenging project, mainly because building a “good” speech synthesizer seems to me a complex task that requires a lot of engineering and signal processing expertise. However, I am pretty excited to see how we can use deep learning algorithms for this task.
The first question I asked myself was: how can we model speech synthesis as a statistical machine learning problem? I found a very good talk by Keiichi Tokuda that answers my question. Here are the slides and recordings of the talk. We’re mainly interested in the statistical formulation of the problem, because we can use the same formulation (or probably a simpler one?) in our deep learning algorithms. I will not restate what’s in the slides, but in summary, the speech synthesis problem can be defined as follows:
Given a set of speech waveforms ($X$), their text transcriptions ($W$), and a text to be synthesized ($w$), the output is a speech waveform ($x$). We can define a probability distribution over the output speech waveform, $p(x \mid w, X, W)$, and then either draw the output from that distribution or find the $x$ that maximizes it.
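Written out, with $X$ the training waveforms, $W$ their transcriptions, $w$ the text to be synthesized and $x$ the output waveform (these symbols are my assumption and may not match the slides exactly), the two options, sampling or taking the most likely waveform, look roughly like this:

```latex
x \sim p(x \mid w, X, W)
\qquad \text{or} \qquad
\hat{x} = \arg\max_{x} \; p(x \mid w, X, W)
```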
In the talk the problem was decomposed into several sub-problems:
- Feature extraction from speech waveforms
- Feature extraction from text (labeling text)
- Acoustic modeling (a parametric model of both speech and text transcriptions, built using the features of text and speech)
- Text analysis for the input text, or in other words extracting features of the input text
- Speech parameter generation, using both the acoustic model and the features of the input text. Those parameters are used to generate the output speech waveform. This is actually the bulk of the system.
- Waveform reconstruction from speech parameters
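As a rough sketch, the six steps above can be read as a chain of point estimates, one per sub-problem. The auxiliary symbols here are labels I am assuming for illustration ($O$ for speech features, $L$ for text labels, $\lambda$ for the acoustic model, $l$ and $o$ for the labels and speech parameters of the input text), alongside $X$, $W$, $w$ and $x$ for the training waveforms, their transcriptions, the input text and the output waveform. The slides may use different notation or a different ordering:

```latex
\begin{aligned}
\hat{O} &= \arg\max_{O} \; p(X \mid O)
  && \text{feature extraction from speech} \\
\hat{L} &= \arg\max_{L} \; P(L \mid W)
  && \text{feature extraction from text} \\
\hat{\lambda} &= \arg\max_{\lambda} \; p(\hat{O} \mid \hat{L}, \lambda)\, p(\lambda)
  && \text{acoustic modeling} \\
\hat{l} &= \arg\max_{l} \; P(l \mid w)
  && \text{text analysis of the input text} \\
\hat{o} &= \arg\max_{o} \; p(o \mid \hat{l}, \hat{\lambda})
  && \text{speech parameter generation} \\
x &\sim p(x \mid \hat{o})
  && \text{waveform reconstruction}
\end{aligned}
```

Each line fixes one quantity and passes it on to the next, which is what makes it possible to handle every step with its own model.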
In the case of this talk, each of these sub-problems is handled with a separate model, but in our case we can probably build one deep model that automatically learns a good representation (features) of speech and text. I am not sure, however, of two things. First, what kind of preprocessing is necessary for speech waveforms before we feed them into the model? Second, do we train the model to directly generate a speech waveform, or just speech parameters, and then use a separate fine-tuned model for waveform reconstruction? I need to see how this is handled in some of the recent deep learning papers.
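To make the preprocessing question a bit more concrete, here is a minimal sketch of one common option: cutting the waveform into short overlapping frames and taking the log-magnitude spectrum of each. This is only an assumption about what the preprocessing might look like, not something taken from the talk. The function names, the frame length and the hop size are all hypothetical choices, and the DFT is written out with the standard library only to keep the example self-contained (an FFT routine from a numerical library would be the usual choice in practice).

```python
import cmath
import math


def frame_signal(waveform, frame_len, hop):
    """Split a waveform (list of floats) into overlapping frames."""
    n_frames = 1 + (len(waveform) - frame_len) // hop
    return [waveform[i * hop: i * hop + frame_len] for i in range(n_frames)]


def log_magnitude_spectrum(frame, eps=1e-8):
    """Apply a Hann window to one frame and return log|DFT| for bins 0..N/2."""
    n = len(frame)
    windowed = [
        s * (0.5 - 0.5 * math.cos(2 * math.pi * k / (n - 1)))
        for k, s in enumerate(frame)
    ]
    spectrum = []
    for b in range(n // 2 + 1):
        # Naive DFT coefficient for frequency bin b.
        coeff = sum(
            s * cmath.exp(-2j * math.pi * b * k / n)
            for k, s in enumerate(windowed)
        )
        # eps avoids log(0) on silent bins.
        spectrum.append(math.log(abs(coeff) + eps))
    return spectrum


# Toy input (assumed values): 512 samples of a 1 kHz sine sampled at 8 kHz.
sample_rate = 8000
frame_len, hop = 64, 32
wave = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(512)]

features = [log_magnitude_spectrum(f) for f in frame_signal(wave, frame_len, hop)]
print(len(features), len(features[0]))  # number of frames, bins per frame
```

Each row of `features` is one frame, so the model would see a sequence of fixed-size vectors instead of raw samples. Whether this is better than feeding the raw waveform is exactly the open question above.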