I've been reading up on various other approaches to vocal synthesis, such as STRAIGHT and WORLD, to see if they offer anything that would be beneficial to synSinger.

In general, the analysis phases seem to have three common elements: pitch tracking, spectral envelope estimation, and measuring aperiodicity.

I'm using Praat to do most of my heavy lifting, so pitch tracking is handled there. I've also got the advantage that the source material is recorded on set pitches, so it's not as difficult a problem to solve.

It was interesting to read about the approach the DIO algorithm takes - it basically applies a number of low-pass filters and then looks at zero-crossings to come up with pitch candidates. Some time back I'd written a (very bad) pitch correction program that took a similar approach, looking at the zero-crossings of the wave (positive to negative and negative to positive) and then selecting the most likely candidate. My routine also tracked the wave maximum and minimum, but these turned out to be not terribly reliable, as they changed position as the wave moved.
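The basic zero-crossing idea is simple enough to sketch. This is just an illustration of the concept (no low-pass filtering stage, so it only works on clean input), not DIO and not my old program - the function name and averaging strategy are my own invention here:

```python
import numpy as np

def zero_crossing_pitch(wave, sample_rate):
    """Rough pitch estimate from the spacing of zero crossings,
    using both crossing directions as separate candidates.
    (Illustrative sketch only - real input would need low-pass
    filtering first, as DIO does.)"""
    negative = np.signbit(wave)
    # Indices where the sign flips, in each direction
    neg_to_pos = np.where(negative[:-1] & ~negative[1:])[0]
    pos_to_neg = np.where(~negative[:-1] & negative[1:])[0]
    candidates = []
    for crossings in (neg_to_pos, pos_to_neg):
        if len(crossings) >= 2:
            periods = np.diff(crossings)          # samples between crossings
            candidates.append(sample_rate / np.mean(periods))
    # Average the two directions as a crude "most likely" estimate
    return float(np.mean(candidates)) if candidates else 0.0

# A clean 220 Hz sine should come out very close to 220 Hz
sr = 44100
t = np.arange(sr) / sr
f0 = zero_crossing_pitch(np.sin(2 * np.pi * 220 * t), sr)
```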

But I'm pretty happy with Praat's pitch tracking, so no need to look at that.

Next came the problem of accurately calculating the spectral envelope. Most of the algorithms have to deal with harmonic amplitudes not falling at the exact centers of the FFT bins, which is a real problem.
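The bin-center problem is easy to demonstrate: the same unit-amplitude sine reads very differently off a plain FFT depending on whether its frequency lands on a bin center or halfway between two bins. A quick sketch (rectangular window, numbers picked just for illustration):

```python
import numpy as np

N, sr = 1024, 44100
t = np.arange(N) / sr
bin_width = sr / N                        # ~43 Hz per FFT bin

def peak_amplitude(freq):
    """Peak rfft magnitude of a unit-amplitude sine, scaled so a sine
    that lands exactly on a bin center reads 1.0."""
    spectrum = np.abs(np.fft.rfft(np.sin(2 * np.pi * freq * t)))
    return spectrum.max() / (N / 2)

on_center = peak_amplitude(bin_width * 10)     # exactly on bin 10
off_center = peak_amplitude(bin_width * 10.5)  # halfway between bins
# on_center reads ~1.0, but off_center reads only ~0.64 - the energy
# leaks into neighboring bins, understating the harmonic's amplitude
```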

Once these points are known, the spectral envelope needs to be smoothed to remove the temporal noise that's introduced by the FFT analysis.

My general approach has been to cross my fingers that the pitch tracking algorithm is accurate, and use the DFT to calculate the amplitudes of the harmonics. I use splines through the harmonic points to estimate the envelope.
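That approach can be sketched roughly like this - evaluating the DFT directly at multiples of the tracked pitch, then splining through the harmonic points. This is a minimal sketch of the general idea, not synSinger's actual code; note there's no window, so it assumes the frame covers a whole number of periods:

```python
import numpy as np
from scipy.interpolate import CubicSpline

def harmonic_amplitudes(wave, sr, f0, n_harmonics):
    """Evaluate the DFT directly at multiples of f0, so the harmonics
    don't have to land on FFT bin centers. (Sketch only - no window,
    so the frame should span a whole number of pitch periods.)"""
    n = len(wave)
    t = np.arange(n) / sr
    freqs = f0 * np.arange(1, n_harmonics + 1)
    amps = np.array([
        2 * abs(np.sum(wave * np.exp(-2j * np.pi * f * t))) / n
        for f in freqs
    ])
    return freqs, amps

# Recover known harmonic amplitudes from a test tone, then spline them
sr, f0 = 44100, 441                        # period is exactly 100 samples
t = np.arange(4000) / sr                   # 40 whole periods
wave = (1.0 * np.sin(2 * np.pi * f0 * t)
        + 0.5 * np.sin(2 * np.pi * 2 * f0 * t)
        + 0.25 * np.sin(2 * np.pi * 3 * f0 * t))
freqs, amps = harmonic_amplitudes(wave, sr, f0, 3)
envelope = CubicSpline(freqs, amps)        # spline through the harmonic points
```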

STRAIGHT does a number of steps to resolve these points, and… well, it seems to be darned good at doing it. It had me wondering if perhaps I should be using STRAIGHT (or WORLD) to calculate the values instead of writing my own routines. Because... why reinvent the wheel?

There's then the step of calculating the aperiodicity of the wave. My current approach is very old-fashioned, assuming a binary choice between harmonics and noise. I've been playing around with the idea of doing something more sophisticated, but keep putting that off.
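A binary harmonics-vs-noise call can be as simple as thresholding the frame's normalized autocorrelation at the pitch period - a periodic frame correlates strongly with itself one period later, noise doesn't. This is a hypothetical sketch of that kind of decision, not my actual code:

```python
import numpy as np

def is_periodic(frame, sr, f0, threshold=0.5):
    """Crude binary harmonics-vs-noise decision: normalized
    autocorrelation of the frame at the pitch period.
    (Hypothetical sketch - threshold picked arbitrarily.)"""
    lag = int(round(sr / f0))
    a, b = frame[:-lag], frame[lag:]
    denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
    if denom == 0.0:
        return False
    return float(np.dot(a, b) / denom) > threshold

sr = 44100
t = np.arange(2048) / sr
voiced = is_periodic(np.sin(2 * np.pi * 220 * t), sr, 220)            # periodic
noisy = is_periodic(np.random.default_rng(0).standard_normal(2048),
                    sr, 220)                                          # noise
```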

The part that's really got my interest is creating output from the analysis. As I understand it, STRAIGHT doesn't use phase information at all - just the pitch, spectral envelope and aperiodicity information.

So how is the glottal pulse generated? The papers I've looked at describe the spectral envelope as being convolved with a "spectrally flat" impulse train. I'm still digging my way through these papers to get an answer.
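My current reading of that description can be sketched naively as: turn the magnitude envelope into a zero-phase impulse response, then overlap-add one copy per pitch period, since a train of unit impulses is spectrally flat. This is my guess at the idea, not STRAIGHT's actual synthesis (which I'm still digging into), and the envelope here is made up:

```python
import numpy as np

def synthesize(envelope, f0, sr, duration):
    """Excite a magnitude spectral envelope with a spectrally flat
    impulse train - a naive zero-phase, overlap-add sketch of the
    idea, not STRAIGHT's actual synthesis."""
    n_fft = 1024
    # Zero-phase impulse response taken straight from the magnitude envelope
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    ir = np.fft.fftshift(np.fft.irfft(envelope(freqs)))
    # The excitation is just unit impulses, one per pitch period
    period = int(round(sr / f0))
    n = int(sr * duration)
    out = np.zeros(n + n_fft)
    for start in range(0, n, period):
        out[start:start + n_fft] += ir     # overlap-add one pulse response
    return out[:n]

# A smooth, made-up envelope: amplitude falling off with frequency
out = synthesize(lambda f: np.exp(-f / 2000.0), f0=220, sr=44100, duration=0.5)
```

The result is periodic at the pitch rate once past the initial transient, which is the point of the exercise - the phase information really isn't needed to get a pitched output.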