WORLD is an open-source library described as "a high-quality speech analysis, manipulation and synthesis system". I've been familiar with it for some time, and many of the general design ideas at the core of synSinger reflect those in WORLD.

WORLD uses three items to reconstruct speech:

  • The fundamental frequency
  • The spectral envelope
  • The aperiodicity measure

What got my attention is that WORLD doesn't retain phase information when reconstructing the vocal. Instead, it discards the original phase and generates what it considers to be a reasonable phase at synthesis time.
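For reference, one textbook way to come up with a plausible phase from magnitudes alone is minimum-phase reconstruction via the cepstrum. Whether this matches what WORLD actually does is an assumption on my part, not something I've confirmed from the code. The sketch below is plain Python with a naive DFT so it runs without any libraries; the function names (dft, minimum_phase_spectrum) are made up for illustration, and it assumes an even number of bins.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) DFT; slow, but fine for a small demo."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def minimum_phase_spectrum(magnitude):
    """Build a complex spectrum with the given magnitudes and minimum phase.

    `magnitude` is a full-length (symmetric) magnitude spectrum with an
    even number of bins. This is a generic sketch, not WORLD's code.
    """
    n = len(magnitude)
    # Log magnitude; the floor avoids log(0) on empty bins.
    log_mag = [math.log(max(m, 1e-12)) for m in magnitude]
    # Real cepstrum: inverse DFT of the log magnitude.
    cepstrum = [c.real for c in dft(log_mag, inverse=True)]
    # Fold the cepstrum onto its causal half.
    folded = [0.0] * n
    folded[0] = cepstrum[0]
    for i in range(1, n // 2):
        folded[i] = 2.0 * cepstrum[i]
    folded[n // 2] = cepstrum[n // 2]
    # Back to the frequency domain, then undo the log.
    return [cmath.exp(c) for c in dft(folded)]


# Demo: start from a short minimum-phase impulse response, throw away
# its phase, and rebuild a spectrum from the magnitudes only.
impulse = [1.0, 0.5] + [0.0] * 62
magnitude = [abs(v) for v in dft(impulse)]
spectrum = minimum_phase_spectrum(magnitude)
rebuilt = [v.real for v in dft(spectrum, inverse=True)]
print([round(v, 6) for v in rebuilt[:4]])
```

The rebuilt spectrum keeps the original magnitudes, and because the demo signal was minimum phase to begin with, the rebuilt impulse response should land very close to the original. A real voice signal isn't minimum phase, so there the result would only be a plausible stand-in for the true phase.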

I read through a number of papers on WORLD and watched several videos, but they glossed over the specific details of how the phase was approximated.

So I finally dove into the code, and... Yeesh. I simply don't have the technical background to understand what's going on, and I have no clue who I could turn to for information.

I have a feeling that, at best, it's going to be a long slog to figure out how it's calculating the phase. Hopefully the results will be better than what I got with the Griffin-Lim code. It may even make me revisit that code, to see if I can find out where I went wrong there.

But I'm also considering whether I should just use the WORLD library. After all, it already does pretty much what I'm trying to do. I could then focus on getting the framework to work with WORLD.