WORLD is an open source library described as "a high-quality speech analysis, manipulation and synthesis system". I've been familiar with it for some time, and a lot of the general design ideas in the core of synSinger reflect those in WORLD.
WORLD uses three items to reconstruct speech:
- The fundamental frequency
- The spectral envelope
- The aperiodicity measure
What got my attention is that WORLD doesn't retain phase information when reconstructing the vocal. Rather, it generates what it considers to be a reasonable value.
I read through a number of papers on WORLD and watched several videos, but they all glossed over the specific details of how the phase is approximated.
So I finally dove into the code, and... Yeesh. I simply don't have the technical background to understand what's going on, and I have no clue who I could turn to for information.
I have a feeling that, at best, it's going to be a long slog to figure out how it's calculating the phase. Hopefully the results will be better than what I got with the Griffin-Lim code. It may even make me revisit that code, to see if I can find out where I went wrong there.
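In the meantime, one textbook technique I can at least experiment with is minimum-phase reconstruction: given only a magnitude spectrum, take the log, transform it to the cepstrum, fold the negative-time half onto the positive-time half, and transform back. Whether WORLD's synthesis works this way is an assumption on my part, not something I've confirmed from its code, so treat this as a rough sketch of the general idea rather than a port. The helper names (`dft`, `minimum_phase_spectrum`) are made up for illustration, and the naive DFT is only there to keep the example dependency-free; a real FFT would be used in practice.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(n^2) DFT. Fine for a tiny demo; use a real FFT in practice."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def minimum_phase_spectrum(magnitude):
    """Build a complex minimum-phase spectrum from a full-length magnitude spectrum.

    `magnitude` must be the symmetric magnitude of a real signal (even length).
    This is the generic cepstral method, NOT a transcription of WORLD's code.
    """
    n = len(magnitude)
    # Log magnitude, with a small floor to avoid log(0)
    log_mag = [math.log(max(m, 1e-12)) for m in magnitude]
    # Real cepstrum: inverse transform of the log magnitude
    cepstrum = [c.real for c in dft(log_mag, inverse=True)]
    # Fold: keep time 0 and n/2, double positive time, zero negative time
    folded = [0.0] * n
    folded[0] = cepstrum[0]
    for i in range(1, n // 2):
        folded[i] = 2.0 * cepstrum[i]
    folded[n // 2] = cepstrum[n // 2]
    # Transform back and exponentiate to get magnitude plus a generated phase
    return [cmath.exp(v) for v in dft(folded)]


# Demo: start from a signal whose energy arrives late (0.5, then 1.0),
# throw away its phase, and rebuild a phase from the magnitude alone.
n = 64
late_energy = [0.5, 1.0] + [0.0] * (n - 2)
magnitude = [abs(v) for v in dft(late_energy)]

spectrum = minimum_phase_spectrum(magnitude)
impulse = [v.real for v in dft(spectrum, inverse=True)]

print([round(v, 4) for v in impulse[:4]])
```

The rebuilt spectrum keeps the original magnitudes, but the energy in the impulse response moves to the front: the first two samples should come out very close to 1.0 and 0.5, which is the original pair reversed. That's the appeal of the approach: the phase is fully determined by the envelope, so nothing extra needs to be stored.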
But I'm also considering whether I should simply use the WORLD library. After all, it pretty much already does what I'm trying to do. I could then simply focus on getting the framework to work with WORLD.