I've been looking at the WORLD code, trying to figure out how the spectral envelopes are coded.
WORLD is doing something interesting - it's using DCT coefficients and non-linear Mel space.
The human ear doesn't hear sound linearly - there are some parts of the audio spectrum that it pays a lot of attention to, and some parts that it doesn't pay much attention to at all.
This is useful, because it allows compression of data - fewer numbers have to be stored for the parts the ear is relatively insensitive to.
The process involves mapping data from one space (frequency) to another (e.g., non-linear Mel space), and it's pretty straightforward. WORLD appears to perform this mapping using simple linear interpolation.
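Here's a minimal sketch of that kind of mapping, in plain Python rather than WORLD's C++. The Mel formula and its constants are the commonly used ones and are an assumption on my part, not something I've confirmed against WORLD. The names hz_to_mel, mel_to_hz, interp_linear and to_mel_axis are made up for illustration.

```python
import math

def hz_to_mel(f_hz):
    # Common Mel formula; the constants are assumed, not taken from WORLD.
    return 1127.0 * math.log(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    # Inverse of hz_to_mel.
    return 700.0 * (math.exp(mel / 1127.0) - 1.0)

def interp_linear(xs, ys, x):
    """Linearly interpolate y at x. xs must be ascending; x is clamped to its range."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    # Binary search for the segment [xs[lo], xs[hi]] that contains x.
    lo, hi = 0, len(xs) - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if xs[mid] <= x:
            lo = mid
        else:
            hi = mid
    t = (x - xs[lo]) / (xs[hi] - xs[lo])
    return ys[lo] + t * (ys[hi] - ys[lo])

def to_mel_axis(envelope, sample_rate, num_mel_points):
    """Resample an envelope from evenly spaced Hz bins to evenly spaced Mel points."""
    num_bins = len(envelope)
    nyquist = sample_rate / 2.0
    freqs_hz = [nyquist * i / (num_bins - 1) for i in range(num_bins)]
    mel_max = hz_to_mel(nyquist)
    out = []
    for k in range(num_mel_points):
        mel = mel_max * k / (num_mel_points - 1)
        # Each Mel point is looked up at its equivalent frequency in Hz.
        out.append(interp_linear(freqs_hz, envelope, mel_to_hz(mel)))
    return out
```

Because the Mel points are evenly spaced in Mel but not in Hz, the low frequencies end up sampled much more densely than the high ones, which is the whole point of the mapping.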
Once the data has been mapped to Mel space, a DCT (Discrete Cosine Transform) is applied, and the first n (lowest-order) coefficients are kept.
So the process of converting the spectral envelope into DCT coefficients seems to be (more or less):
- Interpolate the spectral envelope from frequency space to Mel space
- Perform the DCT in Mel space
- Gather the first n coefficients from the bins of the DCT
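The last two steps can be sketched in a few lines of plain Python. This assumes the envelope has already been resampled onto a Mel axis, and it uses a naive, unnormalised DCT-II. WORLD's own transform and scaling may well differ, and the names dct2 and encode_envelope are hypothetical.

```python
import math

def dct2(x):
    """Naive unnormalised DCT-II: X[k] = sum over n of x[n] * cos(pi/N * (n + 0.5) * k)."""
    n_points = len(x)
    coeffs = []
    for k in range(n_points):
        total = 0.0
        for n in range(n_points):
            total += x[n] * math.cos(math.pi / n_points * (n + 0.5) * k)
        coeffs.append(total)
    return coeffs

def encode_envelope(mel_envelope, num_coeffs):
    """Transform a Mel-space envelope and keep only the first num_coeffs coefficients."""
    if not 0 < num_coeffs <= len(mel_envelope):
        raise ValueError("num_coeffs must be between 1 and len(mel_envelope)")
    return dct2(mel_envelope)[:num_coeffs]
```

A perfectly flat envelope ends up entirely in coefficient 0, and a slow ripple lands in one of the low bins, so dropping the high bins only throws away fine detail.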
The process of converting the coefficients back to a spectral envelope is the inverse:
- Fill an empty set of DCT bins with zeros
- Put the saved coefficients into the first n bins of the DCT
- Perform an IDCT to get the spectral envelope in Mel space
- Interpolate the spectral envelope from Mel space to frequency space
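The first three steps of the decode can be sketched like this in plain Python. It assumes the coefficients came from an unnormalised DCT-II, so the inverse is a scaled DCT-III. That scaling is my assumption, not necessarily what WORLD does. The names idct2 and decode_envelope are hypothetical, and the final interpolation back to frequency space isn't shown.

```python
import math

def idct2(coeffs):
    """Inverse of an unnormalised DCT-II (a DCT-III scaled by 1/N)."""
    n_points = len(coeffs)
    out = []
    for n in range(n_points):
        total = coeffs[0]
        for k in range(1, n_points):
            total += 2.0 * coeffs[k] * math.cos(math.pi / n_points * (n + 0.5) * k)
        out.append(total / n_points)
    return out

def decode_envelope(saved_coeffs, num_mel_points):
    """Zero-fill the DCT bins, drop in the saved coefficients, and invert."""
    if len(saved_coeffs) > num_mel_points:
        raise ValueError("more coefficients than Mel points")
    bins = [0.0] * num_mel_points           # fill the bins with zeros
    bins[:len(saved_coeffs)] = saved_coeffs  # saved values go in the first n bins
    return idct2(bins)                       # envelope in Mel space
```

The zeroed high-order bins are what make the reconstruction lossy: whatever detail lived there is simply gone, and the result is a smoothed version of the original envelope.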
This raises the question of why WORLD doesn't simply sample the spectral envelope at evenly spaced points in Mel space and save the values. I haven't tested it, but I'm guessing the resulting envelope would be quite similar.
I suspect the DCT is used because it's useful for training neural networks. The DCT is well known for being able to de-correlate data, which is a really helpful feature when training a neural network. And this is one of WORLD's stated goals.
In any event, I think I've got a slightly better grasp on how WORLD is using DCT coefficients.