My first attempt to determine a "good" match between the example phoneme and target phoneme didn't go that well.
The root cause was likely a logic error somewhere. After hammering at the code for a while, I decided to try implementing something a bit simpler: calculating an error value from the difference between the learned values and the filter values. The values are scaled by log() because the differences are quite small.
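A minimal sketch of that error calculation might look like the following. This is my reading of the approach, not the actual code; the function and parameter names are made up, and I'm assuming the filter values are positive magnitudes that get log-scaled before being compared.

```python
import math

def phoneme_error(learned, observed, eps=1e-12):
    """Error between learned phoneme values and observed filter values.

    Hypothetical sketch: each value is log-scaled (the raw differences
    are quite small), then the absolute differences are summed.
    A lower result means a better match. eps guards against log(0).
    """
    error = 0.0
    for l, o in zip(learned, observed):
        error += abs(math.log(l + eps) - math.log(o + eps))
    return error
```

With identical inputs the error is zero, and it grows as the observed filter values drift away from the learned ones, which matches the "lower value is a better match" convention in the plots below.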
After fixing a number of bugs in the code, I eventually got something that more or less worked. Here's the /ih/ phoneme - note that the tags have been manually placed.
Error comparing wave with /ih/ phoneme. Lower value is a better match.
Using the same /ih/ data, here's a comparison with another wave:
Error comparing another wave with the /ih/ phoneme.
Here's audio with an /ey/ phoneme. Since the start of the /ey/ phoneme sounds like the /ih/ phoneme, it's not surprising that there's a strong correspondence:
Error matching to a wave with the /ey/ phoneme.
On the other hand, it doesn't match strongly to the /er/ phoneme:
Error matching to a wave with the /er/ phoneme.
The recognizer code doesn't have to be terribly accurate - just "good enough". Hopefully the addition of a voiced/unvoiced flag will help as well.
At some point I should go back and try my original approach. But for now, this appears to work well enough for me to move forward. The next step is to store the learned phoneme values, and then implement a simple Viterbi algorithm to calculate the most likely path through a known set of phonemes.
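For reference, a simple Viterbi pass over per-frame error scores could be sketched like this. Everything here is an assumption about the eventual design: phonemes are treated as states, each frame contributes the recognizer's error score for each phoneme, and a transition cost discourages switching phonemes too freely. Since the errors are costs (lower is better), the path with the minimum total cost is the most likely one.

```python
def viterbi(frame_errors, transition_cost):
    """Find the lowest-total-cost phoneme path.

    frame_errors: list (one entry per frame) of {phoneme: error} dicts,
                  where lower error means a better match.
    transition_cost: dict {(prev_phoneme, cur_phoneme): cost}; unseen
                     transitions default to a penalty of 1.0.
    """
    # Cost and backpointer tables, initialized from the first frame.
    cost = [dict(frame_errors[0])]
    back = [{}]
    for t in range(1, len(frame_errors)):
        cost.append({})
        back.append({})
        for cur, err in frame_errors[t].items():
            # Pick the cheapest predecessor for this phoneme.
            prev = min(
                cost[t - 1],
                key=lambda p: cost[t - 1][p] + transition_cost.get((p, cur), 1.0),
            )
            cost[t][cur] = (
                cost[t - 1][prev] + transition_cost.get((prev, cur), 1.0) + err
            )
            back[t][cur] = prev
    # Trace back from the cheapest final state.
    state = min(cost[-1], key=cost[-1].get)
    path = [state]
    for t in range(len(frame_errors) - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return path[::-1]
```

For example, three frames where /ih/ scores well early and /ey/ scores well late should decode to a path that starts in /ih/ and switches to /ey/ at the end.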