All done! Thanks a lot to everyone who participated!
Idea:
To synthesize emotions into speech. I started with only anger because I got it working literally half a week before submission and only had time to clone one emotion. Why bother? Because synthesized speech sucks. You've all probably heard Stephen Hawking, or one of the voices that come with Windows. The idea is that adding some emotion would make it sound a lot better: more human, less robot.
Conclusion:
It worked. Sorta. Rated 2.5 out of 5 for anger, 2.4 out of 5 for quality. Whether that's bad depends on what you're doing with it. I'd compare it to, well, nice pixel art: you know what the picture's supposed to be, but it's not exactly photorealistic. There are a few artifacts in the speech, but that buzz actually sounds fine once you're used to it.
If you're pulling a prank, it works very well on someone who isn't expecting it (kinda like Photoshop). For speech synthesizers, it's great at making the output less boring: shift the pitch contour higher for a happier sound, lower for a sadder one.
Anger also has these spikes in the pitch and energy contours; that's about all there is to it. It's hard to simulate simply because those contours swing far more than the transformer can handle. Almost every other emotion has subtler differences, so it should work much better for those.
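If you want to picture what that contour fiddling looks like, here's a rough Python sketch (my own illustration, not the thesis code): the contour is assumed to be a frame-wise array of F0 values in Hz, and the scale factors and spike shape are made-up numbers.

```python
# Hypothetical sketch of shifting a pitch contour and adding "angry" spikes.
# f0 is assumed to be a frame-wise array of F0 values in Hz, 0.0 = unvoiced.
# Scale factors and spike shape are illustration values, not measured ones.
import numpy as np

def color_contour(f0, scale=1.0, spike_positions=(), spike_gain=1.4, spike_width=5):
    """Scale a pitch contour (happier > 1.0, sadder < 1.0) and optionally
    add short local spikes like the ones angry speech shows."""
    out = f0.astype(float).copy()
    voiced = out > 0                      # leave unvoiced frames untouched
    out[voiced] *= scale                  # global raise/lower of the contour
    for p in spike_positions:             # crude triangular spike around frame p
        lo, hi = max(0, p - spike_width), min(len(out), p + spike_width + 1)
        for i in range(lo, hi):
            if out[i] > 0:
                w = 1.0 - abs(i - p) / (spike_width + 1)
                out[i] *= 1.0 + (spike_gain - 1.0) * w
    return out

# e.g. "happier": color_contour(f0, scale=1.15)
#      "angrier": color_contour(f0, scale=1.05, spike_positions=[40, 120])
```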
It's also basically a functional pitch contour transformer, i.e. it can correct you if you're singing out of tune; sort of like Photoshop for voices in that sense. It can't really fix your voice if you just suck at singing, though, and if you're off key by around 50 Hz it starts to sound techno-ish. Then again, 50 Hz is a huge range to correct over, so you shouldn't be singing that badly in the first place.
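The singing-correction use is basically "snap every voiced frame to the nearest semitone and hand the result to the pitch modifier". A minimal sketch of that idea, again my own illustration rather than the actual correction rule from the thesis:

```python
# Snap each voiced frame of a pitch contour to the nearest equal-tempered
# semitone. This is an illustration of the idea, not the thesis method.
import numpy as np

def snap_to_semitones(f0, ref=440.0):
    out = f0.astype(float).copy()
    voiced = out > 0
    # distance from the reference pitch in (fractional) semitones
    semitones = 12.0 * np.log2(out[voiced] / ref)
    out[voiced] = ref * 2.0 ** (np.round(semitones) / 12.0)
    return out

# The snapped contour would then be fed to the pitch modifier as the target;
# big corrections (tens of Hz) are where the techno-ish effect shows up.
```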
Compared to what other people have done, well... it's the most successful emotional transformation so far, unless someone's put some top-secret research into something better.
Implementation:
Anyway, while the PhD students were throwing huge piles of statistics and hidden Markov models at it, basically trying to invert whatever knowledge they got from emotion detection, I took a much dumber game-designer approach: simply quantify emotions as a bunch of numbers.
So I split it into three variables: energy contour, duration modification, and pitch contour. I had a bunch of theories around these. One was to imitate the target emotion exactly, which didn't work out so well, because the voice simply won't go above a certain pitch.
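To give a rough idea of what "emotions as a bunch of numbers" means in practice, here's roughly the shape of it. The field names and values below are placeholders I made up for this post, not the measured ones from the thesis:

```python
# One target template per emotion, covering the three variables named above.
# All numbers are placeholders for illustration.
from dataclasses import dataclass

@dataclass
class EmotionTemplate:
    pitch_scale: float        # overall raise/lower of the pitch contour
    pitch_spikes: float       # how strongly local spikes are exaggerated
    energy_scale: float       # overall loudness (energy contour) change
    duration_factor: float    # < 1.0 = faster speech, > 1.0 = slower

ANGER = EmotionTemplate(pitch_scale=1.1, pitch_spikes=1.5,
                        energy_scale=1.3, duration_factor=0.9)
```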
The other theories were kinda meh and mostly proven wrong; one turned out to be kinda true. What held up is that people don't really notice a lot of the bad artifacts. I guess we're so used to horribly compressed music, video, and phone speech that it's fine to just mess the signal up a bit.
Technical stuff:
Well, I'm not sure how much to say here. I'm not going to give 100% of the details until the thesis is officially published by the uni, what with the patent possibilities and all.
The stuff I can say is common knowledge. It uses standard PSOLA (pitch-synchronous overlap-add, the usual pitch-modifying algorithm). In essence it's just a basic pitch modifier, with modifications to let it change duration too, even though that was theoretically a stupid thing to do. I think everyone was skeptical about that, lol.
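For anyone who hasn't met PSOLA: it cuts two-period, windowed grains around pitch marks and re-lays them with different spacing to change pitch (and, with a different time mapping, duration). Here's a toy sketch of that idea, assuming the pitch marks are already known; it leaves out basically every detail that matters and is not the thesis code.

```python
# Toy TD-PSOLA: windowed two-period grains are cut around analysis pitch
# marks and overlap-added at new spacing. Assumes pitch marks are given
# (at least two of them) and x is a 1-D numpy array.
import numpy as np

def psola(x, marks, pitch_factor=1.0, time_factor=1.0):
    marks = np.asarray(marks)
    periods = np.diff(marks)                       # local pitch periods (samples)
    out = np.zeros(int(len(x) * time_factor) + len(x) // 4)
    t_out = float(marks[0])                        # synthesis pitch-mark position
    while t_out < len(out) - max(periods):
        # find the analysis mark corresponding to this output time
        t_in = t_out / time_factor
        i = int(np.argmin(np.abs(marks[:-1] - t_in)))
        P = periods[i]
        # two-period Hann-windowed grain centred on the analysis mark
        start = max(marks[i] - P, 0)
        grain = x[start:marks[i] + P] * np.hanning(len(x[start:marks[i] + P]))
        # overlap-add the grain at the synthesis mark
        pos = max(int(t_out) - (marks[i] - start), 0)
        end = min(pos + len(grain), len(out))
        out[pos:end] += grain[:end - pos]
        # next synthesis mark: tighter spacing = higher pitch
        t_out += P / pitch_factor
    return out
```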
And uh... yeah. I don't think many of you really play around with this stuff, so a detailed technical explanation wouldn't help much. But if you've got questions, ask.
Why it shouldn't work:
I took a hell of a lot of shortcuts. If this were a mechanical thing, it'd be duct-taped all over the place. Surprisingly, it held together, and while I was asking my supervisor why it didn't work... it did. It worked so well that he asked me whether the synthesized speech was the original. I'm still scratching my head over why it works at all, but it does.
1. I never used any of the formulas suggested by the technical papers. I stared at them for like 4 months, went "screw this", and wrote some random code based on the pictures.
2. The PSOLA doesn't use interpolation. In English: the target pitch contour moves in big goddamn chunks instead of smoothly, and nobody noticed (there's a sketch of the difference after this list).
3. The pitch detector doesn't work reliably, and the system needs to know the current pitch before deciding what to change it to. It's like a plane flying and landing on autopilot while not being sure how high it is.
4. The pitch correction method is stupid. If someone screams across a range of 40 to 400 Hz, it just assumes an error and treats the whole range as screaming at 90 Hz. The "angry" speech shouldn't work at all; that's the first speech file, for those of you who heard it.
5. It mixes voiced, unvoiced, and silent speech together, which is epically stupid; they're very different things (in design, if not in theory). Some of you heard a big 'pop' in the middle of the second speech file. That seems to be the only noticeable one, though theoretically it should be popping all over the place.
6. There are like 20 pages written on how to do duration modification properly. My system uses a "choose it at random" approach (sketched below). Both work almost equally well, but my algorithm messes up epically once it stretches duration by more than about 1.5x.
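For point 2, here's roughly what "no interpolation" means, assuming target pitch values are only given at a few anchor frames (all numbers made up for illustration):

```python
# Stepped target contour (what the system does) vs. interpolated (what it
# arguably should do). Anchor frames and Hz values are made-up examples.
import numpy as np

anchor_frames = np.array([0, 50, 100, 150])
anchor_hz     = np.array([120.0, 180.0, 140.0, 160.0])
frames = np.arange(151)

# "big chunks": hold each anchor's value until the next anchor
stepped = anchor_hz[np.searchsorted(anchor_frames, frames, side="right") - 1]

# smooth version: linear interpolation between anchors
smooth = np.interp(frames, anchor_frames, anchor_hz)
```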
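And for point 6, a guess at what "choose it at random" duration modification looks like, operating on a list of PSOLA grains; again my own sketch, not the thesis implementation:

```python
# Stretch or shrink duration by randomly repeating or dropping grains.
# Hypothetical illustration of the "choose it at random" approach.
import random

def stretch_grains(grains, factor, seed=0):
    """Return roughly len(grains) * factor grains by randomly repeating
    (factor > 1) or dropping (factor < 1) them. Falls apart past ~1.5x."""
    rng = random.Random(seed)
    out = []
    for g in grains:
        out.append(g)
        if factor > 1.0 and rng.random() < factor - 1.0:
            out.append(g)                     # duplicate this grain
        elif factor < 1.0 and rng.random() < 1.0 - factor:
            out.pop()                         # drop it instead
    return out
```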
Anyway, all of this raises some big questions about why these shortcuts worked at all, and it accidentally opened up another branch of research into this stuff.