Apple is bringing new Siri voices to iOS 11, and the female voice sounds amazing. I’d been using the male voice up until last night, when I decided to switch back to the female voice. Apple is using a new voice actor for Siri in iOS 11, but that’s not the only change.
Aside from sounding different, Siri is also smarter, and I wanted to find out what improvements were made to her brain. The Siri Team published a paper explaining the changes, and here’s what I learned.
Vocal Deep Learning
So what does it mean to use deep learning for a voice? In Apple’s new Machine Learning Journal, the Siri Team discussed how they accomplished this. Essentially, there are two techniques used in speech synthesis, which is the artificial production of human speech:
- Concatenative synthesis: Provides the highest-quality output if you give it a sufficiently large amount of recorded speech as input. Uses recordings from voice actors.
- Parametric synthesis: A model-based approach that uses a statistical model to generate the voice itself. Uses artificial intelligence, but is still based on a human voice as a starting point.
Concatenative synthesis is more widely used because if you give it high-quality recordings, you get a high-quality voice. Parametric synthesis is lower quality, even though it produces “highly intelligible and fluent speech.”
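To make the concatenative (unit-selection) idea concrete, here’s a toy sketch in Python. Everything in it (the tiny unit database, the cost functions, the brute-force search) is invented for illustration; real systems work with huge databases of short speech units and far richer features, but the shape of the problem is the same: pick recorded units that match the target and join together smoothly.

```python
# Toy unit-selection sketch: pick one recorded unit per target phone so that
# the total of "target cost" (how well a unit matches what the prosody model
# asked for) and "join cost" (how smoothly consecutive units connect) is minimal.
# All data and cost functions here are made up for illustration.
from itertools import product

# Hypothetical unit database: phone label -> candidates as
# (unit_id, pitch_hz, end_energy, start_energy)
UNITS = {
    "s":  [("s_01", 0, 0.2, 0.3), ("s_02", 0, 0.5, 0.4)],
    "iy": [("iy_01", 210, 0.6, 0.5), ("iy_02", 190, 0.4, 0.6)],
    "r":  [("r_01", 180, 0.5, 0.5), ("r_02", 220, 0.3, 0.4)],
}

def target_cost(unit, wanted_pitch):
    """Penalty for a unit whose pitch differs from the requested pitch."""
    return abs(unit[1] - wanted_pitch) / 100.0

def join_cost(prev_unit, unit):
    """Penalty for an audible discontinuity at the concatenation point."""
    return abs(prev_unit[2] - unit[3])

def select_units(targets):
    """targets: list of (phone, wanted_pitch). Brute-force search over all paths."""
    candidate_lists = [UNITS[phone] for phone, _ in targets]
    best_path, best_cost = None, float("inf")
    for path in product(*candidate_lists):
        cost = sum(target_cost(u, p) for u, (_, p) in zip(path, targets))
        cost += sum(join_cost(a, b) for a, b in zip(path, path[1:]))
        if cost < best_cost:
            best_path, best_cost = path, cost
    return [u[0] for u in best_path], best_cost

print(select_units([("s", 0), ("iy", 200), ("r", 200)]))
```

Production systems replace the brute-force loop with a dynamic-programming (Viterbi-style) search, but the costs being traded off are the same.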
Basically, Apple is combining the two into a hybrid synthesis system: a deep mixture density network (MDN) predicts which recorded speech units will sound best in context, a job traditionally done with hidden Markov models (HMMs). Translation: better algorithms.
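If you’re wondering what a mixture density network actually is: it’s a neural network whose output layer parameterizes a mixture of Gaussians (mixture weights, means, and variances) instead of a single value, so it can express uncertainty about its prediction. Below is a minimal sketch of that idea in PyTorch; the layer sizes, feature dimensions, and scalar target are placeholders, not Apple’s architecture.

```python
# Minimal mixture density network (MDN) sketch in PyTorch.
# The network maps an input feature vector to the parameters of a Gaussian
# mixture (weights, means, std-devs) over a scalar target, and is trained by
# minimizing the negative log-likelihood. All sizes are arbitrary placeholders.
import torch
import torch.nn as nn
from torch.distributions import Normal

class MDN(nn.Module):
    def __init__(self, in_dim=32, hidden=64, n_components=4):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh())
        self.pi = nn.Linear(hidden, n_components)         # mixture weight logits
        self.mu = nn.Linear(hidden, n_components)         # component means
        self.log_sigma = nn.Linear(hidden, n_components)  # component log std-devs

    def forward(self, x):
        h = self.trunk(x)
        return torch.softmax(self.pi(h), dim=-1), self.mu(h), torch.exp(self.log_sigma(h))

def mdn_loss(pi, mu, sigma, y):
    """Negative log-likelihood of targets y under the predicted mixture."""
    log_probs = Normal(mu, sigma).log_prob(y.unsqueeze(-1)) + torch.log(pi)
    return -torch.logsumexp(log_probs, dim=-1).mean()

# Toy usage: a batch of 8 made-up feature vectors and scalar targets.
x, y = torch.randn(8, 32), torch.randn(8)
model = MDN()
pi, mu, sigma = model(x)
print(mdn_loss(pi, mu, sigma, y))
```

Roughly speaking, the appeal for speech is that collapsing everything into one averaged prediction tends to produce over-smoothed, muffled audio, while a mixture can keep several distinct, plausible outcomes in play.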
The team trained the neural networks on over 20 hours of high-quality speech recordings sampled at 48 kHz. This higher sampling rate (48 kHz, up from 22 kHz in earlier Siri voices), along with advances in audio compression, results in a much more natural Siri voice.
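As a quick aside on why the sampling rate matters: by the Nyquist theorem, a digital recording can only represent frequencies up to half its sampling rate, so the jump from 22 kHz to 48 kHz roughly doubles the usable audio bandwidth. The two rates are the only numbers here taken from the article; the back-of-the-envelope check is just arithmetic:

```python
# Nyquist limit: a digital recording can only capture frequency content
# up to half its sampling rate.
for rate_khz in (22, 48):
    print(f"{rate_khz} kHz sampling -> content up to ~{rate_khz / 2:.0f} kHz")
# 22 kHz sampling -> content up to ~11 kHz
# 48 kHz sampling -> content up to ~24 kHz
```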
Always Close the Barn Door Tight
If you go to the Machine Learning Journal post Apple published, you can hear sample text being read by Siri in iOS 9, iOS 10, and finally iOS 11. In iOS 9 and 10, the voices sound decent, but you can still tell they are computer generated from the small glitchy sounds. Siri in iOS 11, though, honestly sounds like a real human speaking into a microphone, and it’s a new voice actor as well.
What’s Next?
After 10 years, Siri finally sounds like a human. So what’s next? What about looking like a human? Do we want our digital assistants to have a visual component, or just audio? That’s something we’ll have to ponder over the next 10 years. In the meantime, there is always Siri-tan. That’s right, Japan has already personified Siri into an anime character, and you can hear her sing.
You can see the anime character in the featured image. It’s weirdly sexualized, so it’s not appealing to most people, but it would be interesting to see an official Apple-created Siri.