ChatGPT captured the imagination of many people across the world, and for good reason. Then there was DALL-E, an AI system that can create realistic images and art from text descriptions.
Now Microsoft has entered the field of advanced AI systems with its new VALL-E model.
What is VALL-E?
VALL-E is an AI model that can clone a human voice from a very short sample: about three seconds of audio. This is the feature that makes VALL-E special. While many systems can clone a human voice, they often require substantial input.
“VALL-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt,” Microsoft says in its paper on the system.
According to Microsoft, VALL-E has been trained on 60,000 hours of English-language speech from more than 7,000 different speakers.
Potential security issues
Microsoft has decided not to make the VALL-E code open source because of potential risks, including impersonation.
“Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonation,” says Microsoft.
The risk is considerable, given the stir that deepfakes are already causing around the world. Prominent figures could be the biggest victims of voice impersonation, which could have far-reaching consequences.
Potential real-world uses
Despite the risk of impersonation, VALL-E could have several legitimate uses. It could, for example, be used in speech editing, where certain words or phrases in a recording are corrected.
It could also be used in interactive virtual learning and in customer service automation.
Since the technology is still new, though, it remains to be seen how Microsoft plans to develop VALL-E further and what it could become.