Google's computer vision has seen major improvements over the years, a fact highlighted by the artificial intelligence chops of its Photos app, which recognises faces, objects and more. Now Google wants to do the same with voice, specifically through audio-visual speech separation.
Say you are in a crowd of people and a friend calls out to you. Even though you may not know where your friend is standing, his or her voice has a certain pattern which you can immediately recognise despite the noisy crowd around you. A machine may not be able to do that as efficiently. Just try controlling a smart speaker at your next house party when you want to play your music and you will know what we are talking about.
Google's researchers have developed a deep learning system that separates voices by looking at people's faces as they speak and then boosts those voices. The team first trained a neural network to understand and recognise the individual voices of people talking by themselves. It then simulated virtual parties by mixing those individual voices together, to teach the AI to isolate multiple voices into separate audio tracks.
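The "virtual party" idea can be illustrated with a minimal sketch. It assumes the training data is built by summing clean single-speaker recordings, plus some background noise, into one mixture while keeping the clean recordings as the targets the network should recover. The function names, the sine-wave stand-ins for voices and the noise gain are all hypothetical and chosen only for illustration; this is not Google's actual pipeline.

```python
import math
import random


def synth_voice(freq_hz, n_samples, sample_rate=8000):
    """Stand-in for a clean single-speaker recording: a plain sine tone."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n_samples)]


def mix_tracks(clean_tracks, noise=None, noise_gain=0.3):
    """Sum clean tracks (plus optional background noise) into one mixture.

    Returns (mixture, targets): the mixture is the model input, and the
    untouched clean tracks are the per-speaker targets it should recover.
    """
    n = min(len(track) for track in clean_tracks)
    mixture = [sum(track[i] for track in clean_tracks) for i in range(n)]
    if noise is not None:
        # Assumes the noise recording is at least as long as the mixture.
        mixture = [m + noise_gain * noise[i] for i, m in enumerate(mixture)]
    targets = [track[:n] for track in clean_tracks]
    return mixture, targets


# Two hypothetical "speakers" and a noisy "crowd" (seeded for repeatability).
rng = random.Random(0)
speaker_a = synth_voice(220.0, 800)
speaker_b = synth_voice(330.0, 800)
crowd = [rng.uniform(-1.0, 1.0) for _ in range(800)]

mixture, targets = mix_tracks([speaker_a, speaker_b], noise=crowd)
```

Because the mixture is built from known clean recordings, every training example comes with the correct answer attached, which is what makes this kind of simulated data convenient for teaching a model to separate voices.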
In the test clip above, Google has managed to separate the voices of the two stand-up comedians from the crowd (and from each other) by recognising their faces and generating an audio track for each individual's speech. As the video progresses, you can push the slider to either end to hear one particular comedian's voice more clearly, with the audience laughter drowned out.
According to Google, the technique involves combining the auditory and visual signals of an input video to separate the speech. Google looks at the movements of a person's mouth and correlates them with the sounds produced as the person speaks. Using the visual element in addition to the audio, as opposed to audio-only separation, helps produce clean speech tracks, each associated with a particular visible speaker in the video.
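To make the idea of correlating mouth movement with sound concrete, here is a toy sketch. It assumes we already have, for each video frame, a number describing how open each face's mouth is and how loud each separated audio track is, and it simply pairs each face with the track whose loudness follows its mouth most closely. The measurements, names and the use of plain Pearson correlation are assumptions for illustration only; Google's system learns this relationship with a neural network rather than a hand-written rule.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat signal carries no information to correlate
    return cov / math.sqrt(var_x * var_y)


def match_tracks_to_faces(mouth_openness, track_energy):
    """For each face, pick the audio track whose energy follows its mouth best."""
    assignment = {}
    for face, mouth in mouth_openness.items():
        assignment[face] = max(
            track_energy,
            key=lambda name: pearson(mouth, track_energy[name]),
        )
    return assignment


# Hypothetical per-video-frame measurements (made up for illustration).
mouth_openness = {
    "face_left": [0.1, 0.8, 0.9, 0.2, 0.1, 0.7],
    "face_right": [0.9, 0.1, 0.1, 0.8, 0.9, 0.2],
}
track_energy = {
    "track_1": [0.8, 0.2, 0.1, 0.9, 0.7, 0.1],
    "track_2": [0.2, 0.9, 0.8, 0.1, 0.2, 0.6],
}

assignment = match_tracks_to_faces(mouth_openness, track_energy)
```

In this made-up data the left face's mouth opens on the same frames where the second track is loud, so the two are paired, which is the intuition behind tying each clean speech track to a visible speaker.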
This can be useful when you are communicating over video chat services. In fact, Google is exploring opportunities to test this feature in products such as Hangouts and Duo. It would boost the voice of the person you are talking to, even if they are in a crowded room.
"We believe this capability can have a wide range of applications, from speech enhancement and recognition in videos, through video conferencing, to improved hearing aids, especially in situations where there are multiple people speaking," said the Google Research Blog.
Google also believes that this technology can help automatic closed captioning systems in cases where multiple speakers talk over each other. It can be used as a pre-processing step for speech recognition.
According to Engadget, the technology could also be misused, for instance for eavesdropping in public. China could easily implement something like this on a mass scale, considering how it has been using facial recognition technology to identify law-breakers in the country.