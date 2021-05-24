Training an artificial intelligence system to transcribe speech to text requires many pairs of audio and text. That is, we give the AI the sound of “this is a cat” together with its written transcription, so that it can associate each word with a sound. This works well for widely spoken languages, such as English or Spanish, but not for minority languages, for which little transcribed audio exists. Facebook, however, claims to have found a solution: wav2vec-U, with “U” for “Unsupervised”.

What is wav2vec-U? It is a way of building a speech recognition system that does not require any transcribed audio at all. It learns from audio and text that are not paired with each other, which eliminates the need for transcriptions. To do this, the system uses a GAN (generative adversarial network) and, according to Facebook, rivals the best supervised systems of a few years ago.

A world of possibilities for transcribing minority languages

This new approach opens the door to much better speech recognition tech for languages that have historically been overlooked by cutting edge technologies. The way it works is fascinating: pic.twitter.com/t7wTyaWRBv – Mike Schroepfer (@schrep) May 21, 2021

As detailed by Alexei Baevski, Wei-Ning Hsu, Alexis Conneau, and Michael Auli on the Facebook AI blog, their method begins by learning the structure of speech from unlabeled audio. Using their previous model, wav2vec 2.0, they segment each voice recording into speech units that correspond to individual sounds. For example, the spoken word “cat” has three sounds: “/K/”, “/AE/” and “/T/”.
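
As a rough illustration of the segmentation step, the sketch below assumes each short audio frame has already been given a discrete unit ID (for instance by clustering wav2vec 2.0 features). The IDs, the sample data and the function name collapse_frames are hypothetical and are not taken from Facebook's code; the point is only that runs of identical IDs can be merged into sound-sized segments.

```python
from itertools import groupby


def collapse_frames(frame_units):
    """Merge consecutive frames that share a unit ID.

    Returns a list of (unit_id, start_frame, end_frame) tuples,
    where end_frame is exclusive.
    """
    segments = []
    position = 0
    for unit, run in groupby(frame_units):
        length = len(list(run))
        segments.append((unit, position, position + length))
        position += length
    return segments


# Hypothetical frame-level unit IDs for a recording of the word "cat".
frames = [7, 7, 7, 2, 2, 2, 2, 5, 5]
print(collapse_frames(frames))  # [(7, 0, 3), (2, 3, 7), (5, 7, 9)]
```

Each of the three resulting segments would then be handed to the generator as one candidate sound.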

To teach the system to recognize the words in an audio recording, they used a GAN which, like all GANs, consists of a generator and a discriminator. The generator takes each audio segment, predicts the phoneme corresponding to that sound in the language, and tries to fool the discriminator. The discriminator is itself another neural network, trained on the generator’s text outputs and on real text from various sources, split into phonemes. This is important: it is real text from unrelated sources, not transcripts of the audio we are trying to transcribe.
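
To make “real text divided into phonemes” concrete, here is a minimal sketch that assumes a tiny hand-written pronunciation lexicon. The LEXICON table and the phonemize helper are illustrative only; a real pipeline would rely on a full phonemizer for the target language.

```python
# Toy pronunciation lexicon (ARPAbet-style symbols); hypothetical and
# far smaller than anything a real system would use.
LEXICON = {
    "this": ["DH", "IH", "S"],
    "is": ["IH", "Z"],
    "a": ["AH"],
    "cat": ["K", "AE", "T"],
}


def phonemize(sentence, lexicon=LEXICON):
    """Turn a sentence into a flat phoneme list; unknown words become <unk>."""
    phonemes = []
    for word in sentence.lower().split():
        phonemes.extend(lexicon.get(word, ["<unk>"]))
    return phonemes


print(phonemize("This is a cat"))
# ['DH', 'IH', 'S', 'IH', 'Z', 'AH', 'K', 'AE', 'T']
```

Sequences like this one, taken from text that has nothing to do with the audio, are what the discriminator sees as “real” examples.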

The job of the discriminator is to evaluate whether the predicted phoneme sequences (“/K/”, “/AE/” and “/T/” in the case of “cat”) look realistic. The generator’s first transcripts are lousy, but with time and feedback from the discriminator they become more and more accurate. It is quite an achievement, since the system is never told that the sound of “cat” is transcribed as “cat”; it works out, from the sounds that make up the word, that it should be written that way.
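
The tug-of-war between the two networks can be sketched with the textbook GAN losses. This is an assumption for illustration: the article does not spell out the training objective, and the published system may use a different or extended one. The scores below are made-up discriminator outputs between 0 and 1, read as “probability that this phoneme sequence is real text”.

```python
import math


def discriminator_loss(score_real, score_fake):
    """Binary cross-entropy: low when real text scores near 1 and
    generated sequences score near 0."""
    return -(math.log(score_real) + math.log(1.0 - score_fake))


def generator_loss(score_fake):
    """Non-saturating generator loss: low when the discriminator
    is fooled into scoring a generated sequence near 1."""
    return -math.log(score_fake)


# Hypothetical scores: early in training the generator's output is
# easy to spot (0.05); later it is much harder to tell apart (0.60).
early = generator_loss(0.05)
late = generator_loss(0.60)
print(early > late)  # True
```

As the generator’s loss falls, its phoneme sequences are becoming harder to distinguish from real phonemized text, which is the sense in which the transcripts “get more accurate”.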

Our new AI system learned speech recognition in English with zero speech to text training data: researchers just gave it lots of audio, and it figured out what the words were. But it goes way beyond that – it learned Swahili too! pic.twitter.com/H69GS0c7iG – Mike Schroepfer (@schrep) May 21, 2021

To test the system, Facebook used the TIMIT and Librispeech benchmarks and claims that “wav2vec-U is as accurate as the state of the art from just a few years ago, without using any tagged training data”. That said, these two benchmarks measure performance in English, a language with a large corpus of spoken and transcribed text. The Facebook system is more interesting for minority languages, such as Swahili, Tatar or Kyrgyz, whose corpus of data is smaller.
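
The article does not say how accuracy is scored. Benchmarks of this kind are usually reported as a phoneme or word error rate, that is, the edit distance between the reference and the system output divided by the reference length. The sketch below assumes that metric; the function names are illustrative.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two token sequences."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, start=1):
        current = [i]
        for j, hyp_token in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_token != hyp_token)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def error_rate(reference, hypothesis):
    """Edit distance normalised by reference length (reference must be non-empty)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# One wrong phoneme out of three in a hypothetical transcript of "cat".
print(error_rate(["K", "AE", "T"], ["K", "AH", "T"]))
```

The same function gives a word error rate when the tokens are words instead of phonemes.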

It is, without a doubt, a great step forward in speech transcription. It remains to be seen how Facebook will put it to use, if it ever does. Meanwhile, Zuckerberg’s company has published the code needed to build this speech recognition system. It is available on GitHub, where anyone can tinker with it and test it out.

More information | Facebook AI