Date: 2020-04-10

Background noise is one of the most annoying things about video calls, and about calls of any kind: other voices, the clatter of a mouse or keyboard, or even animal sounds such as barking. Microsoft knows this and wants to eliminate these sounds with an artificial intelligence that works in real time to suppress them.

The American company announced this feature a few days ago, but only now do we know a little more about how it will work.

What Microsoft wants to avoid is precisely those awkward moments, common in video conferences, when someone is asked to mute their microphone because they are opening a package of food or because their dog is barking. The distinction between stationary and non-stationary noise matters here, since stationary noise is already removed by the company's current noise suppression system.

Currently, the system takes advantage of pauses in speech to work out which sound is the speaker's voice and which is background noise, such as the hum of a computer fan. The new feature for Microsoft's video calling services therefore focuses on the noises that are harder to identify and isolate: non-stationary noises, which may occur only once during a call.
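To make the pause-based idea concrete, here is a minimal sketch in Python. It is not Microsoft's implementation: it assumes that the quietest frames of a recording are pauses, averages their energy to estimate a steady noise floor, and flags louder frames as likely speech. The function names, the frame length and the `margin` threshold are all illustrative assumptions.

```python
import math
import random

def frame_energies(signal, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(s * s for s in signal[i:i + frame_len]) / frame_len
        for i in range(0, len(signal) - frame_len + 1, frame_len)
    ]

def estimate_noise_floor(energies, quietest_fraction=0.2):
    """Average energy of the quietest frames, assumed here to be pauses."""
    count = max(1, int(len(energies) * quietest_fraction))
    quietest = sorted(energies)[:count]
    return sum(quietest) / count

def label_frames(energies, noise_floor, margin=4.0):
    """True where a frame is clearly louder than the noise floor (likely speech)."""
    return [e > margin * noise_floor for e in energies]

# Toy signal (assumed data): a steady fan-like hiss with a tone in the middle
# standing in for a speaker's voice.
rng = random.Random(0)
frame_len = 160
signal = [0.01 * rng.uniform(-1, 1) for _ in range(frame_len * 30)]
for t in range(frame_len * 10, frame_len * 20):
    signal[t] += 0.5 * math.sin(2 * math.pi * 200 * t / 16000)

energies = frame_energies(signal, frame_len)
noise_floor = estimate_noise_floor(energies)
speech_flags = label_frames(energies, noise_floor)
```

A steady hiss like this is easy to separate because its energy barely changes from frame to frame. A single bark or a slammed door would look just like speech to this energy test, which is why non-stationary noise calls for a different approach.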

A bark, someone opening a packet of food, a glass falling, or a door slamming are all non-stationary noises that are very difficult to identify as noise. However, according to a Microsoft spokesperson, some sounds cannot be eliminated: musical instruments, or a person laughing, screaming or singing. Other people's voices occupy the same frequency range as the speaker's, so they cannot be isolated either.

How Microsoft is training its AI to isolate non-stationary background noise

“We trained a model to understand the difference between noise and speech, and then the model is trying to keep the speech going,” Robert Aichner, group program manager for Microsoft Teams, explains to VentureBeat. The training relied on a huge number of videos of people talking with noise in the background; with the help of a transcript, the AI can follow the conversation and so tell voice from noise.

“We take thousands of diverse speakers and over 100 types of noise. And then what we do is mix clean speech without noise with noise. Then we simulate a microphone signal. And then you also give the model a clean speech as the ground truth.” Although it may seem simple, Microsoft has actually faced multiple problems. The main one has been finding a sufficiently representative data set. How can those background noises be generated artificially?
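The mixing step described in the quote can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not Microsoft's pipeline: the function `mix_at_snr`, the signal-to-noise parameter `snr_db` and the toy signals are hypothetical. It scales a noise clip to a chosen level, adds it to clean speech to simulate a microphone signal, and returns the clean speech alongside it as the training target.

```python
import math
import random

def rms(signal):
    """Root-mean-square level of a sequence of samples."""
    return math.sqrt(sum(s * s for s in signal) / len(signal))

def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean` at the requested signal-to-noise ratio in dB.

    Returns (noisy, target): the simulated microphone signal and the
    clean speech that serves as the ground truth.
    """
    if len(noise) < len(clean):
        raise ValueError("noise clip must be at least as long as the speech clip")
    noise = noise[:len(clean)]
    noise_rms = rms(noise)
    if noise_rms == 0:
        raise ValueError("noise clip is silent")
    # SNR_dB = 20 * log10(rms_clean / rms_noise), solved for the noise level.
    target_noise_rms = rms(clean) / (10 ** (snr_db / 20))
    gain = target_noise_rms / noise_rms
    noisy = [c + gain * n for c, n in zip(clean, noise)]
    return noisy, list(clean)

# Toy stand-ins (assumed data): a tone for the speech, random hiss for the noise.
rng = random.Random(0)
sample_rate = 16000
clean = [0.5 * math.sin(2 * math.pi * 220 * t / sample_rate) for t in range(sample_rate)]
noise = [rng.uniform(-1, 1) for _ in range(sample_rate)]

noisy, target = mix_at_snr(clean, noise, snr_db=10)
```

Repeating this over many speakers, noise types and noise levels yields pairs of noisy input and clean target from which a model can learn, without anyone having to label noise by hand.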

Initially, both audiobooks and YouTube data sets with labeled data were used, but this material is drastically different from real video calls, especially the audiobooks. For this reason, Microsoft also decided to create videos specifically to feed into the system, so that the AI would also be trained on realistic situations.

The problem is that users' video calls cannot be recorded for this purpose either, for obvious privacy reasons. And even if Microsoft recorded, say, its own employees' video calls, someone would still have to label the background noises.

