Introduction

If you've been dipping your toes in the domain of automatic speech recognition ("ASR"), there's a good chance you've come across wav2vec 2.0 ("wav2vec2") from Meta AI Research. There are some excellent technical resources, not the least of which is the original wav2vec2 paper itself, that describe how the machine learning ("ML") model works. Also, the Meta AI Research team has a nice overview of wav2vec2 on their website. I would encourage you to take a look at it since it offers a nice summary of the academic paper and since the wav2vec2 model illustrations in this article are sourced from that page.

With the preceding in mind, there don't seem to be many write-ups that explain wav2vec2 in "plain English." I try to do just that with this article. This article assumes that you understand some basic ML concepts and that you're interested in understanding how wav2vec2 works at a high level, without getting too deep "into the weeds." Accordingly, the subsequent sections try to avoid a lot of the technical details in favor of simple explanations and useful analogies when appropriate.

That being said, it is helpful to know early on that wav2vec2 has 3 major components: the Feature Encoder, the Quantization Module, and the Transformer. Each is discussed below, starting with some basic ideas and building up to more complex (but still digestible) points. Keep in mind that wav2vec2 can be used for purposes beyond ASR. What follows here, however, discusses the model in an ASR-specific context.

A Gentle Overview

At the time it was introduced in 2020, wav2vec2 offered a novel framework for building ASR systems. What was so special about it? Before wav2vec2, ASR systems were generally trained using labeled data. That is, prior models were trained on many examples of speech audio where each example had an associated transcription.

To explain the idea, picture the waveform of a short speech recording. It is not entirely clear what such a waveform represents just by looking at it. But, if you are told that the speaker who generated the audio said the words "hello world", you can probably make some intelligent guesses as to which parts of the waveform correspond with the text that represents it. You might surmise - correctly - that the first segment of the waveform is associated with the word "hello". Similarly, ASR models can learn how to make associations between spoken audio waveform segments and written text.

However, as the original wav2vec2 investigators point out in their paper, "[many] speech recognition systems require thousands of hours of transcribed speech to reach acceptable performance which is not available for the vast majority of the nearly 7,000 languages spoken worldwide." So, the wav2vec2 investigators invented a new model where it is not necessary to have "thousands of hours of transcribed speech" in order to train the system.

They reference a useful human analogy: babies don't learn to speak by hearing a word and then immediately seeing a text representation of that word. They learn representations of speech by listening to people in their environment (e.g., their parents, siblings, etc.). wav2vec2 learns in an analogous way: by listening first. Of course, how this is achieved is the point of the discussion in this article.

Bear in mind that wav2vec2 is broadly designed to accomplish 2 things:

1. Learn what the speech units should be given samples of unlabeled audio.
2. Predict correct speech units.

At this point, you don't need to completely understand what is meant by these points. They will be explained below. Just keep them in the back of your head for now.

Learning Speech Units

Imagine you have a huge dataset of audio samples - say for some number of English speakers.
Even without a formal background in phonetics, you might understand intuitively that the English language is vocalized using a set of basic sounds that are "strung together" to form words, sentences, etc. Of course, if you're an English speaker, you don't think about speaking in this way and your vocalizations of whatever you want to say are more or less automatic! But, the point is that the spoken English language - and really any spoken language - can be decomposed into more basic, discrete sounds.

If we could somehow coax an ASR model to "extract" these basic sounds, it would allow us to encode any audio sample of spoken language using them. This is what wav2vec2 does by pretraining on audio data. Pretraining, in this context, means that the first part of the model's training is self-supervised insofar as it is not explicitly "told" what the basic sounds should be for a given set of audio data.

Diving down a bit more, the system is "fed" a large number of audio-only examples and, from those examples, is able to learn a set of basic speech units. Thus, every audio example is effectively composed of some combination of those speech units, in the same way that you can break a spoken audio sample into a sequence of phonemes. Importantly, the basic speech units that wav2vec2 learns are shorter than phonemes and are 25 milliseconds in length.

The question that arises at this point is: How does wav2vec2 learn these speech units from audio alone?

The process of learning speech units begins with the Feature Encoder. wav2vec2 "encodes speech audio via a multi-layer convolutional neural network." Convolutional neural networks, or CNNs, are models that allow us to learn features from a given input without those features being explicitly identified beforehand. Each layer of a CNN can be thought of as extracting features from an input, with those features becoming increasingly complex as you move up to higher layers.

In the case of audio data, you might imagine the first layer in a CNN examining windows of audio information and extracting low-level features, such as primitive sounds. A later layer in the same CNN, leveraging the lower-level features extracted in earlier layers, would encode higher-level features, such as sounds approximating phonemes.

Following this idea, wav2vec2 can begin to "learn what the speech units should be given samples of unlabeled audio" by passing time slices of each audio example into the Feature Encoder and generating a latent representation of each slice. However, the collection of latent representations does not represent discrete speech units. These representations must be discretized in some way. This is accomplished by passing the output of the Feature Encoder to a Quantization Module. Effectively, the Quantization Module takes all the different audio representations generated by the Feature Encoder and reduces them to a finite set of speech units.

It's worthwhile to ask at this point if wav2vec2 should be pretrained on a single language or a variety of languages. Logic tells us that speech units that represent multiple languages, rather than a single language, are likely to be more useful when designing ASR systems that can be used across many languages. To that end, pretraining wav2vec2 with a selection of multilingual audio samples enables the model to produce speech units that do in fact capture multiple languages. The wav2vec2 investigators noted the value behind this approach since "for some languages, even [audio] data is limited." Their original findings determined "that some units are used for only a particular language, whereas others are used in similar languages and sometimes even in languages that aren't very similar."

Predicting Speech Units

The inventory of speech units is a first step toward being able to encode spoken language audio samples.
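To make the idea of an inventory of speech units a bit more concrete, here is a minimal sketch of what "reducing latent representations to a finite set of units" can look like. It is a deliberate simplification: it picks the nearest entry in a tiny, made-up codebook, whereas the wav2vec2 paper learns its codebooks with Gumbel-softmax product quantization. The names `CODEBOOK`, `LATENTS`, `nearest_unit`, and `quantize` are hypothetical and chosen only for illustration.

```python
import math

def nearest_unit(latent, codebook):
    """Return the index of the codebook entry closest to `latent`."""
    distances = [math.dist(latent, entry) for entry in codebook]
    return distances.index(min(distances))

def quantize(latents, codebook):
    """Map each latent vector (one per time slice) to a discrete unit id."""
    return [nearest_unit(vector, codebook) for vector in latents]

# Hypothetical codebook of three 2-D "speech units" (real ones are learned
# during pretraining and have many more dimensions and entries).
CODEBOOK = [(0.0, 0.0), (1.0, 1.0), (-1.0, 0.5)]

# Hypothetical Feature Encoder outputs, one vector per time slice.
LATENTS = [(0.1, -0.1), (0.9, 1.2), (-0.8, 0.4), (0.95, 0.9)]

unit_ids = quantize(LATENTS, CODEBOOK)
print(unit_ids)  # [0, 1, 2, 1]
```

The point of the sketch is only that an unbounded variety of continuous vectors collapses to a small set of integer ids, so that an audio example can be written down as a sequence of units.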
But, what we really want to achieve is to train wav2vec2 on how these units relate to one another. In other words, we want to understand what speech units are likely to occur in the same context as one another. wav2vec2 tackles this task via the Transformer layer. The Transformer essentially allows wav2vec2 to learn, in a statistical sense, how the speech units are distributed among the various audio examples. This understanding facilitates the encoding of audio samples that the model will "see" after pretraining.

Finetuning

Ultimately, an ASR system needs to be able to generate a text transcription for a given sequence of audio that it hasn't "seen" before. After pretraining via the steps described above, wav2vec2 is finetuned for this purpose. This time the model is explicitly shown examples of audio samples and their associated transcriptions. At this point, the model is able to utilize what it learned during pretraining to encode audio samples as sequences of speech units and to map those sequences of speech units to individual letters in the vocabulary representing the transcriptions (i.e. the letters "a" to "z" in the case of English). The learning during finetuning completes the training of the wav2vec2 model and allows it to predict the text for new audio examples that were not part of its training during finetuning.

Conclusion

Of course, the low-level mechanics of wav2vec2 are far more complex than what is presented above. However, to reiterate, the idea of this article is to provide you with a simple, conceptual understanding of how the model works and how it is trained. wav2vec2 is a very powerful ML framework for building ASR systems, and its XLS-R variation, introduced in late 2021, was trained on 128 languages, thus providing an improved platform for designing ASR models across multiple languages.

As mentioned in the Introduction, there are a number of excellent technical resources available to help you learn more.
In particular, you may find those provided by Hugging Face to be especially useful.
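As a small postscript to the Finetuning section, here is a sketch of how per-slice letter predictions can be turned into text. The article does not go into this, but the wav2vec2 paper finetunes with a Connectionist Temporal Classification (CTC) loss, where the model emits one label per time slice, including a special "blank" label. The simplest way to read such output (greedy decoding, assumed here) is to merge consecutive repeats and then drop the blanks. The function name `ctc_greedy_decode`, the `_` blank symbol, and the example frames are hypothetical.

```python
BLANK = "_"  # hypothetical symbol standing in for the CTC "blank" label

def ctc_greedy_decode(frame_labels, blank=BLANK):
    """Collapse consecutive repeated labels, then drop blanks."""
    decoded = []
    previous = None
    for label in frame_labels:
        # Keep a label only when it differs from the previous slice
        # and is not the blank symbol.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return "".join(decoded)

# Made-up per-slice predictions for a speaker saying "hello".
# The blank between the two runs of "l" is what preserves the double letter.
frames = list("hh_e_ll_lloo")
print(ctc_greedy_decode(frames))  # hello
```

This is why a model that predicts a letter for every 20-odd milliseconds of audio can still produce a transcription of normal length.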

wav2vec2 for Automatic Speech Recognition In Plain English
