First of all thank you for your nice research! It is really inspiring.
> Also in Table 6, you see that Facebook's wav2vec 2.0 XLS-R went from 12.06% without LM to 4.38% with 5-gram LM.
It is probably Jonatas Grosman's model, not Facebook's. Bias is a common sin for Common Voice trainers: partly because they integrate Gutenberg texts into the LM, and partly because for some languages the CV sentences overlap between the train and test sets.
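For some languages you can check the intersection yourself from the release files. A quick sketch with pandas, assuming the standard cv-corpus tsv layout (the paths below are placeholders for a real CV release):

    # Count how many unique test sentences also occur in train.
    import pandas as pd

    train = pd.read_csv("cv-corpus/de/train.tsv", sep="\t")
    test = pd.read_csv("cv-corpus/de/test.tsv", sep="\t")

    overlap = set(train["sentence"]) & set(test["sentence"])
    print(f"{len(overlap)} of {test['sentence'].nunique()} "
          "unique test sentences also appear in train")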
For comparison, you can check the NeMo model:
https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models...
The improvement from the LM is from 6.68% to 6.03%, as expected.
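That kind of gain comes from beam-search decoding over the CTC logits scored together with the n-gram LM (shallow fusion). A minimal sketch with pyctcdecode, assuming a simple character vocabulary with "" as the CTC blank; the KenLM path is a placeholder, and the fake posteriors only stand in for a real acoustic model:

    import numpy as np
    from pyctcdecode import build_ctcdecoder

    # Label order must match the acoustic model's output layer.
    labels = [""] + list(" abcdefghijklmnopqrstuvwxyz")

    decoder = build_ctcdecoder(
        labels,
        kenlm_model_path="lm.binary",  # 5-gram KenLM model (placeholder)
        alpha=0.5,  # LM weight
        beta=1.0,   # word insertion bonus
    )

    # Fake acoustic posteriors for illustration: (time, vocab) log-probs.
    probs = np.random.rand(50, len(labels)).astype(np.float32)
    log_probs = np.log(probs / probs.sum(axis=1, keepdims=True))

    print(decoder.decode(log_probs))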
> Zimmermeister 2022 wav2vec2 XLS-R was equally over-trained
Yes
> Or are you suggesting that all wav2vec2-derived AI models are strongly overtrained for CommonVoice? Because they seem to do very well on LibriSpeech and GigaSpeech, too.
Not all the models are overtrained; I mainly complain about the German ones. For example, the Spanish one is reasonable:
https://huggingface.co/patrickvonplaten/wav2vec2-large-xlsr-...
> Could you explain what you mean by "perplexity" here? Can you recommend a paper about it? I haven't read about that in any of the ASR papers I studied, so this sounds like an exciting new technique for me to learn :)
Perplexity is a measure of LM quality: roughly, the exponentiated average negative log-likelihood of held-out text, so lower is better. See here:
Evaluation Metrics for Language Models https://www.cs.cmu.edu/~roni/papers/eval-metrics-bntuw-9802....
Also, for recent transformer perplexities, see something like
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context https://aclanthology.org/P19-1285.pdf
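In practice, for an n-gram LM like the ones above, it is one call on held-out text. A small sketch with the kenlm Python bindings (the model path is a placeholder):

    # Perplexity of a sentence under a KenLM n-gram model:
    # 10 ** (-log10_prob / tokens), lower is better.
    import kenlm

    model = kenlm.Model("lm.binary")  # placeholder path

    def perplexity(sentence: str) -> float:
        # kenlm returns the total log10 probability of the sentence.
        log10_prob = model.score(sentence, bos=True, eos=True)
        n_tokens = len(sentence.split()) + 1  # +1 for the </s> token
        return 10.0 ** (-log10_prob / n_tokens)

    print(perplexity("das ist ein test"))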
> BTW, regardless of the metrics, this is the model that "works for me" in production.
Sure, but it could work even better if you take a more generic model.
> BTW, BTW, it would be really helpful for research if Vosk could also publish a paper. As you can see, PapersWithCode.com currently doesn't list any Vosk WERs for CommonVoice German, despite the website reporting 11.99% for vosk-model-de-0.21.
Great idea, we'll get there, thank you!

Thanks for the perplexity paper :) I'll go read that now.

I now had time and did some testing, and the CER is already pretty much excellent for TEVR even without the language model, so it appears to me that what the LM mostly does is fix the spelling. In line with that, recognition performance is still good for medical words, but in some cases the LM will actually reduce quality there by "fixing" a brand name into a sequence of regular words.
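That CER/WER gap is easy to see on a toy example (hypothetical strings, using jiwer): when only the spelling is wrong, WER counts whole words as errors while CER barely moves.

    # WER penalizes whole-word spelling mistakes; CER barely does.
    import jiwer

    ref = "der patient bekommt ibuprofen"
    hyp = "der patient bekomt ibu profen"  # right sounds, wrong spelling

    print(jiwer.wer(ref, hyp))  # 0.75: 3 word-level errors / 4 words
    print(jiwer.cer(ref, hyp))  # ~0.07: only 2 character edits / 29 chars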