Professor: Bryan Pardo
TA: Patrick O’Reilly
It is widely accepted in the AI community that Automatic Speech Recognition (ASR) models can exhibit bias toward native speakers due to several factors. In a recent study conducted at Washington University1, researchers examined discriminatory ASR performance as a function of speakers' geopolitical orientation, specifically their first language. Unsurprisingly, they found that the ASR models were biased toward native English speakers. The results are shown in the following graph, where the X-axis represents a metric called "Word Information Lost" (the fraction of words that were changed, inserted, or deleted during recognition) and the Y-axis shows the first language of the respective speakers. For all three major ASR services, word information loss is lowest for native English speakers and increases for speakers from other language backgrounds.

Figure 1: Mean word information lost (WIL) for ASR services vs. first language1
We believe one of the primary reasons for this disparity is that ASR models are typically trained on large amounts of data consisting predominantly of speech from native speakers, owing to a lack of labeled audio datasets of non-native speakers of a given language. As a result, ASR models struggle to recognize and accurately transcribe non-native accents and variations in pronunciation. This problem motivated us to think about potential ways of generating high volumes of labeled audio data in multiple languages with diverse accents. Of course, we can generate high volumes of labeled speech using a text-to-speech system, but that alone does not solve the problem of accented speech, which is the main cause of the low ASR performance. Guo et al.2 recently introduced QuickVC, a many-to-any voice conversion framework based on the inverse short-time Fourier transform. QuickVC is trained on English speech, but we wondered whether it could also be used to generate accented speech in other languages. If so, it would give us a viable option for generating high volumes of labeled audio data with diverse accents for training or fine-tuning ASR systems.

Figure 2: Flow design for generating high volumes of labeled audio with diverse accents
Our flow starts with an input text in one of ten selected languages. We feed this text to Meta's text-to-speech (TTS) model6, which generates a synthetic audio signal with the given text as its content. This audio serves as the source speech for the QuickVC voice conversion model. We then use ten different audio samples from the VCTK dataset as target speech signals, feeding them one by one to QuickVC along with the source audio to generate new speech signals that carry the content of the source audio in the style of the target audio. We ran this experiment for ten languages, each with a hundred different prompts. In total, we generated 10 (languages) * 100 (prompts) * 10 (target speakers) = 10,000 speech signals.
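The generation loop above can be sketched as follows. The `tts_synthesize` and `quickvc_convert` calls are placeholders for the Meta TTS and QuickVC inference steps (not shown here), and the language codes and speaker IDs are our own shorthand, not the actual dataset identifiers:

```python
from itertools import product

# The ten languages used in the experiment (ISO-style shorthand).
languages = ["en", "hi", "es", "pt", "tr", "ru", "sv", "hu", "id", "de"]
num_prompts = 100                                    # prompts per language
targets = [f"vctk_speaker_{i}" for i in range(10)]   # hypothetical VCTK speaker IDs

jobs = []
for lang, prompt_idx, target in product(languages, range(num_prompts), targets):
    # source_wav = tts_synthesize(lang, prompt_idx)      # Meta TTS step (not shown)
    # output_wav = quickvc_convert(source_wav, target)   # QuickVC step (not shown)
    jobs.append((lang, prompt_idx, target))

print(len(jobs))  # 10 * 100 * 10 = 10,000 generated speech signals
```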
For the text-to-speech synthesis, we used Qi et al.'s massively multilingual (60-language) dataset derived from TED Talk transcripts3. We chose to work with ten of the most widely spoken languages: English, Hindi, Spanish, Portuguese, Turkish, Russian, Swedish, Hungarian, Indonesian, and German. For target audio samples, we chose ten different speakers from the VCTK corpus4, which contains recordings of speech from 110 English speakers with diverse accents.
We evaluated our framework on two criteria:
- How well the model was able to copy the target speaker's voice (Speaker Similarity).
- How much of the original content was preserved in the generated speech (Word Error Rate).
To measure the speaker similarity score for each language, we took the target and generated speech pairs and computed their speaker embeddings with a pre-trained voice encoder5. We then calculated the cosine similarity between each embedding pair and averaged the scores across all such pairs for a given language. The cosine similarity score ranges from -1 to 1, where:
- 1 indicates that the vectors are perfectly similar or identical,
- 0 indicates no similarity between the vectors, and
- -1 indicates that the vectors are perfectly dissimilar or opposite.
Figure 4: Speaker Similarity Measure

Since QuickVC was never trained on any language other than English, we were interested in how well it would preserve the content of the source audio in different languages, so we decided to calculate the word error rate (WER) for the generated speech. WER is a common performance metric for automatic speech recognition systems: it indicates the fraction of words that were incorrectly predicted, so lower values are better, with a WER of 0 being a perfect score. To calculate it, we used Meta's massively multilingual speech ASR6 to generate text tokens for both the TTS output and the QuickVC output, and then computed the word error rate between the two sets of text tokens using JiWER7, a simple and fast Python package for evaluating automatic speech recognition systems.
We also calculated the word error rate between the TTS output and the original text prompt to account for errors introduced by Meta's TTS system.
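For reference, WER is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. A minimal pure-Python reimplementation of the quantity that JiWER's `wer()` computes (JiWER itself also supports configurable text normalization, which we omit here) might look like:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat", "the bat sat"))  # one substitution in three words
```

Note that insertions can push WER above 1.0, which is why a value like 1.025 (Russian, below) is possible.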
Our assumption was that the model would perform well on English content, since it was trained only on English speech, and for the same reason we expected significantly worse performance on other languages. To our surprise, the model performed decently on the other languages as well. The scores are discussed in detail in the Results section.
Language | Cosine Similarity Score |
---|---|
English | 0.84 |
Hindi | 0.81 |
Spanish | 0.80 |
Portuguese | 0.82 |
Turkish | 0.82 |
Russian | 0.82 |
Swedish | 0.82 |
Hungarian | 0.80 |
Indonesian | 0.80 |
German | 0.81 |
The average cosine similarity score for each of the ten languages is 0.8 or higher, which indicates that the QuickVC model, despite being trained on just English speech, is robust enough to successfully convert the styles of non-English audio samples as well.
Language | WER (TTS output vs. text prompt) | WER (QuickVC output vs. TTS output) |
---|---|---|
English | 0.370 | 0.185 |
Hindi | 0.480 | 0.294 |
Spanish | 0.317 | 0.153 |
Portuguese | 0.319 | 0.131 |
Turkish | 0.624 | 0.444 |
Russian | 1.025 | 0.310 |
Swedish | 0.434 | 0.289 |
Hungarian | 0.532 | 0.327 |
Indonesian | 0.458 | 0.255 |
German | 0.535 | 0.226 |
We can see that for most languages, the word error rate was comparable to English; in fact, it was even lower for Spanish and Portuguese. These results indicate that QuickVC is able to preserve the content of different languages after voice conversion despite not being trained on any of them.
source audio: Download audio
target speaker: Download audio
output: Download audio

In this case, we used an out-of-distribution target voice (in Hindi) that was not present in QuickVC's training dataset, so the results are not as good as in the other demos. It appears that the model converted the source audio to one of the training voices that was closest to the target audio.
- https://doi.org/10.48550/arXiv.2208.01157
- Guo, Houjian, et al. "QuickVC: Many-to-any Voice Conversion Using Inverse Short-time Fourier Transform for Faster Conversion." arXiv preprint arXiv:2302.08296 (2023).
- Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. 2018. When and Why Are Pre-Trained Word Embeddings Useful for Neural Machine Translation?. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 529–535, New Orleans, Louisiana. Association for Computational Linguistics.
- Yamagishi, Junichi; Veaux, Christophe; MacDonald, Kirsten. (2019). CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit (version 0.92), [sound]. University of Edinburgh. The Centre for Speech Technology Research (CSTR). https://doi.org/10.7488/ds/2645.
- https://github.com/resemble-ai/Resemblyzer
- https://github.com/facebookresearch/fairseq/tree/main/examples/mms
- https://github.com/jitsi/jiwer/tree/master