Professor: Bryan Pardo
TA: Patrick O’Reilly
It is widely accepted in the AI community that Automatic Speech Recognition (ASR) models can exhibit bias toward native speakers due to several factors. In a recent study conducted at Washington University1, researchers examined discriminatory ASR performance as a function of speakers' geopolitical orientation, specifically their first language. Unsurprisingly, they found that the ASR models were biased toward native English speakers. The results are shown in the following graph, where the X-axis represents a metric called "Word Information Lost" (the fraction of words that were changed, inserted, or deleted during recognition) and the Y-axis shows the first language of the respective speakers. For all three major ASR services, word information loss is lowest for native English speakers and increases for speakers from other language backgrounds.

Figure 1: Mean word information lost (WIL) for ASR services vs. first language1
We believe one of the primary reasons for this disparity is that ASR models are typically trained on large amounts of data consisting predominantly of speech from native speakers, owing to a lack of labeled audio datasets of non-native speakers of a given language. As a result, ASR models struggle to recognize and accurately transcribe non-native accents and variations in pronunciation. This problem motivated us to think about potential ways of generating high volumes of labeled audio data in multiple languages with diverse accents. Of course, we can generate high volumes of labeled speech using a text-to-speech system, but that alone does not solve the problem of accented speech, which is the main cause of the low ASR performance. Guo et al.2 recently introduced QuickVC, a many-to-any voice conversion framework based on the inverse short-time Fourier transform. QuickVC is trained on English speech, but we wondered whether it could also be used to generate accented speech in other languages. If so, it would give us a viable option for generating high volumes of labeled audio data with diverse accents for training or fine-tuning ASR systems.

Figure 2: Flow design for generating high volumes of labeled audio with diverse accents
Our flow starts with an input text in one of ten selected languages. We feed this text to Meta's text-to-speech (TTS) model6, which generates a synthetic audio signal with the given text as its content. This audio serves as the source speech for the QuickVC voice conversion model. We then use ten different audio samples from the VCTK dataset as target speech signals, feeding them one by one to QuickVC along with the source audio to generate new speech signals that carry the content of the source audio in the style of the target audio. We ran this experiment for ten languages, each with a hundred different prompts. In total, we generated 10 (languages) * 100 (prompts) * 10 (target speakers) = 10,000 speech signals.
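The generation loop above can be sketched as follows. The `tts_synthesize` and `quickvc_convert` calls are placeholders for the Meta TTS and QuickVC inference steps (not shown here), and the language codes and speaker IDs are our own shorthand, not the actual dataset identifiers:

```python
from itertools import product

# The ten languages used in the experiment (ISO-style shorthand).
languages = ["en", "hi", "es", "pt", "tr", "ru", "sv", "hu", "id", "de"]
num_prompts = 100                                    # prompts per language
targets = [f"vctk_speaker_{i}" for i in range(10)]   # hypothetical VCTK speaker IDs

jobs = []
for lang, prompt_idx, target in product(languages, range(num_prompts), targets):
    # source_wav = tts_synthesize(lang, prompt_idx)      # Meta TTS step (not shown)
    # output_wav = quickvc_convert(source_wav, target)   # QuickVC step (not shown)
    jobs.append((lang, prompt_idx, target))

print(len(jobs))  # 10 * 100 * 10 = 10,000 generated speech signals
```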
For the text-to-speech synthesis, we used Qi et al.'s massively multilingual (60-language) dataset derived from TED Talk transcripts3. We chose to work with ten of the most widely spoken languages: English, Hindi, Spanish, Portuguese, Turkish, Russian, Swedish, Hungarian, Indonesian, and German. For target audio samples, we chose ten different speakers from the VCTK corpus4, which contains recordings of speech from 110 English speakers with diverse accents.
We evaluated our framework on two criteria:
- How well the model was able to copy the target speaker's voice (Speaker Similarity).
- How much of the original content was preserved in the generated speech (Word Error Rate).
To measure the speaker similarity score for each language, we took the target and generated speech pairs and computed their speaker embeddings with a pre-trained voice encoder5. We then calculated the cosine similarity between each embedding pair and averaged the scores across all such pairs for a given language. The cosine similarity score ranges from -1 to 1, where:
- 1 indicates that the vectors are perfectly similar or identical,
- 0 indicates no similarity between the vectors, and
- -1 indicates that the vectors are perfectly dissimilar or opposite.
Figure 4: Speaker Similarity Measure

Since QuickVC was never trained on any language other than English, we were interested in how well it would preserve the content of the source audio in different languages, so we decided to calculate the word error rate (WER) for the generated speech. WER is a common performance metric for automatic speech recognition systems: it indicates the fraction of words that were incorrectly predicted, so lower values are better, with a WER of 0 being a perfect score. To calculate it, we used Meta's massively multilingual speech ASR6 to generate text tokens for both the TTS output and the QuickVC output, and then computed the word error rate between the two sets of text tokens using JiWER7, a simple and fast Python package for evaluating automatic speech recognition systems.
We also calculated the word error rate between the TTS output and the original text prompt to account for errors introduced by Meta's TTS system.
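For reference, WER is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. A minimal pure-Python reimplementation of the quantity that JiWER's `wer()` computes (JiWER itself also supports configurable text normalization, which we omit here) might look like:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat", "the bat sat"))  # one substitution in three words
```

Note that insertions can push WER above 1.0, which is why a value like 1.025 (Russian, below) is possible.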
Our assumption was that the model would perform well on English content, since it was trained only on English speech, and for the same reason we expected significantly worse performance on other languages. To our surprise, the model performed decently on the other languages as well. The scores are discussed in detail in the Results section.
Language | Cosine Similarity Score |
---|---|
English | 0.84 |
Hindi | 0.81 |
Spanish | 0.80 |
Portuguese | 0.82 |
Turkish | 0.82 |
Russian | 0.82 |
Swedish | 0.82 |
Hungarian | 0.80 |
Indonesian | 0.80 |
German | 0.81 |
The average cosine similarity score for each of the ten languages is 0.8 or higher, which indicates that the QuickVC model, despite being trained on just English speech, is robust enough to successfully convert the styles of non-English audio samples as well.
Language | WER (TTS output vs. text prompt) | WER (QuickVC output vs. TTS output) |
---|---|---|
English | 0.370 | 0.185 |
Hindi | 0.480 | 0.294 |
Spanish | 0.317 | 0.153 |
Portuguese | 0.319 | 0.131 |
Turkish | 0.624 | 0.444 |
Russian | 1.025 | 0.310 |
Swedish | 0.434 | 0.289 |
Hungarian | 0.532 | 0.327 |
Indonesian | 0.458 | 0.255 |
German | 0.535 | 0.226 |
We can see that for most languages, the word error rate was comparable to English; in fact, it was even lower for Spanish and Portuguese. These results indicate that QuickVC is able to preserve the content of different languages after voice conversion despite not being trained on any of them.
source audio: Download audio
target speaker: Download audio
output: Download audio

In this case, we used an out-of-distribution target voice (in Hindi) that was not present in QuickVC's training dataset, so the results are not as good as in the other demos. It appears that the model converted the source audio to one of the training voices that was closest to the target audio.
- https://doi.org/10.48550/arXiv.2208.01157
- Guo, Houjian, et al. "QuickVC: Many-to-any Voice Conversion Using Inverse Short-time Fourier Transform for Faster Conversion." arXiv preprint arXiv:2302.08296 (2023).
- Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. 2018. When and Why Are Pre-Trained Word Embeddings Useful for Neural Machine Translation?. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 529–535, New Orleans, Louisiana. Association for Computational Linguistics.
- Yamagishi, Junichi; Veaux, Christophe; MacDonald, Kirsten. (2019). CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit (version 0.92), [sound]. University of Edinburgh. The Centre for Speech Technology Research (CSTR). https://doi.org/10.7488/ds/2645.
- https://github.com/resemble-ai/Resemblyzer
- https://github.com/facebookresearch/fairseq/tree/main/examples/mms
- https://github.com/jitsi/jiwer/tree/master