
Zero-Shot Evaluation of QuickVC in Multilingual Setting

Developed By:

([email protected])

([email protected])

([email protected])


Professor: Bryan Pardo


Background

It is widely recognized in the AI community that automatic speech recognition (ASR) models can exhibit bias toward native speakers due to several factors. In a recent study conducted at Washington University [1], researchers examined discriminatory ASR performance as a function of speakers' geopolitical orientation, specifically their first language. Unsurprisingly, they found that the ASR models were biased toward native English speakers. The results are shown in the following graph, where the X-axis represents a metric called "Word Information Lost" - the fraction of word information that is changed, inserted, or deleted during transcription - and the Y-axis shows the first language of the respective speakers. For all three major ASR services, word information lost is lowest for native English speakers and increases for speakers with other first languages.

Figure 1: Mean word information lost (WIL) for ASR services vs. first language [1]

   

We believe that one of the primary reasons for this disparity is that ASR models are typically trained on large amounts of data consisting predominantly of speech from native speakers. This happens due to a lack of labeled audio datasets of non-native speakers speaking a particular language. As a result, ASR models struggle to recognize and accurately transcribe non-native accents and variations in pronunciation. This problem motivated us to think about potential ways of generating high volumes of labeled audio data in multiple languages with diversified accents. Of course, we can generate high volumes of labeled speech by using a text-to-speech system, but that does not solve the problem of accented speech, which is the main reason for the low performance of ASR systems on non-native speakers. Guo et al. [2] recently introduced QuickVC - a many-to-any voice conversion framework using the inverse short-time Fourier transform. QuickVC is trained on English speech, but we wondered whether it could also be used to generate accented speech in other languages. We believed that doing so would give us a viable option for generating high volumes of labeled audio data with diversified accents that could be used for training or fine-tuning ASR systems.

Figure 2: Flow design for generating a high volume of labeled audio with diversified accents

   

Framework

Our flow starts with an input text in one of the ten selected languages. We feed this input text to Meta's text-to-speech (TTS) model [6], which generates a synthetic audio signal with the given text as its content. This audio signal serves as the source speech for the QuickVC voice conversion model. We then use ten different audio samples from the VCTK dataset as the target speech signals and feed them one by one to the QuickVC model along with the source audio, generating new speech signals that have the content of the source audio and the style of the target audio. We ran this experiment for ten different languages, each with a hundred different prompts. In total, we generated 10 (languages) * 100 (prompts) * 10 (target speakers) = 10,000 speech signals.


Figure 3: Framework
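
To make the flow concrete, here is a minimal sketch of the generation loop. The functions mms_tts_synthesize and quickvc_convert are hypothetical wrappers introduced only for illustration; they are not part of the released MMS or QuickVC code and stand in for whatever inference entry points those repositories expose.

```python
# Sketch of the data-generation loop (hypothetical wrappers, not real APIs).
from pathlib import Path

LANGUAGES = ["en", "hi", "es", "pt", "tr", "ru", "sv", "hu", "id", "de"]

def mms_tts_synthesize(text: str, lang: str) -> Path:
    """Hypothetical wrapper around Meta's MMS TTS: synthesize `text` in `lang`."""
    raise NotImplementedError

def quickvc_convert(source_wav: Path, target_wav: Path) -> Path:
    """Hypothetical wrapper around QuickVC: keep the content of `source_wav`,
    apply the voice of `target_wav`."""
    raise NotImplementedError

def generate_corpus(prompts, target_wavs):
    """10 languages x 100 prompts x 10 VCTK target speakers = 10,000 outputs."""
    outputs = []
    for lang in LANGUAGES:
        for text in prompts[lang]:              # 100 prompts per language
            source_wav = mms_tts_synthesize(text, lang)
            for target_wav in target_wavs:      # 10 VCTK target speakers
                outputs.append(quickvc_convert(source_wav, target_wav))
    return outputs
```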

Dataset

For the text-to-speech synthesis, we used Qi et al.'s massively multilingual (60 languages) dataset derived from TED Talk transcripts [3]. We chose to work with ten widely spoken languages: English, Hindi, Spanish, Portuguese, Turkish, Russian, Swedish, Hungarian, Indonesian, and German. For target audio samples, we chose ten different speakers from the VCTK corpus [4], which contains recordings of speech from 110 English speakers with diverse accents.

Evaluation

We wanted to evaluate our framework for two different tasks:

  1. How well the model was able to copy the target speaker's voice (Speaker Similarity).
  2. How much of the original content was preserved in the generated speech (Word Error Rate).

Speaker Similarity

In order to measure the speaker similarity score for each language, we took the target and generated speech pairs and calculated their speaker embeddings with the help of a pre-trained voice encoder [5]. Then, we calculated the cosine similarity between the embedding pairs and finally averaged the score across all such pairs for a particular language. The cosine similarity score ranges from -1 to 1, where:

  a. 1 indicates that the vectors are perfectly similar or identical.

  b. 0 indicates no similarity between the vectors.

  c. -1 indicates that the vectors are perfectly dissimilar or opposite.

Figure 4: Speaker Similarity Measure
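
As an illustration, a minimal sketch of this measurement with the Resemblyzer voice encoder [5] could look as follows; the file paths are placeholders, and the per-language score is the average over all target/generated pairs.

```python
# Speaker similarity between a target utterance and a generated utterance,
# using the pre-trained Resemblyzer voice encoder.
import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav

encoder = VoiceEncoder()

def speaker_similarity(target_path: str, generated_path: str) -> float:
    """Cosine similarity between the two utterances' speaker embeddings."""
    target_embed = encoder.embed_utterance(preprocess_wav(target_path))
    generated_embed = encoder.embed_utterance(preprocess_wav(generated_path))
    return float(
        np.dot(target_embed, generated_embed)
        / (np.linalg.norm(target_embed) * np.linalg.norm(generated_embed))
    )

# Placeholder paths for one target/generated pair.
score = speaker_similarity("target_speaker.wav", "quickvc_output.wav")
```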

Word Error Rate (WER)

Since QuickVC was never trained on any language other than English, we were interested in how well it would preserve the content of the source audio in different languages. So, we decided to calculate the word error rate (WER) for the generated speech. WER is a common metric of the performance of an automatic speech recognition system: it measures the fraction of words that were incorrectly predicted (substituted, deleted, or inserted) relative to the reference, so it can exceed 1 when there are many insertions. The lower the value, the better the performance of the ASR system, with a WER of 0 being a perfect score. To calculate this, we used Meta's massively multilingual speech ASR [6] to generate text tokens for the TTS output as well as the QuickVC output, and then we calculated the word error rate between the two sets of text tokens using JiWER [7] - a simple and fast Python package for evaluating automatic speech recognition systems.


Figure 5: Word Error Rate (WER) - Voice Conversion

We also calculated the word error rate between the TTS output and the original text prompt to account for errors introduced by Meta's TTS system.


Figure 6: Word Error Rate (WER) - Source
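
A minimal sketch of both WER computations with JiWER [7] is shown below; the transcript strings are placeholders for the text produced by the MMS ASR model on each audio signal.

```python
# Word error rate for both stages, computed with JiWER.
import jiwer

original_prompt = "..."      # input text fed to the TTS model (placeholder)
tts_transcript = "..."       # MMS ASR transcript of the TTS output (placeholder)
quickvc_transcript = "..."   # MMS ASR transcript of the QuickVC output (placeholder)

# "WER Source": errors introduced by the TTS stage itself.
wer_source = jiwer.wer(original_prompt, tts_transcript)

# "WER Voice Conversion": content lost between the TTS output and the converted speech.
wer_voice_conversion = jiwer.wer(tts_transcript, quickvc_transcript)
```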

Our assumption was that the model would perform well on English content, since it was trained only on English speech, and that performance would be significantly worse for the other languages for the same reason. To our surprise, the model performed decently on the other languages as well. The scores are discussed further in detail in the Results section.


Results

Speaker Similarity

| Language | Cosine Similarity Score |
| --- | --- |
| English | 0.84 |
| Hindi | 0.81 |
| Spanish | 0.80 |
| Portuguese | 0.82 |
| Turkish | 0.82 |
| Russian | 0.82 |
| Swedish | 0.82 |
| Hungarian | 0.80 |
| Indonesian | 0.80 |
| German | 0.81 |

The average cosine similarity score for each of the ten languages is 0.8 or higher, which indicates that the QuickVC model, despite being trained on just English speech, is robust enough to successfully convert the styles of non-English audio samples as well.

Word Error Rate

| Language | WER (Source) | WER (Voice Conversion) |
| --- | --- | --- |
| English | 0.370 | 0.185 |
| Hindi | 0.480 | 0.294 |
| Spanish | 0.317 | 0.153 |
| Portuguese | 0.319 | 0.131 |
| Turkish | 0.624 | 0.444 |
| Russian | 1.025 | 0.310 |
| Swedish | 0.434 | 0.289 |
| Hungarian | 0.532 | 0.327 |
| Indonesian | 0.458 | 0.255 |
| German | 0.535 | 0.226 |

We can see that for most of the languages, the word error rate was comparable to English; in fact, it was even lower for Spanish and Portuguese. These results indicate that QuickVC is able to preserve the content of different languages after voice conversion despite not being trained on any of those languages.

Demo

Demo 1 (a): English

source audio: Download audio target speaker: Download audio output: Download audio


Demo 1 (b): English (with Hindi target audio)

source audio: Download audio target speaker: Download audio output: Download audio

In this case, we used an out-of-distribution target voice (in Hindi), which was not present in QuickVC's training dataset, so the results are not as good as in the other demos. It appears that the model mapped the source audio onto one of the training voices that was closest to the target audio.


Demo 2: Hindi

source audio: Download audio target speaker: Download audio output: Download audio


Demo 3: Spanish

source audio: Download audio target speaker: Download audio output: Download audio


Demo 4: Portuguese

source audio: Download audio target speaker: Download audio output: Download audio


Demo 5: Turkish

source audio: Download audio target speaker: Download audio output: Download audio


Demo 6: Russian

source audio: Download audio target speaker: Download audio output: Download audio


Demo 7: Swedish

source audio: Download audio target speaker: Download audio output: Download audio


Demo 8: Hungarian

source audio: Download audio target speaker: Download audio output: Download audio


Demo 9: Indonesian

source audio: Download audio target speaker: Download audio output: Download audio


Demo 10: German

source audio: Download audio target speaker: Download audio output: Download audio


References

  1. https://doi.org/10.48550/arXiv.2208.01157
  2. Guo, Houjian, et al. "QuickVC: Many-to-any Voice Conversion Using Inverse Short-time Fourier Transform for Faster Conversion." arXiv preprint arXiv:2302.08296 (2023).
  3. Ye Qi, Devendra Sachan, Matthieu Felix, Sarguna Padmanabhan, and Graham Neubig. 2018. When and Why Are Pre-Trained Word Embeddings Useful for Neural Machine Translation?. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 529–535, New Orleans, Louisiana. Association for Computational Linguistics.
  4. Yamagishi, Junichi; Veaux, Christophe; MacDonald, Kirsten. (2019). CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit (version 0.92), [sound]. University of Edinburgh. The Centre for Speech Technology Research (CSTR). https://doi.org/10.7488/ds/2645.
  5. https://github.com/resemble-ai/Resemblyzer
  6. https://github.com/facebookresearch/fairseq/tree/main/examples/mms
  7. https://github.com/jitsi/jiwer/tree/master
