
Comments (6)

prorev commented on July 1, 2024

What are the alignments used for, after all? The Tacotron 2 paper does not mention alignments.

from fastspeech.

prorev commented on July 1, 2024

I found this in the FastSpeech 2 paper:

The training of FastSpeech relies on an autoregressive teacher model to provide 1) the duration of each phoneme to train a duration predictor, and 2) the generated mel-spectrograms for knowledge distillation. While these designs in FastSpeech ease the learning of the one-to-many mapping problem in TTS, they also bring several disadvantages: 1) the two-stage teacher-student distillation pipeline is complicated; 2) the duration extracted from the attention map of the teacher model is not accurate enough, and the target mel-spectrograms distilled from the teacher model suffer from information loss due to data simplification, both of which limit the voice quality and prosody.

This makes it clear that you need another trained model (the autoregressive teacher) to use FastSpeech with a custom dataset, which is not very practical.

So the alignments are a big problem, because training depends on them: no alignments, no training. The FastSpeech paper is worth reading to understand how this is done in principle, but it is not the best choice if you want training that works out of the box.

You may find that an alignments.py file was present in this project before but was removed (commit id: e11b60d); no commit message was set to explain why.
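
To make the quoted "duration extracted from the attention map" concrete, here is a minimal sketch of the usual idea: assign every mel frame to the phoneme that gets the highest attention weight, then count frames per phoneme. The function name and the toy attention matrix are made up for illustration; this is not the removed alignments.py.

```python
def durations_from_attention(attention, num_phonemes):
    """Count, for each phoneme, how many mel frames attend to it most strongly.

    attention: list of rows, one row per mel frame; each row holds the
    attention weights over the phonemes (hypothetical toy format).
    """
    durations = [0] * num_phonemes
    for frame_weights in attention:
        # Assign this mel frame to the phoneme with the largest weight.
        best = max(range(num_phonemes), key=lambda i: frame_weights[i])
        durations[best] += 1
    return durations


# Toy attention map: 5 mel frames over 3 phonemes.
attention = [
    [0.9, 0.1, 0.0],
    [0.6, 0.4, 0.0],
    [0.2, 0.7, 0.1],
    [0.0, 0.3, 0.7],
    [0.0, 0.1, 0.9],
]
durations = durations_from_attention(attention, 3)
print(durations)  # [2, 1, 2]
```

By construction the durations sum to the number of mel frames, which is what the duration predictor and length regulator rely on. A blurry or non-monotonic attention map gives poor durations, which is the inaccuracy the paper complains about.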


CanKorkut commented on July 1, 2024

Thank you, I found alignments.py in a previous commit and tried it. The resulting synthesis quality is not bad, but when I run inference on sentences longer than five or six words, the output stutters and drops letters. I am now trying FastSpeech2. Alignments really are a big problem.


cuongnguyengit commented on July 1, 2024

Hi, I have the same question. I am also trying to train FastSpeech2 on my own language, but getting the alignments is really difficult.
My Tacotron2 model trained very well on my dataset, so its alignments should be good, but the synthesis is quite bad.
The output seems intelligible, but the sounds are mixed up. So my question is whether the durations generated by Tacotron match the mels, energies, and pitches generated by librosa or the TacotronSTFT module. This problem makes it hard to understand how FastSpeech2 produces good-quality audio. Thanks
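
A quick sanity check for the matching question, as a rough sketch only: the per-phoneme durations have to sum to the number of mel frames, and the frame count depends on the hop length used for feature extraction. The helper names are hypothetical, and the frame formula assumes a centered STFT (1 + samples // hop), so check it against the extraction code you actually use.

```python
def expected_frames(num_samples, hop_length):
    # Assumption: centered STFT, which yields 1 + num_samples // hop_length frames.
    return 1 + num_samples // hop_length


def duration_mismatch(durations, num_mel_frames):
    # 0 means the durations cover the mel-spectrogram exactly;
    # a non-zero value means the two were extracted with different settings.
    return sum(durations) - num_mel_frames


# One second of audio at 22050 Hz with two different hop lengths.
print(expected_frames(22050, 256))  # 87
print(expected_frames(22050, 275))  # 81

# Durations that add up to 87 frames fit the first setting but not the second.
durations = [30, 27, 30]
print(duration_mismatch(durations, expected_frames(22050, 256)))  # 0
print(duration_mismatch(durations, expected_frames(22050, 275)))  # 6
```

If the mismatch is non-zero for your files, the durations and the mels (and the pitch and energy contours derived with the same framing) were probably not computed with the same hop length or padding.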


CanKorkut commented on July 1, 2024

I looked into this problem and found something about the reduction factor. I don't fully understand the architecture, but it seems Tacotron can learn easily with a large reduction factor, whereas the NVIDIA Tacotron2 implementation has no reduction factor. Maybe NVIDIA's Tacotron is good for synthesis but bad for extracting alignments. I'm not sure; I will look into it further and edit this comment.
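
For reference, a small sketch of what a reduction factor means for durations: with reduction factor r, each decoder step emits r mel frames, so the attention map has one row per decoder step and durations counted in steps must be scaled by r to be in mel frames. The function name and the trimming rule for a padded last step are assumptions for illustration, not taken from any particular implementation.

```python
def durations_in_frames(step_durations, reduction_factor, num_mel_frames):
    """Convert durations counted in decoder steps to durations in mel frames."""
    frame_durations = [d * reduction_factor for d in step_durations]
    # The last decoder step may be partly padding, so the scaled total can
    # exceed the real frame count; trim the excess from the end (assumption).
    excess = sum(frame_durations) - num_mel_frames
    for i in reversed(range(len(frame_durations))):
        if excess <= 0:
            break
        cut = min(excess, frame_durations[i])
        frame_durations[i] -= cut
        excess -= cut
    return frame_durations


# 5 decoder steps with r = 2 cover up to 10 frames; the utterance has 9.
print(durations_in_frames([2, 1, 2], 2, 9))  # [4, 2, 3]
```

With r = 1, as in a model without a reduction factor, the durations come out unchanged as long as the totals already match.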


khainh3101 commented on July 1, 2024

@CanKorkut Hi, I'm using that alignment.py (commit id: e11b60d) to extract alignment files, but the results have different dimensions from the LJSpeech alignment files already included in this FastSpeech source code. Can you show me the code you used to extract the alignment files, so I can train on another language? Thank you

