expressive_tacotron's Introduction

A TensorFlow Implementation of Expressive Tacotron

This project implements the paper Towards End-to-End Prosody Transfer for Expressive Speech Synthesis with Tacotron in order to verify its concept. Most of the baseline code is based on my previous Tacotron implementation.

Requirements

  • NumPy >= 1.11.1
  • TensorFlow >= 1.3
  • librosa
  • tqdm
  • matplotlib
  • scipy

Data

Because the paper used internal data, I train the model on the LJ Speech Dataset.

The LJ Speech Dataset has recently become a widely used benchmark for the TTS task because it is publicly available. It contains 24 hours of reasonable-quality samples.

Training

  • STEP 0. Download LJ Speech Dataset or prepare your own data.
  • STEP 1. Adjust hyperparameters in hyperparams.py. (If you want to do preprocessing, set prepro to True.)
  • STEP 2. Run python train.py. (If you set prepro True, run python prepro.py first)
  • STEP 3. Run python eval.py regularly during training.
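
The order of STEP 2 can be sketched as a small driver script. This is only an illustration, assuming the scripts are run from the repository root and that prepro is a boolean; the helper names `training_commands` and `run_training` are hypothetical and not part of the repo.

```python
import subprocess
import sys


def training_commands(prepro):
    """Return the commands for STEP 2, in the order they should run.

    `prepro` mirrors the prepro flag from STEP 1 (assumed to be a boolean).
    """
    commands = []
    if prepro:
        # Preprocessing has to finish before training starts.
        commands.append([sys.executable, "prepro.py"])
    commands.append([sys.executable, "train.py"])
    return commands


def run_training(prepro):
    """Run each command in turn and stop at the first failure."""
    for cmd in training_commands(prepro):
        subprocess.run(cmd, check=True)
```

Calling `run_training(True)` would run prepro.py and then train.py; eval.py (STEP 3) is left out because it is meant to be run separately while training is in progress.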

Sample Synthesis

I generate speech samples based on the same script as the one used for the original web demo. You can check it in test_sents.txt.

  • Run python synthesize.py and check the files in samples.

Samples

16 sample sentences in the first chapter of the original web demo are collected for sample synthesis. Two audio clips per sentence are used for prosody embedding--reference voice and base voice. Mostly, those two are different in terms of gender or region. The samples below look like the following:

  • 1a: the first reference audio
  • 1b: sample embedded with 1a's prosody
  • 1c: the second reference audio (base)
  • 1d: sample embedded with 1c's prosody

Check out the samples at each step.

Analysis

  • Listening to the results at 130k steps, it's not clear whether the model has learned the prosody.
  • It's clear that different reference audios produce different samples.
  • Some samples are worthy of note. For example, listen to the four audios of no.15. The stress on the "right" part was obviously transferred.
  • Check out no.9, whose reference audios are sung. They are fun.

Notes

  • Because this repo focuses on investigating the concept of the paper, I did not follow some details of the paper.
  • The paper used phoneme inputs, whereas I stuck to graphemes.
  • The paper used GMM attention, whereas I used Bahdanau attention.
  • The original audio samples were obtained from a WaveNet vocoder.
  • I'm still not sure whether what the paper calls a prosody embedding can be isolated from the speaker.
  • For prosody embedding, the authors employed conv2d. Why not conv1d?
  • When the reference audio's text or sentence structure is totally different from the inference script, what happens?
  • If I have time, I'd like to implement their 2nd paper: Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis

April 2018, Kyubyong Park

expressive_tacotron's People

Contributors

kyubyong, reidsanders

expressive_tacotron's Issues

How to find out when training went wrong?

Thank you very much for your contribution. I have trained the model on LJ Speech for 835k steps. However, the results are not as good as the samples you provided for 420k. Is there a problem with my training? Below you can find the attention plot and the sample audio at 835k. What kind of attention plot signals a good checkpoint for the synthesizer?
alignment_835k
And the progress was like this:
problem

The samples synthesized from this checkpoint can be found here:
https://www.dropbox.com/sh/n5ld72rn9otxl7a/AAACyplZMtxiYtuUgvWN8OGaa?dl=0

Also, the trained model (checkpoint), is uploaded here:
https://www.dropbox.com/sh/ks91bdputl5ujo7/AABRIqpviRDBgWuFIJn1yuhba?dl=0

Also, I was wondering if you have any plans to release your trained model.

Another thing: tf.train.Saver keeps only the last 5 checkpoints by default, and the wrapper used here (i.e. tf.train.Supervisor) does not make it easy to change the saver's max_to_keep property.
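
To make the default retention concrete, here is a minimal pure-Python sketch of a keep-the-last-N policy. It is not TensorFlow code: the function name `retained_checkpoints` and the save interval of 1,000 steps are assumptions chosen for illustration.

```python
from collections import deque


def retained_checkpoints(saved_steps, max_to_keep=5):
    """Return the checkpoint steps still on disk under a keep-the-last-N policy."""
    kept = deque(maxlen=max_to_keep)  # the oldest entry is dropped automatically
    for step in saved_steps:
        kept.append(step)
    return list(kept)


# Hypothetical run that saves every 1,000 steps up to 835k.
steps = range(1000, 836000, 1000)
print(retained_checkpoints(steps))
# [831000, 832000, 833000, 834000, 835000]
```

In TensorFlow 1.x, `tf.train.Saver` takes a `max_to_keep` argument and `tf.train.Supervisor` takes a `saver` argument, so passing a custom saver may be one way to keep more checkpoints. That has not been tested against this repo.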

PS. The hyperparameters are kept as default.

    # signal processing
    sr = 22050 # Sample rate.
    n_fft = 2048 # fft points (samples)
    frame_shift = 0.0125 # seconds
    frame_length = 0.05 # seconds
    hop_length = int(sr*frame_shift) # samples.
    win_length = int(sr*frame_length) # samples.
    n_mels = 80 # Number of Mel banks to generate
    power = 1.2 # Exponent for amplifying the predicted magnitude
    n_iter = 50 # Number of inversion iterations
    preemphasis = .97 # or None
    max_db = 100
    ref_db = 20

    # model
    embed_size = 256 # alias = E
    encoder_num_banks = 16
    decoder_num_banks = 8
    num_highwaynet_blocks = 4
    r = 5 # Reduction factor.
    dropout_rate = .5

    # training scheme
    lr = 0.001 # Initial learning rate.
    logdir = "logdir"
    sampledir = 'samples'
    batch_size = 32
    num_iterations = 1000000
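
As a quick check of the derived signal-processing values, the arithmetic from the settings above can be worked through directly. The only addition is `n_freq_bins`, a name introduced here for the standard one-sided FFT bin count; it is not taken from the hyperparameters.

```python
sr = 22050            # sample rate (Hz)
frame_shift = 0.0125  # seconds
frame_length = 0.05   # seconds
n_fft = 2048          # FFT points

hop_length = int(sr * frame_shift)   # 275.625 truncated to 275 samples
win_length = int(sr * frame_length)  # 1102.5 truncated to 1102 samples
n_freq_bins = n_fft // 2 + 1         # 1025 bins in a one-sided spectrum

print(hop_length, win_length, n_freq_bins)  # 275 1102 1025
```

Because `int()` truncates, the effective frame shift is 275 / 22050 s, about 12.47 ms rather than exactly 12.5 ms.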

NotFoundError: Restoring from checkpoint failed.

python synthesize.py

tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

It seems that the prosody is not transferred

I have checked your sample results from 420k training steps and tried aligning the reference audio with the target audio; they seem quite different.
So I am really confused about how the authors of the paper did it. Maybe they used a very large dataset of 100+ hours to get good results.

No checkpoints after running train.py

@Kyubyong I'm able to run train.py successfully, but there are no checkpoints in logdir, so I'm unable to run synthesize.py.

I have changed parameters such as:

    lr = 0.9 # Initial learning rate.
    logdir = "logdir"
    sampledir = 'samples'
    batch_size = 16

I also made some changes in train.py:

    if __name__ == '__main__':
        g = Graph(); print("Training Graph loaded")

        sv = tf.train.Supervisor(logdir=hp.logdir, save_summaries_secs=60, save_model_secs=0)
        with sv.managed_session() as sess:

            if len(sys.argv) == 2:
                sv.saver.restore(sess, sys.argv[1])
                print("Model restored.")

            #while 1:
                for _ in tqdm(range(g.num_batch), total=g.num_batch, ncols=70, leave=False, unit='b'):
                    _, gs = sess.run([g.train_op, g.global_step])

                    # Write checkpoint files
                    if gs % 100 == 0:
                        sv.saver.save(sess, hp.logdir + '/model_gs_{}k'.format(gs//100))

                        # plot the first alignment for logging
                        al = sess.run(g.alignments)
                        plot_alignment(al[0], gs)

                #if gs > hp.num_iterations:
                    #break

        print("Done")

Getting a file-not-found error when I run train.py even though I have the file

(tf-gpu) [pranaw@login expressive_tacotron-master]$ python train.py
Traceback (most recent call last):
File "train.py", line 101, in <module>
g = Graph(); print("Training Graph loaded")
File "train.py", line 35, in __init__
self.x, self.y, self.z, self.fnames, self.num_batch = get_batch()
File "/home/pranaw/TTS/expressive_tacotron-master/data_load.py", line 65, in get_batch
fpaths, texts = load_data() # list
File "/home/pranaw/TTS/expressive_tacotron-master/data_load.py", line 40, in load_data
lines = codecs.open(transcript, 'r', 'utf-8').readlines()
File "/home/pranaw/TTS/anaconda3/envs/tf-gpu/lib/python3.6/codecs.py", line 897, in open
file = builtins.open(filename, mode, buffering)
FileNotFoundError: [Errno 2] No such file or directory: '/data/private/voice/LJSpeech-1.0/metadata.csv'
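
The path in the traceback sits outside the user's home directory, which suggests a default dataset location that was never changed. A minimal sketch of checking the location before training starts, assuming the dataset directory is a configurable setting; the names `find_metadata` and `data_dir` are hypothetical and not taken from the repo.

```python
import os


def find_metadata(data_dir):
    """Return the path to metadata.csv inside data_dir, or raise a clear error."""
    transcript = os.path.join(data_dir, "metadata.csv")
    if not os.path.isfile(transcript):
        raise FileNotFoundError(
            "metadata.csv not found in {!r}; point the data path at your "
            "LJ Speech directory".format(data_dir)
        )
    return transcript
```

Calling `find_metadata` with the directory where LJ Speech was unpacked returns the transcript path; with any other directory it fails immediately with a message naming the directory that was searched.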

TypeError while running on LJSpeech-1.1

Thanks for your contribution. 👍

I followed the Training steps and ran python train.py.

However, I got this error: :(
0%| | 0/409 [00:00<?, ?b/s]2018-04-08 11:58:13.819432:
W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: TypeError: a bytes-like object is required, not 'str'
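
On Python 3, this message appears when a bytes value is combined with a str, for example when calling `split` on bytes with a str separator. One plausible cause here (an assumption, not confirmed by the log) is a file path arriving as bytes inside a function that TensorFlow calls. A minimal sketch of decoding before use, with a hypothetical helper `to_text` and a made-up path:

```python
def to_text(value, encoding="utf-8"):
    """Decode bytes to str; pass str through unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


fpath = b"wavs/LJ001-0001.wav"  # hypothetical value received as bytes
# fpath.split("/") would raise the TypeError shown above on Python 3.
fname = to_text(fpath).split("/")[-1]
print(fname)  # LJ001-0001.wav
```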

ValueError: Cannot feed value of shape (16, 188) ...

I have trained the model on LJ Speech for around 670k steps. When running python synthesize.py, I received this error.

Graph loaded
2018-04-25 19:19:45.131340: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-04-25 19:19:45.188585: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-25 19:19:45.188789: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6705
pciBusID: 0000:01:00.0
totalMemory: 10.91GiB freeMemory: 114.62MiB
2018-04-25 19:19:45.188802: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
2018-04-25 19:19:45.189340: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 114.62M (120193024 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
Restored!
  0%|                                                   | 0/200 [00:00<?, ?it/s]Exception KeyError: KeyError(<weakref at 0x7fc729ad8100; to 'tqdm' at 0x7fc729b2e350>,) in <bound method tqdm.__del__ of   0%|                                                   | 0/200 [00:00<?, ?it/s]> ignored
Traceback (most recent call last):
  File "synthesize.py", line 69, in <module>
    synthesize()
  File "synthesize.py", line 59, in synthesize
    _y_hat = sess.run(g.y_hat, {g.x: texts, g.y: y_hat, g.ref: ref})
  File "/home/m/anaconda3/envs/ttse/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/home/m/anaconda3/envs/ttse/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1096, in _run
    % (np_val.shape, subfeed_t.name, str(subfeed_t.get_shape())))
ValueError: Cannot feed value of shape (16, 188) for Tensor u'Placeholder:0', which has shape '(32, 188)'

Can you suggest any solution?
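
The error says the placeholder was built for 32 rows while only 16 were fed. One option is to make the batch size setting match the number of sentences. Another, sketched below, is to pad the batch with filler rows up to the fixed size and discard the extra outputs afterwards. This is a sketch under the assumption that the graph's batch dimension is fixed; `pad_batch` is a hypothetical helper, and plain Python lists stand in for arrays so the example runs without dependencies.

```python
def pad_batch(rows, batch_size, pad_value=0):
    """Pad a list of equal-length rows with filler rows up to batch_size.

    Returns the padded batch and the number of real rows.
    """
    n_real = len(rows)
    if n_real > batch_size:
        raise ValueError("got {} rows, but batch_size is {}".format(n_real, batch_size))
    width = len(rows[0]) if rows else 0
    filler = [[pad_value] * width for _ in range(batch_size - n_real)]
    return rows + filler, n_real


texts = [[1, 2, 3]] * 16  # stand-in for 16 encoded sentences
padded, n_real = pad_batch(texts, batch_size=32)
print(len(padded), n_real)  # 32 16
```

Every input fed in the same call would need the same padding, and only the first `n_real` outputs would be kept.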
