xiph / lpcnet
Efficient neural speech synthesis
License: BSD 3-Clause "New" or "Revised" License
Hello @jmvalin,
I spent 1 s synthesizing 4 s of audio with the C implementation on the CPU. How can I run it in C on the GPU to improve its speed? Thank you very much.
I trained the model on another dataset; the feature size is:
It seems that all data is loaded into CPU memory. My machine has 100 GB+ of CPU memory and 8 GB of GPU memory. I hit a memory error when running train_lpcnet.py.
lambda_1 (Lambda) (None, None, 128) 0 feature_dense2[0][0]
feature_dense2[0][0]
__________________________________________________________________________________________________
concatenate_2 (Concatenate) (None, None, 512) 0 reshape_1[0][0]
reshape_2[0][0]
lambda_1[0][0]
__________________________________________________________________________________________________
gru_a (CuDNNGRU) [(None, None, 384), 1034496 concatenate_2[0][0]
__________________________________________________________________________________________________
concatenate_3 (Concatenate) (None, None, 512) 0 gru_a[0][0]
lambda_1[1][0]
__________________________________________________________________________________________________
gru_b (CuDNNGRU) [(None, None, 16), ( 25440 concatenate_3[0][0]
__________________________________________________________________________________________________
dual_fc (MDense) (None, None, 256) 9216 gru_b[0][0]
==================================================================================================
Total params: 1,259,334
Trainable params: 1,259,334
Non-trainable params: 0
__________________________________________________________________________________________________
Traceback (most recent call last):
File "./src/train_lpcnet.py", line 104, in <module>
pred_in = ulaw2lin(in_data)
File "/home/storage11/heyunchao/Torch_Programs/LPCNet/src/ulaw.py", line 11, in ulaw2lin
return s*scale_1*(np.exp(u/128.*math.log(256))-1)
MemoryError
Could someone tell me what I am doing wrong?
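Not an official fix, but one way to avoid this MemoryError is to convert the u-law data in chunks into a preallocated float32 buffer, instead of materializing the whole float64 expression at once. A minimal sketch, assuming the same formula as the ulaw2lin quoted in the traceback (ulaw2lin_chunked is a hypothetical helper, not part of the repository):

import math
import numpy as np

scale_1 = 32768.0/255.0

def ulaw2lin_chunked(data, chunk=1<<22):
    # data: uint8 u-law samples as loaded by train_lpcnet.py
    out = np.empty(len(data), dtype='float32')
    for start in range(0, len(data), chunk):
        u = data[start:start+chunk].astype('float32') - 128.
        s = np.sign(u)
        u = np.abs(u)
        # same formula as src/ulaw.py, evaluated one small slice at a time
        out[start:start+chunk] = s*scale_1*(np.exp(u/128.*math.log(256))-1)
    return out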
Can it generate a wav in real time on an A53 chip?
Why, even though I follow the README.md exactly, doesn't the loss curve descend at all?
The curve stays above 5.4.
As of January 1, 2019, Mozilla requires that all GitHub projects include this CODE_OF_CONDUCT.md file in the project root. The file has two parts:
If you have any questions about this file, or Code of Conduct policies and procedures, please see Mozilla-GitHub-Standards or email [email protected].
(Message COC001)
Why does it sometimes not have the noise at the start, even though the states are initialized to 0?
When compiling for the test with make test_lpcnet, I got the following error. Does anyone know why?
make test_lpcnet
gcc -Wall -W -Wextra -Wno-unused-function -O3 -g -I../include -mavx2 -mfma -c -o src/nnet.o src/nnet.c
In file included from src/nnet.c:45:0:
src/vec_avx.h: In function ‘sgemv_accum16’:
src/vec_avx.h:167:24: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘y’
float * restrict y;
^
src/vec_avx.h:167:24: error: ‘y’ undeclared (first use in this function)
src/vec_avx.h:167:24: note: each undeclared identifier is reported only once for each function it appears in
src/vec_avx.h: In function ‘sparse_sgemv_accum16’:
src/vec_avx.h:193:24: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘y’
float * restrict y;
^
src/vec_avx.h:193:24: error: ‘y’ undeclared (first use in this function)
make: *** [src/nnet.o] Error 1
I work on a CentOS 7.2 Linux system.
When I run make, I encounter a problem like:
src/vec.h:141:27: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'y'
float * restrict y;
The restrict keyword needs the compile parameter -std=c99. My solution is:
export CFLAGS='-O3 -g -mavx2 -mfma -std=c99'
./configure
make
It seems that there is no check for whether the input ends on the second pass, so if the input data has fewer than 5M frames it will still generate huge feature files.
I suggest changing https://github.com/mozilla/LPCNet/blob/master/src/dump_data.c#L264 to
if (!training || one_pass_completed) break;
Hi, how do I build this on VS2015?
@jmvalin , @drowe67
In your code,
https://github.com/mozilla/LPCNet/blob/b811cade95cf19530ddfbe5aadeaff7c4f89ba77/src/dump_data.c#L160-L162
Is the first line the pitch period?
Is the second line the pitch correlation? What does it stand for?
What's the third line?
Also, the pitch estimation code is hard to read without any comments.
Can you provide some explanation, or links to related papers or materials?
By the way, thanks for sharing your excellent work!
I am trying to figure out how to calculate the voicing decision from the pitch lag and correlation, but I am making no headway. There seems to be a moving threshold in pitch.c, but I cannot quite figure out what's going on. I have put pure-noise and fully voiced samples through dump_data, and at the moment it seems that unvoiced frames get a correlation value < 0.2 (that is, features[37:38]).
I have had a look at the paper as well as some Opus and SILK papers but I am not making any progress.
Does anybody maybe know how to get the voicing decision from these features?
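For what it's worth, a minimal sketch of the naive thresholding described above. It assumes a dump_data layout of 55 float32 features per frame with the pitch correlation at index 37, and the 0.2 threshold is just the empirical value quoted in this thread, not an official voicing decision:

import numpy as np

NB_FEATURES = 55  # assumed frame layout of features.f32

def voicing_flags(path, corr_threshold=0.2):
    feats = np.fromfile(path, dtype='float32').reshape(-1, NB_FEATURES)
    corr = feats[:, 37]            # pitch correlation, per the discussion above
    return corr >= corr_threshold  # True = treat the frame as voiced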
Hi, I note that there is no nnet_data.c in the src folder, so how can I get this file?
Thank you.
Hi. Thanks for the code, first of all.
When I ran training, I got the issue below. I just followed the simple instructions and did not add any extra input. Do you know why this happens? Thank you very much.
Epoch 1/120
Traceback (most recent call last):
File "./src/train_lpcnet.py", line 150, in
model.fit([in_data, in_exc, features, periods], out_data, batch_size=batch_size, epochs=nb_epochs, validation_split=0.0, callbacks=[checkpoint, lpcnet.Sparsify(2000, 40000, 400, (0.1, 0.1, 0.1))])
File "/usr/local/lib/python3.6/site-packages/keras/engine/training.py", line 1039, in fit
validation_steps=validation_steps)
File "/usr/local/lib/python3.6/site-packages/keras/engine/training_arrays.py", line 199, in fit_loop
outs = f(ins_batch)
File "/usr/local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2715, in call
return self._call(inputs)
File "/usr/local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
fetched = self._callable_fn(*array_vals)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1382, in call
run_metadata_ptr)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 519, in exit
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [153600] vs. [64,2400]
[[Node: metrics_1/sparse_categorical_accuracy/Equal = Equal[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](metrics_1/sparse_categorical_accuracy/Reshape, metrics_1/sparse_categorical_accuracy/Cast)]]
[[Node: loss_1/mul/_237 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2235_loss_1/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
/* -40 dB noise floor. */
ac[0] += ac[0]*1e-4 + 320/12/38.;
Why add 320/12/38.?
What should the number be for 44.1 kHz?
Thank you very much ~
Hi,
I note that there is no concat.sh in the src folder, so how can I generate the input file?
Thank you.
When I tried to run python src/train_lpcnet.py features.f32 data.u8, it threw the error below:
==================================================================================================
Total params: 1,239,904
Trainable params: 170,752
Non-trainable params: 1,069,152
__________________________________________________________________________________________________
ulaw std = 27.343956536346006
Traceback (most recent call last):
File "src/train_lpcnet.py", line 109, in <module>
model.load_weights('lpcnet20c_384_10_G16_80.h5')
File "/usr/local/lib/python3.6/site-packages/keras/engine/network.py", line 1166, in load_weights
f, self.layers, reshape=reshape)
File "/usr/local/lib/python3.6/site-packages/keras/engine/saving.py", line 1030, in load_weights_from_hdf5_group
str(len(filtered_layers)) + ' layers.')
ValueError: You are trying to load a weight file containing 10 layers into a model with 9 layers.
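Not an authoritative fix, but a common Keras workaround when layer counts differ is to load weights by layer name, which silently skips layers whose names don't match; whether that is appropriate here depends on why the model and the checkpoint architectures diverged (e.g. a model built with a different GRU implementation than the one that was saved):

# hedged sketch: load only the layers whose names match the checkpoint
model.load_weights('lpcnet20c_384_10_G16_80.h5', by_name=True)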
Hello,
Is it possible to generate the speech waveform from an unquantized version of the coding parameters?
Is it possible to encode from wav and decode to wav?
For example I downloaded wav speech from https://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0010_8k.wav
ffmpeg -i demo_data/OSR_us_000_0010_8k.wav
ffmpeg version 4.1 Copyright (c) 2000-2018 the FFmpeg developers
built with Apple LLVM version 10.0.0 (clang-1000.10.44.4)
configuration: --prefix=/usr/local/Cellar/ffmpeg/4.1 --enable-shared --enable-pthreads --enable-version3 --enable-hardcoded-tables --enable-avresample --cc=clang --host-cflags= --host-ldflags= --enable-ffplay --enable-gpl --enable-libmp3lame --enable-libopus --enable-libsnappy --enable-libtheora --enable-libvorbis --enable-libvpx --enable-libx264 --enable-libx265 --enable-libxvid --enable-lzma --enable-libvidstab --enable-opencl --enable-videotoolbox
libavutil 56. 22.100 / 56. 22.100
libavcodec 58. 35.100 / 58. 35.100
libavformat 58. 20.100 / 58. 20.100
libavdevice 58. 5.100 / 58. 5.100
libavfilter 7. 40.101 / 7. 40.101
libavresample 4. 0. 0 / 4. 0. 0
libswscale 5. 3.100 / 5. 3.100
libswresample 3. 3.100 / 3. 3.100
libpostproc 55. 3.100 / 55. 3.100
Guessed Channel Layout for Input Stream #0.0 : mono
Input #0, wav, from 'demo_data/OSR_us_000_0010_8k.wav':
Duration: 00:00:33.62, bitrate: 128 kb/s
Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 8000 Hz, mono, s16, 128 kb/s
At least one output file must be specified
soxi demo_data/OSR_us_000_0010_8k.wav
Input File : 'demo_data/OSR_us_000_0010_8k.wav'
Channels : 1
Sample Rate : 8000
Precision : 16-bit
Duration : 00:00:33.62 = 268985 samples ~ 2521.73 CDDA sectors
File Size : 538k
Bit Rate : 128k
Sample Encoding: 16-bit Signed Integer PCM
convert wav to pcm:
ffmpeg -y -i demo_data/OSR_us_000_0010_8k.wav -acodec pcm_s16le -f s16le -ac 1 -ar 16000 demo_data/input.pcm
real 0m0.039s
user 0m0.027s
sys 0m0.011s
encode:
time ./lpcnet_demo -encode demo_data/input.pcm demo_data/compressed.bin
real 0m0.688s
user 0m0.671s
sys 0m0.015s
decode (decoding seems to take a long time on the CPU):
time ./lpcnet_demo -decode demo_data/compressed.bin demo_data/output.pcm
real 0m7.000s
user 0m6.966s
sys 0m0.027s
convert pcm to wav:
time ffmpeg -f s16le -ar 16000 -ac 1 -i demo_data/output.pcm demo_data/output.wav
real 0m0.026s
user 0m0.015s
sys 0m0.010s
I wonder, are my ffmpeg settings correct?
It looks like there is an example using sox here:
#4 (comment)
Hello,
I would like to connect a Tacotron2 model to LPCNet.
Is there a way to convert the 80 mel coefficients (the output of Taco2) into the 18 Bark-scale + 2 pitch parameters (the input of LPCNet)?
Somewhat related: when reading about the Bark scale, e.g. on Wikipedia, there are usually 24 coefficients, and I don't understand how only 18 are computed here. Even taking into account the 16 kHz sampling, that would leave 22 of them, right?
Thanks a lot :)
I've used the current model, tried it first with the given demo, and found that the quality is not as good as the demo on xiph.org. Then I trained on my own GPU with all default settings (the McGill database) and the quality is better, but still worse than the demo. What could be the reason?
I have trained a model using the old code, and the model performs very well. In the last 5 days there have been some code updates. Could @jmvalin or someone else explain these updates? Do I need to re-train the model using the new dump_data binary?
Could I know about the speed of LPCNet?
Hi all
In my experiments training LPCNet for TTS, I found that LPCNet is hard to converge and very likely to end up in a local minimum, and the quality of the generated voice can vary over a large range.
I have tried reducing the learning rate, changing the batch size, and changing the training criterion, but none of these seem to help.
Do you have the same problem, and any suggestions for quickly training a stable LPCNet vocoder?
Thanks
In lpcnet.py, the following line does not seem to have the intended effect:
if self.batch < self.t_start or ((self.batch-self.t_start) % self.interval != 0 and self.batch < self.t_end):
#print("don't constrain");
pass
I assume the intention is to run the "sparsification" every self.interval batches from self.t_start to self.t_end. However, what it actually does is, in addition to this, run it for every single batch after self.t_end.
So I suggest the following correction:
if self.batch < self.t_start or ((self.batch-self.t_start) % self.interval != 0) or self.batch >= self.t_end:
#print("don't constrain");
pass
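To make the difference concrete, here is a small self-contained check of the two conditions, using the Sparsify(2000, 40000, 400, ...) values passed in train_lpcnet.py; constrains_old and constrains_new are hypothetical names for this sketch:

t_start, t_end, interval = 2000, 40000, 400

def constrains_old(batch):
    # original condition: the constraint runs whenever the "pass" branch is NOT taken
    return not (batch < t_start or ((batch - t_start) % interval != 0 and batch < t_end))

def constrains_new(batch):
    return not (batch < t_start or ((batch - t_start) % interval != 0) or batch >= t_end)

print(constrains_old(2400), constrains_new(2400))    # True True: both run every interval
print(constrains_old(40001), constrains_new(40001))  # True False: only the old one keeps running after t_end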
@jmvalin Thanks for hosting this interesting project. Among the use cases for LPCNet you mention TTS (text-to-speech). How do we synthesize speech from text using test_lpcnet.py?
If this is not the way to implement TTS, do you have any recommendation on where to start with LPCNet for implementing an end-to-end TTS system?
I would like to point out that identifiers like "_LPCNET_H_" and "_NNET_H_" do not conform to the naming rules of the C language standard: identifiers beginning with an underscore followed by an uppercase letter are reserved for the implementation.
Would you like to adjust your selection of unique names?
Your paper is very impressive. 🙂
I'd like to train a model using 24kHz data instead of 16kHz data.
Is it easily possible?
I trained the model on my own dataset. The generated voice is very good, except for the problem of noise at the start of the wave.
The reason, I think, is that the training data are random clips from the merged audio, so many training samples do not start with silence and only a few do.
Does anyone else have this problem?
Has anyone experimented with this? What was the result?
train_lpcnet.py
pcm_file = 'test2.s16' # 16 bit unsigned short PCM samples
data = np.fromfile(pcm_file, dtype='uint8')
sig = np.reshape(data[0::4], (nb_frames, pcm_chunk_size, 1))
pred = np.reshape(data[1::4], (nb_frames, pcm_chunk_size, 1))
in_exc = np.reshape(data[2::4], (nb_frames, pcm_chunk_size, 1))
out_exc = np.reshape(data[3::4], (nb_frames, pcm_chunk_size, 1))
The pcm_file is a 16-bit file, so I expected it to be split into 2 parts, but there are four variables.
What do these four variables represent? Thanks.
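My own reading of the format, offered as an assumption rather than an authoritative answer: despite the comment, the file written by dump_data in this era contains four interleaved u-law bytes per sample, which is why the script strides by 4:

import numpy as np

data = np.fromfile('test2.s16', dtype='uint8')  # u-law bytes, despite the .s16 name
sig     = data[0::4]  # assumed: u-law of the past signal sample
pred    = data[1::4]  # assumed: u-law of the linear (LPC) prediction
in_exc  = data[2::4]  # assumed: u-law of the previous excitation (network input)
out_exc = data[3::4]  # assumed: u-law of the current excitation (training target)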
At line 101, in the implementation of frame_analysis:
RNN_COPY(st->analysis_mem, &in[FRAME_SIZE-OVERLAP_SIZE], OVERLAP_SIZE);
Should it be "RNN_COPY(&(st->analysis_mem[OVERLAP_SIZE-FRAME_SIZE]), in, FRAME_SIZE)"?
Hi. I have another question. I used two different datasets, one about 8 times larger than the other,
yet the extracted features are exactly the same size. Is there anything wrong?
Best
Hi, I find that when doing sparsification on GRU_A during training, the sparsified weights (with shape (384, 1152)) form vertical strips of 1. However, when dumping the model, the weight matrix (with shape (384, 1152)) has horizontal strips of non-zero values.
Why is this happening? This might be a silly question, but it troubled me the entire day.
Thanks!
Has anyone tried it?
Would you like to add more error handling for return values from functions like the following?
FYI: The following changes were made to this repository's wiki:
Defacing spam has been removed.
Restricting write access to contributors is strongly encouraged; please make that change (documentation).
These changes were made as the result of a recent automated defacement of publicly writable wikis.
I'm trying to port this project to PyTorch, and I'd like to know what I should change if I want to reuse the C files. Do I only need to make the PyTorch model match the Keras model? @jmvalin I've now run into trouble with the C program when running the sparse GRU layer; the results differ from the PyTorch model.
How do I connect LPCNet with Mozilla TTS?
Hi!
I'm trying to increase the speed of LPCNet with OpenMP (and want to make a PR afterwards). I ran the profiler, and it says 72% of the time was spent in the sparse_sgemv_accum16 function, so I decided to parallelize that function first. But unfortunately, my code isn't working properly. Do you have any idea what's wrong with the code below?
const int *precomputed_idx[rows];
const float *precomputed_weights[rows];
for (i=0;i<rows;i+=16)
{
precomputed_weights[i] = weights;
weights += 16 * (*idx);
precomputed_idx[i] = idx++;
idx += *precomputed_idx[i];
}
float test_out[4][2000];
#pragma omp parallel num_threads(2)
{
int thread_id;
const int *local_idx;
float * restrict y;
__m256 vy0, vy8;
int cols;
const float *local_weights;
thread_id = omp_get_thread_num();
#pragma omp for schedule(static)
for (i=0;i<rows;i+=16)
{
local_weights = precomputed_weights[i];
y = &out[i];
vy0 = _mm256_loadu_ps(&y[0]);
vy8 = _mm256_loadu_ps(&y[8]);
local_idx = precomputed_idx[i];
cols = *local_idx++;
for (j=0;j<cols;j++)
{
int id;
__m256 vxj;
__m256 vw;
id = *local_idx++;
vxj = _mm256_broadcast_ss(&x[id]);
vw = _mm256_loadu_ps(&local_weights[0]);
vy0 = _mm256_fmadd_ps(vw, vxj, vy0);
vw = _mm256_loadu_ps(&local_weights[8]);
vy8 = _mm256_fmadd_ps(vw, vxj, vy8);
local_weights += 16;
}
_mm256_storeu_ps(&test_out[thread_id][i], vy0);
_mm256_storeu_ps(&test_out[thread_id][i + 8], vy8);
}
}
I get a segmentation fault at the line vxj = _mm256_broadcast_ss(&x[id]);
because of a definitely incorrect id value. It always works without errors when I set the number of threads to 1.
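One guess, based only on the snippet above: the loop counter j is not declared inside the parallel region (unlike local_idx and local_weights), so if it is a function-scope variable it is shared between the two threads. A race on j would corrupt the inner loop count, advance local_idx past the index list, and produce exactly this kind of invalid id. Declaring j inside the parallel block, or adding private(j) to the pragma, may be worth trying; note also that test_out[4][2000] overflows if rows exceeds 2000.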
LPCNet performs very well on features extracted from ground-truth waves. However, when the features are predicted by an end-to-end model or an SPSS model, there are some audible artifacts in the generated waves.
Does anyone know of post-processing techniques for the synthesized waves that reduce the artifacts?
Some synthesized samples:
e2e_lpcnet_samples_share.zip
Can I modify the code to train the LPCNet vocoder with a 12.5 ms frame size? Will it be as good as 10 ms?
Hi,
Thanks for this nice work. I am new to LPCNet and am currently trying to use lpcnet_demo. Unfortunately, running it always gives me the following error message:
Illegal Instruction
Is there any explanation of the possible reasons why I am getting this error? I have successfully built the project on my Linux machine and used the ffmpeg command to convert from wav to 16-bit PCM.
Regards
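A guess, not a confirmed diagnosis: the build lines quoted earlier on this page pass -mavx2 -mfma, so the resulting binary requires AVX2/FMA support; running it on a CPU without those extensions raises SIGILL ("Illegal Instruction"). Rebuilding with flags that match the target CPU may be worth trying.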
And you did not use the dual fc?