xiph / lpcnet
Efficient neural speech synthesis
License: BSD 3-Clause "New" or "Revised" License
Hello @jmvalin,
I spent 1 s synthesizing 4 s of audio with the C implementation on the CPU. How can I run it in C on the GPU to improve its speed? Thank you very much.
I trained the model on another dataset; the feature size is:
It seems that all data is loaded into CPU memory. My machine has 100 GB+ of CPU memory and 8 GB of GPU memory. I hit a memory error when running train_lpcnet.py.
lambda_1 (Lambda) (None, None, 128) 0 feature_dense2[0][0]
feature_dense2[0][0]
__________________________________________________________________________________________________
concatenate_2 (Concatenate) (None, None, 512) 0 reshape_1[0][0]
reshape_2[0][0]
lambda_1[0][0]
__________________________________________________________________________________________________
gru_a (CuDNNGRU) [(None, None, 384), 1034496 concatenate_2[0][0]
__________________________________________________________________________________________________
concatenate_3 (Concatenate) (None, None, 512) 0 gru_a[0][0]
lambda_1[1][0]
__________________________________________________________________________________________________
gru_b (CuDNNGRU) [(None, None, 16), ( 25440 concatenate_3[0][0]
__________________________________________________________________________________________________
dual_fc (MDense) (None, None, 256) 9216 gru_b[0][0]
==================================================================================================
Total params: 1,259,334
Trainable params: 1,259,334
Non-trainable params: 0
__________________________________________________________________________________________________
Traceback (most recent call last):
File "./src/train_lpcnet.py", line 104, in <module>
pred_in = ulaw2lin(in_data)
File "/home/storage11/heyunchao/Torch_Programs/LPCNet/src/ulaw.py", line 11, in ulaw2lin
return s*scale_1*(np.exp(u/128.*math.log(256))-1)
MemoryError
Could someone tell me what I am doing wrong?
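Not an official fix, but one way to avoid this MemoryError is to convert the u-law data in chunks into a preallocated float32 buffer, instead of materializing the whole float64 expression at once. A minimal sketch, assuming the same formula as the ulaw2lin quoted in the traceback (ulaw2lin_chunked is a hypothetical helper, not part of the repository):

import math
import numpy as np

scale_1 = 32768.0/255.0

def ulaw2lin_chunked(data, chunk=1<<22):
    # data: uint8 u-law samples as loaded by train_lpcnet.py
    out = np.empty(len(data), dtype='float32')
    for start in range(0, len(data), chunk):
        u = data[start:start+chunk].astype('float32') - 128.
        s = np.sign(u)
        u = np.abs(u)
        # same formula as src/ulaw.py, evaluated one small slice at a time
        out[start:start+chunk] = s*scale_1*(np.exp(u/128.*math.log(256))-1)
    return out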
Can it generate a wav in real time on an A53 chip?
Why, even though I follow the README.md exactly, doesn't the loss curve descend at all?
The curve stays above 5.4.
As of January 1, 2019, Mozilla requires that all GitHub projects include this CODE_OF_CONDUCT.md file in the project root. The file has two parts:
If you have any questions about this file, or Code of Conduct policies and procedures, please see Mozilla-GitHub-Standards or email [email protected].
(Message COC001)
Why does it sometimes not have the noise at the start, even though the states are initialized to 0?
When compiling for the test with make test_lpcnet, I got the following error. Does anyone know why?
make test_lpcnet
gcc -Wall -W -Wextra -Wno-unused-function -O3 -g -I../include -mavx2 -mfma -c -o src/nnet.o src/nnet.c
In file included from src/nnet.c:45:0:
src/vec_avx.h: In function ‘sgemv_accum16’:
src/vec_avx.h:167:24: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘y’
float * restrict y;
^
src/vec_avx.h:167:24: error: ‘y’ undeclared (first use in this function)
src/vec_avx.h:167:24: note: each undeclared identifier is reported only once for each function it appears in
src/vec_avx.h: In function ‘sparse_sgemv_accum16’:
src/vec_avx.h:193:24: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘y’
float * restrict y;
^
src/vec_avx.h:193:24: error: ‘y’ undeclared (first use in this function)
make: *** [src/nnet.o] Error 1
I work on a CentOS 7.2 Linux system.
When I run make, I encounter a problem like:
src/vec.h:141:27: error: expected '=', ',', ';', 'asm' or '__attribute__' before 'y'
float * restrict y;
The restrict keyword needs the compile parameter -std=c99. My solution is:
export CFLAGS='-O3 -g -mavx2 -mfma -std=c99'
./configure
make
It seems that there is no check for whether the input ends on the second pass, so if the input data has fewer than 5M frames it will still generate huge feature files.
I suggest changing https://github.com/mozilla/LPCNet/blob/master/src/dump_data.c#L264 to
if (!training || one_pass_completed) break;
Hi, how do I build this on VS2015?
@jmvalin , @drowe67
In your code,
https://github.com/mozilla/LPCNet/blob/b811cade95cf19530ddfbe5aadeaff7c4f89ba77/src/dump_data.c#L160-L162
Is the first line the pitch period?
Is the second line the pitch correlation? What does it stand for?
What's the third line?
Also, the pitch estimation code is hard to read without any comments.
Can you provide some explanation, or links to related papers or materials?
By the way, thanks for sharing your excellent work!
I am trying to figure out how to calculate the voicing decision from the pitch lag and correlation, but I am making no headway. There seems to be a moving threshold in pitch.c, but I cannot quite figure out what's going on. I have put pure-noise and fully voiced samples through dump_data, and at the moment it seems that unvoiced frames get a correlation value < 0.2 (that is, features[37:38]).
I have had a look at the paper as well as some Opus and SILK papers but I am not making any progress.
Does anybody maybe know how to get the voicing decision from these features?
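For what it's worth, a minimal sketch of the naive thresholding described above. It assumes a dump_data layout of 55 float32 features per frame with the pitch correlation at index 37, and the 0.2 threshold is just the empirical value quoted in this thread, not an official voicing decision:

import numpy as np

NB_FEATURES = 55  # assumed frame layout of features.f32

def voicing_flags(path, corr_threshold=0.2):
    feats = np.fromfile(path, dtype='float32').reshape(-1, NB_FEATURES)
    corr = feats[:, 37]            # pitch correlation, per the discussion above
    return corr >= corr_threshold  # True = treat the frame as voiced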
Hi, I note that there is no nnet_data.c in the src folder, so how can I get this file?
Thank you.
Hi. Thanks for the code, first of all.
When I ran training, I got the issue below. I just followed the simple instructions and did not add any extra input. Do you know why this happens? Thank you very much.
Epoch 1/120
Traceback (most recent call last):
File "./src/train_lpcnet.py", line 150, in
model.fit([in_data, in_exc, features, periods], out_data, batch_size=batch_size, epochs=nb_epochs, validation_split=0.0, callbacks=[checkpoint, lpcnet.Sparsify(2000, 40000, 400, (0.1, 0.1, 0.1))])
File "/usr/local/lib/python3.6/site-packages/keras/engine/training.py", line 1039, in fit
validation_steps=validation_steps)
File "/usr/local/lib/python3.6/site-packages/keras/engine/training_arrays.py", line 199, in fit_loop
outs = f(ins_batch)
File "/usr/local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2715, in call
return self._call(inputs)
File "/usr/local/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
fetched = self._callable_fn(*array_vals)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1382, in call
run_metadata_ptr)
File "/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 519, in exit
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: Incompatible shapes: [153600] vs. [64,2400]
[[Node: metrics_1/sparse_categorical_accuracy/Equal = Equal[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](metrics_1/sparse_categorical_accuracy/Reshape, metrics_1/sparse_categorical_accuracy/Cast)]]
[[Node: loss_1/mul/_237 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2235_loss_1/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
/* -40 dB noise floor. */
ac[0] += ac[0]*1e-4 + 320/12/38.;
Why add 320/12/38.?
What should the number be for 44.1 kHz?
Thank you very much ~
Hi,
I note that there is no concat.sh in the src folder, so how can I generate the input file?
Thank you.
When I tried to run python src/train_lpcnet.py features.f32 data.u8, it threw the error below:
==================================================================================================
Total params: 1,239,904
Trainable params: 170,752
Non-trainable params: 1,069,152
__________________________________________________________________________________________________
ulaw std = 27.343956536346006
Traceback (most recent call last):
File "src/train_lpcnet.py", line 109, in <module>
model.load_weights('lpcnet20c_384_10_G16_80.h5')
File "/usr/local/lib/python3.6/site-packages/keras/engine/network.py", line 1166, in load_weights
f, self.layers, reshape=reshape)
File "/usr/local/lib/python3.6/site-packages/keras/engine/saving.py", line 1030, in load_weights_from_hdf5_group
str(len(filtered_layers)) + ' layers.')
ValueError: You are trying to load a weight file containing 10 layers into a model with 9 layers.
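Not an authoritative fix, but a common Keras workaround when layer counts differ is to load weights by layer name, which silently skips layers whose names don't match; whether that is appropriate here depends on why the model and the checkpoint architectures diverged (e.g. a model built with a different GRU implementation than the one that was saved):

# hedged sketch: load only the layers whose names match the checkpoint
model.load_weights('lpcnet20c_384_10_G16_80.h5', by_name=True)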
Hello,
Is it possible to generate the speech waveform from an unquantized version of the coding parameters?
Is it possible to encode from wav and decode to wav?
For example I downloaded wav speech from https://www.voiptroubleshooter.com/open_speech/american/OSR_us_000_0010_8k.wav
ffmpeg -i demo_data/OSR_us_000_0010_8k.wav
ffmpeg version 4.1 Copyright (c) 2000-2018 the FFmpeg developers
built with Apple LLVM version 10.0.0 (clang-1000.10.44.4)
configuration: --prefix=/usr/local/Cellar/ffmpeg/4.1 --enable-shared --enable-pthreads --enable-version3 --enable-hardcoded-tables --enable-avresample --cc=clang --host-cflags= --host-ldflags= --enable-ffplay --enable-gpl --enable-libmp3lame --enable-libopus --enable-libsnappy --enable-libtheora --enable-libvorbis --enable-libvpx --enable-libx264 --enable-libx265 --enable-libxvid --enable-lzma --enable-libvidstab --enable-opencl --enable-videotoolbox
libavutil 56. 22.100 / 56. 22.100
libavcodec 58. 35.100 / 58. 35.100
libavformat 58. 20.100 / 58. 20.100
libavdevice 58. 5.100 / 58. 5.100
libavfilter 7. 40.101 / 7. 40.101
libavresample 4. 0. 0 / 4. 0. 0
libswscale 5. 3.100 / 5. 3.100
libswresample 3. 3.100 / 3. 3.100
libpostproc 55. 3.100 / 55. 3.100
Guessed Channel Layout for Input Stream #0.0 : mono
Input #0, wav, from 'demo_data/OSR_us_000_0010_8k.wav':
Duration: 00:00:33.62, bitrate: 128 kb/s
Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 8000 Hz, mono, s16, 128 kb/s
At least one output file must be specified
soxi demo_data/OSR_us_000_0010_8k.wav
Input File : 'demo_data/OSR_us_000_0010_8k.wav'
Channels : 1
Sample Rate : 8000
Precision : 16-bit
Duration : 00:00:33.62 = 268985 samples ~ 2521.73 CDDA sectors
File Size : 538k
Bit Rate : 128k
Sample Encoding: 16-bit Signed Integer PCM
convert wav to pcm:
ffmpeg -y -i demo_data/OSR_us_000_0010_8k.wav -acodec pcm_s16le -f s16le -ac 1 -ar 16000 demo_data/input.pcm
real 0m0.039s
user 0m0.027s
sys 0m0.011s
encode:
time ./lpcnet_demo -encode demo_data/input.pcm demo_data/compressed.bin
real 0m0.688s
user 0m0.671s
sys 0m0.015s
decode (decoding seems to take a long time on the CPU):
time ./lpcnet_demo -decode demo_data/compressed.bin demo_data/output.pcm
real 0m7.000s
user 0m6.966s
sys 0m0.027s
convert pcm to wav:
time ffmpeg -f s16le -ar 16000 -ac 1 -i demo_data/output.pcm demo_data/output.wav
real 0m0.026s
user 0m0.015s
sys 0m0.010s
I wonder, are my ffmpeg settings correct?
It looks like there is an example using sox here:
#4 (comment)
Hello,
I would like to connect a Tacotron2 model to LPCNet.
Is there a way to convert the 80 mel coefficients (the output of Taco2) into the 18 Bark-scale + 2 pitch parameters (the input of LPCNet)?
Somewhat related: when reading about the Bark scale, e.g. on Wikipedia, there are usually 24 coefficients, and I don't understand how only 18 are computed here. Even taking into account the 16 kHz sampling, that would leave 22 of them, right?
Thanks a lot :)
I've used the current model, tried it first with the given demo, and found that the quality is not as good as the demo on xiph.org. Then I trained on my own GPU with all default settings (the McGill database) and the quality is better, but still worse than the demo. What could be the reason?
I have trained a model using the old code, and the model performs very well. In the last 5 days there have been some code updates. Could @jmvalin or someone else explain these updates? Do I need to re-train the model using the new dump_data binary?
Could I know about the speed of LPCNet?
Hi all
In my experiments training LPCNet for TTS, I found that LPCNet is hard to converge and very likely to end up in a local minimum, and the quality of the generated voice can vary over a large range.
I have tried reducing the learning rate, changing the batch size, and changing the training criterion, but none of these seem to help.
Do you have the same problem, and any suggestions for quickly training a stable LPCNet vocoder?
Thanks
In lpcnet.py, the following line does not seem to have the intended effect:
if self.batch < self.t_start or ((self.batch-self.t_start) % self.interval != 0 and self.batch < self.t_end):
#print("don't constrain");
pass
I assume the intention is to run the "sparsification" every self.interval batches from self.t_start to self.t_end. However, what it actually does is, in addition to this, run it for every single batch after self.t_end.
So I suggest the following correction:
if self.batch < self.t_start or ((self.batch-self.t_start) % self.interval != 0) or self.batch >= self.t_end:
#print("don't constrain");
pass
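To make the difference concrete, here is a small self-contained check of the two conditions, using the Sparsify(2000, 40000, 400, ...) values passed in train_lpcnet.py; constrains_old and constrains_new are hypothetical names for this sketch:

t_start, t_end, interval = 2000, 40000, 400

def constrains_old(batch):
    # original condition: the constraint runs whenever the "pass" branch is NOT taken
    return not (batch < t_start or ((batch - t_start) % interval != 0 and batch < t_end))

def constrains_new(batch):
    return not (batch < t_start or ((batch - t_start) % interval != 0) or batch >= t_end)

print(constrains_old(2400), constrains_new(2400))    # True True: both run every interval
print(constrains_old(40001), constrains_new(40001))  # True False: only the old one keeps running after t_end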
@jmvalin Thanks for hosting this interesting project. Among the use cases for LPCNet you mention TTS (text-to-speech). How do we synthesize speech from text using test_lpcnet.py?
If this is not the way to implement TTS, do you have any recommendation on where to start with LPCNet for implementing an end-to-end TTS system?
I would like to point out that identifiers like "_LPCNET_H_" and "_NNET_H_" do not conform to the naming rules of the C language standard: identifiers beginning with an underscore followed by an uppercase letter are reserved for the implementation.
Would you like to adjust your selection of unique names?
Your paper is very impressive. 🙂
I'd like to train a model using 24kHz data instead of 16kHz data.
Is it easily possible?
I trained the model on my own dataset. The generated voice is very good, except for the problem of noise at the start of the wave.
The reason, I think, is that the training data are random clips from the merged audio, so many training samples do not start with silence and only a few do.
Does anyone else have this problem?
Has anyone experimented with this? What was the result?
train_lpcnet.py
pcm_file = 'test2.s16' # 16 bit unsigned short PCM samples
data = np.fromfile(pcm_file, dtype='uint8')
sig = np.reshape(data[0::4], (nb_frames, pcm_chunk_size, 1))
pred = np.reshape(data[1::4], (nb_frames, pcm_chunk_size, 1))
in_exc = np.reshape(data[2::4], (nb_frames, pcm_chunk_size, 1))
out_exc = np.reshape(data[3::4], (nb_frames, pcm_chunk_size, 1))
The pcm_file is a 16-bit file, so I expected it to be split into 2 parts, but there are four variables.
What do these four variables represent? Thanks.
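My own reading of the format, offered as an assumption rather than an authoritative answer: despite the comment, the file written by dump_data in this era contains four interleaved u-law bytes per sample, which is why the script strides by 4:

import numpy as np

data = np.fromfile('test2.s16', dtype='uint8')  # u-law bytes, despite the .s16 name
sig     = data[0::4]  # assumed: u-law of the past signal sample
pred    = data[1::4]  # assumed: u-law of the linear (LPC) prediction
in_exc  = data[2::4]  # assumed: u-law of the previous excitation (network input)
out_exc = data[3::4]  # assumed: u-law of the current excitation (training target)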
At line 101, in the implementation of frame_analysis:
RNN_COPY(st->analysis_mem, &in[FRAME_SIZE-OVERLAP_SIZE], OVERLAP_SIZE);
Should it be "RNN_COPY(&(st->analysis_mem[OVERLAP_SIZE-FRAME_SIZE]), in, FRAME_SIZE)"?
Hi. I have another question. I used two different datasets, one about 8 times larger than the other,
yet the extracted features are exactly the same size. Is there anything wrong?
Best
Hi, I find that when doing sparsification on GRU_A during training, the sparsified weights (with shape (384, 1152)) form vertical strips of 1. However, when dumping the model, the weight matrix (with shape (384, 1152)) has horizontal strips of non-zero values.
Why is this happening? This might be a silly question, but it troubled me the entire day.
Thanks!
Has anyone tried it?
Would you like to add more error handling for return values from functions like the following?
FYI: The following changes were made to this repository's wiki:
Defacing spam has been removed.
Restricting write access to contributors is strongly encouraged; please make that change (documentation).
These changes were made as the result of a recent automated defacement of publicly writable wikis.
I'm trying to port this project to PyTorch, and I'd like to know what I should change if I want to reuse the C files. Do I only need to make the PyTorch model match the Keras model? @jmvalin I've now run into trouble with the C program when running the sparse GRU layer; the results differ from the PyTorch model.
How do I connect LPCNet with Mozilla TTS?
Hi!
I'm trying to increase the speed of LPCNet with OpenMP (and want to make a PR afterwards). I ran the profiler, and it says 72% of the time was spent in the sparse_sgemv_accum16 function, so I decided to parallelize that function first. But unfortunately, my code isn't working properly. Do you have any idea what's wrong with the code below?
const int *precomputed_idx[rows];
const float *precomputed_weights[rows];
for (i=0;i<rows;i+=16)
{
precomputed_weights[i] = weights;
weights += 16 * (*idx);
precomputed_idx[i] = idx++;
idx += *precomputed_idx[i];
}
float test_out[4][2000];
#pragma omp parallel num_threads(2)
{
int thread_id;
const int *local_idx;
float * restrict y;
__m256 vy0, vy8;
int cols;
const float *local_weights;
thread_id = omp_get_thread_num();
#pragma omp for schedule(static)
for (i=0;i<rows;i+=16)
{
local_weights = precomputed_weights[i];
y = &out[i];
vy0 = _mm256_loadu_ps(&y[0]);
vy8 = _mm256_loadu_ps(&y[8]);
local_idx = precomputed_idx[i];
cols = *local_idx++;
for (j=0;j<cols;j++)
{
int id;
__m256 vxj;
__m256 vw;
id = *local_idx++;
vxj = _mm256_broadcast_ss(&x[id]);
vw = _mm256_loadu_ps(&local_weights[0]);
vy0 = _mm256_fmadd_ps(vw, vxj, vy0);
vw = _mm256_loadu_ps(&local_weights[8]);
vy8 = _mm256_fmadd_ps(vw, vxj, vy8);
local_weights += 16;
}
_mm256_storeu_ps(&test_out[thread_id][i], vy0);
_mm256_storeu_ps(&test_out[thread_id][i + 8], vy8);
}
}
I get a segmentation fault at the line vxj = _mm256_broadcast_ss(&x[id]);
because of a definitely incorrect id value. It always works without errors when I set the number of threads to 1.
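One guess, based only on the snippet above: the loop counter j is not declared inside the parallel region (unlike local_idx and local_weights), so if it is a function-scope variable it is shared between the two threads. A race on j would corrupt the inner loop count, advance local_idx past the index list, and produce exactly this kind of invalid id. Declaring j inside the parallel block, or adding private(j) to the pragma, may be worth trying; note also that test_out[4][2000] overflows if rows exceeds 2000.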
LPCNet performs very well on features extracted from ground-truth waves. However, when the features are predicted by an end-to-end model or an SPSS model, there are some audible artifacts in the generated waves.
Does anyone know of post-processing techniques for the synthesized waves that reduce the artifacts?
Some synthesized samples:
e2e_lpcnet_samples_share.zip
Can I modify the code to train the LPCNet vocoder with a 12.5 ms frame size? Will it be as good as 10 ms?
Hi,
Thanks for this nice work. I am new to LPCNet and am currently trying to use lpcnet_demo. Unfortunately, running it always gives me the following error message:
Illegal Instruction
Is there any explanation of the possible reasons why I am getting this error? I have successfully built the project on my Linux machine and used the ffmpeg command to convert from wav to 16-bit PCM.
Regards
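A guess, not a confirmed diagnosis: the build lines quoted earlier on this page pass -mavx2 -mfma, so the resulting binary requires AVX2/FMA support; running it on a CPU without those extensions raises SIGILL ("Illegal Instruction"). Rebuilding with flags that match the target CPU may be worth trying.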
And you did not use the dual fc?