pannous / caffe-speech-recognition

Speech Recognition with the Caffe deep learning framework, migrating to

Home Page: https://github.com/pannous/tensorflow-speech-recognition

Python 31.35% Ruby 7.05% Swift 2.44% Shell 0.25% Jupyter Notebook 58.93%

caffe-speech-recognition's Introduction

Speech Recognition with BVLC caffe

Speech Recognition with the caffe deep learning framework

UPDATE: We are migrating to tensorflow

This project is quite fresh and only the first of three milestones has been reached. Even so, it may already be useful if you just want to train a handful of commands or options (1, 2, 3, ..., yes/no/cancel/...).

training spoken numbers:

get spectrogram training images from http://pannous.net/spoken_numbers.tar (470 MB)
start ./train.sh
test with ipython notebook test-speech-recognition.ipynb or caffe test ... or <caffe-root>/python/classify.py
99% accuracy, nice!
online recognition and learning with ./recognition-server.py and ./record.py scripts

Sample spectrogram, Karen uttering 'zero' with 160 words per minute.
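
The file names appear to encode the label, the speaker and the speaking rate (0_Karen_160.png: digit 0, speaker Karen, 160 words per minute). A minimal sketch of parsing such names and writing index lines in the "path label" form that Caffe's image data layer reads. The helper names parse_name and index_lines are hypothetical and not part of this repository, and the sketch assumes speaker names contain no underscore:

```python
import os

def parse_name(filename):
    """Split '<digit>_<speaker>_<wpm>.png' (or '.wav.png') into its parts."""
    stem = os.path.basename(filename).split(".")[0]  # drops '.png' or '.wav.png'
    digit, speaker, wpm = stem.split("_")            # assumes no '_' in the speaker name
    return int(digit), speaker, int(wpm)

def index_lines(filenames):
    """Return one 'path label' line per image, using the spoken digit as label."""
    return ["%s %d" % (name, parse_name(name)[0]) for name in filenames]

print(parse_name("0_Karen_160.png"))
print(index_lines(["3_Princess_220.wav.png"]))
```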

training words:

4GB of training data
net topology: work in progress ...
todo: use upcoming new caffe LSTM layers etc
UPDATE LSTMs get rolling, still not merged
UPDATE: the caffe project leaders have a restrictive merging policy, and this pull request was postponed many times without ever being merged, so we are migrating to tensorflow
todo: add extra categories for a) silence b) common noises like typing, achoo c) ALL other noises

training speech:

todo!
100GB of training data here: http://www.openslr.org/12/
TIMIT dataset $27,000.00 membership fee or $250 for non-members+$2400 under research-only license?
combine with google n-grams

Theoretical background: papers

A. Graves and N. Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In ICML, 2014

O. Vinyals, S. V. Ravuri, and D. Povey. Revisiting recurrent neural networks for robust ASR. In ICASSP, 2012

Andrew Ng et al / Baidu

Hinton et al / Toronto

good old Hinton

Schmidhuber et al using new 'ClockWork-RNNs'

The book: Automatic Speech Recognition: A Deep Learning Approach (Signals and Communication Technology) Hardcover – November 11, 2014 by Dong Yu (Author) and Li Deng (Author)

Related work

Also see the Kaldi project, which seems a bit messy but already uses deep learning with LSTM.
Another experimental LSTM network, which works out of the box: Currennt

caffe-speech-recognition's Issues

Cool ! 100% correct for digits

Accuracy reaches 1 at iteration 800, then drops to 0.92 -> 0.96 and returns to 1.

May I try on words?

Info about the spoken_words dataset

I'd like to have more information about the dataset: what it contains exactly, and what's the licence of use. Thanks.

Could you please give an idea of how to generate our own dataset with our own voices? Also, what does the number at the end of each file name represent? E.g. in the image 0_Karen_160.png, what is 160?

Does not work "spoken numbers" example

Hi @pannous ,

I am happy to find an audio classification example like yours, but I see that the code needs updating because it has some problems.

For now I am trying the "training spoken numbers" example, and I ran into the following doubts/problems:

In the file "numbers_solver.prototxt" you are using net: "numbers_net.autoencoder.prototxt". "numbers_net.autoencoder.prototxt" defines training and testing list files ("train_index_256x256.txt", "test_index_256x256.txt"), but those files do not exist. I worked around this by setting net: "numbers_net.prototxt" in "numbers_solver.prototxt". After that I could start creating the caffe model.
When I tried to run the backend server with "recognition-server.py", I got this:
... net = caffe.Net(model, weights)
Traceback (most recent call last):
File "", line 2, in
Boost.Python.ArgumentError: Python argument types in
Net.__init__(Net, str, str)
did not match C++ signature:
__init__(boost::python::api::object, std::string, std::string, int)
__init__(boost::python::api::object, std::string, int)
It is also unclear why some code uses the original 512x512 image size while other code reduces it to 256x256. I used the original images to create the model, but "recognition-server.py" and "record.py" transform the image.
I would also like to get the original audio files of "spoken numbers", and to know how you converted from wav to png.

I will be happy to get an answer from you. I really like your audio classification example; I just think it needs an update.

Thanks!

Problem setting up a layer: Check failed ... layer inputs must have the same count.

Running the given train.sh, I get the following failure. What can I do?
(not using GPU )
.......
I0730 17:25:47.946137 5825 net.cpp:120] Setting up loss
F0730 17:25:47.949908 5825 sigmoid_cross_entropy_loss_layer.cpp:26] Check failed: bottom[0]->count() == bottom[1]->count() (2097152 vs. 8388608) SIGMOID_CROSS_ENTROPY_LOSS layer inputs must have the same count.
*** Check failure stack trace: ***
@ 0x7ff780704daa (unknown)
@ 0x7ff780704ce4 (unknown)
@ 0x7ff7807046e6 (unknown)
@ 0x7ff780707687 (unknown)
@ 0x7ff780aeeb7d caffe::SigmoidCrossEntropyLossLayer<>::Reshape()
@ 0x7ff780b3e772 caffe::Net<>::Init()
@ 0x7ff780b404d2 caffe::Net<>::Net()
@ 0x7ff780b027d0 caffe::Solver<>::InitTrainNet()
@ 0x7ff780b03773 caffe::Solver<>::Init()
@ 0x7ff780b03946 caffe::Solver<>::Solver()
@ 0x40c920 caffe::GetSolver<>()
@ 0x406571 train()
@ 0x404ab1 main
@ 0x7ff77fc16ec5 (unknown)
@ 0x40505d (unknown)
@ (nil) (unknown)
./train.sh: line 1: 5825 Aborted (core dumped) caffe train -solver numbers_solver.prototxt
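
The two counts in the failed check differ by exactly a factor of four, which is what one would expect if one blob holds 256x256 images and the other 512x512 images. A small arithmetic sketch of that reading follows. The batch size of 32 and the single channel are assumptions chosen because they reproduce the logged numbers; other factorisations (for example 8 x 1 x 512 x 512 for the smaller count) are equally possible, so treat this as a hint and not a diagnosis:

```python
def blob_count(num, channels, height, width):
    """Number of elements in a Caffe blob of shape (num, channels, height, width)."""
    return num * channels * height * width

BOTTOM0_COUNT = 2097152  # first number in the failed check
BOTTOM1_COUNT = 8388608  # second number in the failed check

ratio = BOTTOM1_COUNT // BOTTOM0_COUNT
print("ratio:", ratio)

# Assumed (not logged) batch size and channel count.
ASSUMED_BATCH, ASSUMED_CHANNELS = 32, 1
print(blob_count(ASSUMED_BATCH, ASSUMED_CHANNELS, 256, 256) == BOTTOM0_COUNT)
print(blob_count(ASSUMED_BATCH, ASSUMED_CHANNELS, 512, 512) == BOTTOM1_COUNT)
```

If this reading is right, the net definition and the index files disagree about image size, so both inputs of the loss layer need to be brought to the same resolution.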

Check failed: ShapeEquals(proto) shape mismatch (reshape not set) ..?

Hey pannous, I managed to train the model overnight and am now trying to run test-speech-recognition.ipynb.

For some reason I'm getting the following:

F0722 12:16:15.248998 2071900928 blob.cpp:455] Check failed: ShapeEquals(proto) shape mismatch (reshape not set)
*** Check failure stack trace: ***

Have you come across this before? Or know what might be causing that error in the script?

Also I had to change the following line of code from:
net = caffe.Net(model,weights)

to:
net = caffe.Net(model,weights,caffe.TEST)

Thanks.
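
The change described above matches the C++ signatures printed in the Boost.Python error from the "spoken numbers" issue, which expect a trailing int for the phase. A minimal sketch of a loader that always passes the phase explicitly. The function name load_test_net is hypothetical, and the caffe module is passed in as an argument so the helper can be exercised without a Caffe build:

```python
def load_test_net(caffe_module, model, weights):
    """Build a net in TEST phase, passing the phase as the third argument."""
    return caffe_module.Net(model, weights, caffe_module.TEST)

# Usage with a real pycaffe install (file names are placeholders):
#   import caffe
#   net = load_test_net(caffe, "deploy.prototxt", "model.caffemodel")
```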

broken link http://pannous.net/spoken_words.tar

Hi, there is a broken link

http://pannous.net/spoken_words.tar

Train speech on Librispeech dataset

The README lists training speech as a TODO, with a note that the TIMIT dataset costs $27k. A free alternative is the 1,000-hour Librispeech corpus. [1]

[1] http://www.danielpovey.com/files/2015_icassp_librispeech.pdf

Any idea on how to deploy it ?

Say I have the trained model and would like to run a test case, maybe just on test_index.txt?

files not found - train examples

I downloaded the file from the URL in the README.md file,

but it doesn't contain the expected files such as 'train_index.txt'.

E1217 14:37:13.242982 1953342208 io.cpp:77] Could not open or find file /data/spoken_numbers/3_Princess_220.wav.png
F1217 14:37:13.243538 1953342208 image_data_layer.cpp:59] Check failed: ReadImageToDatum(lines_[lines_id_].first, lines_[lines_id_].second, new_height, new_width, &datum)
*** Check failure stack trace: ***
@ 0x110140a6a google::LogMessage::Fail()
@ 0x11013fc88 google::LogMessage::SendToLog()
@ 0x1101406fa google::LogMessage::Flush()
@ 0x1101441e8 google::LogMessageFatal::~LogMessageFatal()
@ 0x110140f05 google::LogMessageFatal::~LogMessageFatal()
@ 0x10b66e467 caffe::ImageDataLayer<>::DataLayerSetUp()
@ 0x10b65373e caffe::BaseDataLayer<>::LayerSetUp()
@ 0x10b654e7e caffe::BasePrefetchingDataLayer<>::LayerSetUp()
@ 0x10b68defe caffe::Net<>::Init()
@ 0x10b68d18b caffe::Net<>::Net()
@ 0x10b6a4038 caffe::Solver<>::InitTrainNet()
@ 0x10b6a3a9f caffe::Solver<>::Init()
@ 0x10b6a394c caffe::Solver<>::Solver()
@ 0x10b5fb9f0 caffe::GetSolver<>()
@ 0x10b5f93ce train()
@ 0x10b5fb401 main
@ 0x7fff874d05c9 start
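
The log shows the data layer aborting on the first image path it cannot open. A minimal sketch of a pre-flight check that lists every missing image before training starts, assuming the index uses one "path label" entry per line. The function name missing_images is hypothetical:

```python
import os

def missing_images(index_path):
    """Return the image paths from a 'path label' index that do not exist on disk."""
    missing = []
    with open(index_path) as index:
        for line in index:
            line = line.strip()
            if not line:
                continue
            # The label is the last whitespace-separated field; the rest is the path.
            path = line.rsplit(None, 1)[0]
            if not os.path.exists(path):
                missing.append(path)
    return missing
```

Running it against the index file named in the net definition shows whether the paths need a different prefix (the log above looks under /data/spoken_numbers/) or whether the files are absent from the download.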

File not found: numbers_iter.solverstate ...?

Hey pannous, I feel like I'm finally getting close to getting this to work, but the train.sh script is failing at the very end because it can't find the numbers_iter.solverstate file. I can't find a reference to it anywhere on my computer or online. Would you be able to describe what I'm doing wrong? Thanks.

Check failed: fd != -1 (-1 vs. -1) File not found: numbers_iter.solverstate
*** Check failure stack trace: ***
@ 0x10efb4448 google::LogMessage::Fail()
@ 0x10efb3bf4 google::LogMessage::SendToLog()
@ 0x10efb40a2 google::LogMessage::Flush()
@ 0x10efb74ed google::LogMessageFatal::~LogMessageFatal()
@ 0x10efb4753 google::LogMessageFatal::~LogMessageFatal()
@ 0x10a2af648 caffe::ReadProtoFromBinaryFile()
@ 0x10a29b1b5 caffe::Solver<>::Restore()
@ 0x10a29afa6 caffe::Solver<>::Solve()
@ 0x10a17f59d train()
@ 0x10a18181f main
@ 0x7fff985f95c9 start
@ 0x7 (unknown)
./train.sh: line 1: 1760 Abort trap: 6 ../../caffe/build/tools/caffe train -solver numbers_solver.prototxt -gpu 0 --snapshot=numbers_iter.solverstate
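
The last line of the log shows train.sh passing --snapshot=numbers_iter.solverstate, which asks Caffe to resume from a saved solver state; on a first run no such file exists. A minimal sketch that builds the same command but adds the snapshot flag only when the file is present. The function name train_command and its parameters are hypothetical, and the flags are the ones visible in the log:

```python
import os

def train_command(solver, snapshot=None, gpu=None, caffe_bin="caffe"):
    """Build a 'caffe train' argument list, resuming only if the snapshot exists."""
    cmd = [caffe_bin, "train", "-solver", solver]
    if gpu is not None:
        cmd += ["-gpu", str(gpu)]
    if snapshot and os.path.exists(snapshot):
        cmd.append("--snapshot=" + snapshot)
    return cmd

print(" ".join(train_command("numbers_solver.prototxt",
                             snapshot="numbers_iter.solverstate", gpu=0)))
```

Caffe normally names its snapshots after the configured prefix and the iteration number, so a file called exactly numbers_iter.solverstate probably has to be created by copying or renaming one of those.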

How to generate those image?

I need to go home now. If you can point me to the file, I can help you write some docs.
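
The thread does not say how the published images were produced. As a generic illustration only, a spectrogram is the magnitude of a short-time Fourier transform: the signal is cut into overlapping frames, each frame is windowed, and its spectrum is computed. The sketch below uses only the standard library and a naive DFT; the frame size, hop and the synthetic 1000 Hz test tone are assumptions for the example, and reading a WAV file or writing a PNG is left out:

```python
import cmath
import math

def spectrogram(samples, frame_size=64, hop=32):
    """Return one magnitude spectrum per frame (Hann window, naive DFT)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1))
              for n in range(frame_size)]
    bins = frame_size // 2 + 1  # non-negative frequencies only
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        spectrum = []
        for k in range(bins):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames

SAMPLE_RATE = 8000  # assumed sample rate for the synthetic tone
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(512)]
spec = spectrogram(tone)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
print("peak frequency (Hz):", peak_bin * SAMPLE_RATE / 64)
```

In practice a library FFT would replace the inner loop, and the magnitudes would be scaled (often logarithmically) before being saved as pixel values.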

ValueError: could not broadcast input array from shape (256,256,3) into shape (512,512,3)

Hi pannous, I like your effort in putting caffe speech recognition examples here!

I managed to run the training and got the model. I ran python/classify.py to get the output but got the following error:
ValueError: could not broadcast input array from shape (256,256,3) into shape (512,512,3)
(have tried with the ipynb and the broadcasting error occurs too)

Any idea if this is an issue with the training parameters (e.g. input images have to be 256x256) or is this an issue with how I run the classify.py?

Many thanks in advance!
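
The message itself says that a 256x256x3 array was being copied into a 512x512x3 buffer, so the image fed in and the size the receiving code expects disagree by a factor of two per side. A minimal sketch of a check that reports such a mismatch before the copy is attempted. The function name describe_mismatch is hypothetical, and whether the right fix is to train on 256x256 images or to keep 512x512 everywhere is not stated in the thread:

```python
def describe_mismatch(image_shape, expected_shape):
    """Return None if the shapes agree, else a short description of the difference."""
    if tuple(image_shape) == tuple(expected_shape):
        return None
    scale = [float(e) / i for i, e in zip(image_shape[:2], expected_shape[:2])]
    return "image is %s but %s is expected (scale height x%g, width x%g)" % (
        tuple(image_shape), tuple(expected_shape), scale[0], scale[1])

print(describe_mismatch((256, 256, 3), (512, 512, 3)))
print(describe_mismatch((512, 512, 3), (512, 512, 3)))
```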