caffe-speech-recognition's Introduction

Speech Recognition with BVLC caffe

Speech Recognition with the caffe deep learning framework

UPDATE: We are migrating to tensorflow

This project is quite new, and only the first of three milestones has been reached. Even so, it may already be useful if you just want to train a handful of commands or options (1, 2, 3, ..., yes/no/cancel/...).

  1. training spoken numbers:
  • get spectrogram training images from http://pannous.net/spoken_numbers.tar (470 MB)
  • start ./train.sh
  • test with ipython notebook test-speech-recognition.ipynb or caffe test ... or <caffe-root>/python/classify.py
  • 99% accuracy, nice!
  • online recognition and learning with ./recognition-server.py and ./record.py scripts

Sample spectrogram, That's what she said, too laid?

Sample spectrogram, Karen uttering 'zero' with 160 words per minute.

  2. training words:
  • 4GB of training data
  • net topology: work in progress ...
  • todo: use upcoming new caffe LSTM layers etc
  • UPDATE: LSTM layers are making progress upstream but are still not merged
  • UPDATE: since the caffe project leaders have a restrictive merging policy and the LSTM pull request was postponed many times without ever being merged, we are migrating to tensorflow
  • todo: add extra categories for a) silence b) common noises like typing, achoo c) ALL other noises
  3. training speech:

Theoretical background: papers

A. Graves and N. Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In ICML, 2014

O. Vinyals, S. V. Ravuri, and D. Povey. Revisiting recurrent neural networks for robust ASR. In ICASSP, 2012

Andrew Ng et al / Baidu

Hinton et al / Toronto

good old Hinton

Schmidhuber et al using new 'ClockWork-RNNs'

The book: Automatic Speech Recognition: A Deep Learning Approach (Signals and Communication Technology), hardcover, November 11, 2014, by Dong Yu and Li Deng

Related work

Also see the Kaldi project, which seems a bit messy but already uses deep learning with LSTM.

Another experimental LSTM network, which works out-of-the-box: Currennt

caffe-speech-recognition's People

Contributors

pannous

caffe-speech-recognition's Issues

"spoken numbers" example does not work

Hi @pannous ,

I am happy to find an audio classification example like yours. However, I think the code needs an update because it has some problems.

For now I am trying to use the "training spoken numbers" example, and I ran into the following doubts/problems:

  1. The file "numbers_solver.prototxt" uses net: "numbers_net.autoencoder.prototxt". That net defines training and testing list files ("train_index_256x256.txt", "test_index_256x256.txt"), but those files do not exist. I worked around this by setting net: "numbers_net.prototxt" in "numbers_solver.prototxt". After that step I could start creating the caffe model.

  2. When I tried to run the backend server with "recognition-server.py", I got this:
    ... net = caffe.Net(model, weights)
    Traceback (most recent call last):
    File "", line 2, in
    Boost.Python.ArgumentError: Python argument types in
    Net.__init__(Net, str, str)
    did not match C++ signature:
    __init__(boost::python::api::object, std::string, std::string, int)
    __init__(boost::python::api::object, std::string, int)

  3. It is not clear which image size is intended: some of the code uses the original 512x512 images, while other code reduces them to 256x256. I used the original images to create the model, but "recognition-server.py" and "record.py" transform the image.

  4. I would also like to get the original audio files of "spoken numbers", and I want to know how you converted the wav files to png.

I would be happy to get an answer from you. I really like your audio classification example; I just think it needs an update.

Thanks!
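The source does not say how the wav files were turned into spectrogram images, so the following is only a dependency-free sketch of the general idea (a short-time Fourier transform whose magnitudes are scaled to grey values), not the author's conversion. The function names, the frame size and the hop size are all assumptions chosen for illustration; a real pipeline would more likely use numpy or scipy and write an actual PNG.

```python
import cmath
import math


def spectrogram(samples, frame_size=64, hop=32):
    """Return one row of DFT magnitudes (bins 0..frame_size//2) per frame.

    frame_size and hop are illustrative assumptions, not values from the project.
    """
    # Hann window to reduce spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1))
              for n in range(frame_size)]
    bins = frame_size // 2 + 1
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        chunk = [samples[start + n] * window[n] for n in range(frame_size)]
        row = []
        for k in range(bins):
            # Naive DFT: slow, but keeps the sketch free of third-party imports.
            acc = sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            row.append(abs(acc))
        frames.append(row)
    return frames


def to_pixels(frames):
    """Scale magnitudes linearly to grey values in 0..255."""
    peak = max(max(row) for row in frames) or 1.0
    return [[int(round(255 * value / peak)) for value in row] for row in frames]
```

Each row of the result corresponds to one time slice and each column to one frequency bin, which is the layout a spectrogram image encodes.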

Problem setting up a layer: Check failed ... layer inputs must have the same count.

Running the given train.sh, I get the following failure. What can I do?
(not using GPU)
.......
I0730 17:25:47.946137 5825 net.cpp:120] Setting up loss
F0730 17:25:47.949908 5825 sigmoid_cross_entropy_loss_layer.cpp:26] Check failed: bottom[0]->count() == bottom[1]->count() (2097152 vs. 8388608) SIGMOID_CROSS_ENTROPY_LOSS layer inputs must have the same count.
*** Check failure stack trace: ***
@ 0x7ff780704daa (unknown)
@ 0x7ff780704ce4 (unknown)
@ 0x7ff7807046e6 (unknown)
@ 0x7ff780707687 (unknown)
@ 0x7ff780aeeb7d caffe::SigmoidCrossEntropyLossLayer<>::Reshape()
@ 0x7ff780b3e772 caffe::Net<>::Init()
@ 0x7ff780b404d2 caffe::Net<>::Net()
@ 0x7ff780b027d0 caffe::Solver<>::InitTrainNet()
@ 0x7ff780b03773 caffe::Solver<>::Init()
@ 0x7ff780b03946 caffe::Solver<>::Solver()
@ 0x40c920 caffe::GetSolver<>()
@ 0x406571 train()
@ 0x404ab1 main
@ 0x7ff77fc16ec5 (unknown)
@ 0x40505d (unknown)
@ (nil) (unknown)
./train.sh: line 1: 5825 Aborted (core dumped) caffe train -solver numbers_solver.prototxt
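The two counts in the check failure differ by exactly a factor of four, which is what one would expect if one blob holds 256x256 images and the other 512x512 images. The arithmetic below is only a consistency check on the numbers in the log; the single-channel layout and the resulting batch size are assumptions, since the prototxt files are not shown here.

```python
def elements_per_image(side, channels=1):
    """Number of values in one square image (channels=1 is an assumption)."""
    return side * side * channels


count_a = 2097152  # bottom[0]->count() from the log
count_b = 8388608  # bottom[1]->count() from the log

ratio = count_b // count_a

# Same ratio as between a 512x512 and a 256x256 image.
size_ratio = elements_per_image(512) // elements_per_image(256)

# If the images are single-channel, both blobs imply the same batch size.
batch_a = count_a // elements_per_image(256)
batch_b = count_b // elements_per_image(512)

print(ratio, size_ratio, batch_a, batch_b)
```

If that reading is right, the two inputs of the loss layer are fed images of different resolution, and making both sides use the same image size should remove the mismatch.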

Check failed: ShapeEquals(proto) shape mismatch (reshape not set) ..?

Hey pannous, I managed to train the model overnight and am now trying to run test-speech-recognition.ipynb.

For some reason I'm getting the following:

F0722 12:16:15.248998 2071900928 blob.cpp:455] Check failed: ShapeEquals(proto) shape mismatch (reshape not set)
*** Check failure stack trace: ***

Have you come across this before? Or know what might be causing that error in the script?

Also I had to change the following line of code from:
net = caffe.Net(model,weights)

to:
net = caffe.Net(model,weights,caffe.TEST)

Thanks.
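The change described above (passing caffe.TEST as a third argument) can be wrapped so that a script works with either constructor form. This is a hedged sketch: the helper name load_net is hypothetical, the caffe module is passed in as a parameter so the function can be tried without pycaffe installed, and the fallback assumes that the binding's argument error is a subclass of TypeError.

```python
def load_net(caffe, model, weights):
    """Build a test-phase net, trying the three-argument constructor first.

    `caffe` is the imported pycaffe module; `model` and `weights` are paths.
    """
    try:
        # Form reported to work in the issue above.
        return caffe.Net(model, weights, caffe.TEST)
    except TypeError:
        # Assumption: older bindings reject the phase argument with a
        # TypeError-derived error and accept the two-argument form.
        return caffe.Net(model, weights)
```

Catching TypeError this broadly can hide unrelated argument mistakes, so in a real script it may be preferable to pin one pycaffe version and use a single form.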

files not found - train examples

I downloaded the file from the URL in your README.md, but it doesn't contain the expected files such as 'train_index.txt'.

E1217 14:37:13.242982 1953342208 io.cpp:77] Could not open or find file /data/spoken_numbers/3_Princess_220.wav.png
F1217 14:37:13.243538 1953342208 image_data_layer.cpp:59] Check failed: ReadImageToDatum(lines_[lines_id_].first, lines_[lines_id_].second, new_height, new_width, &datum)
*** Check failure stack trace: ***
@ 0x110140a6a google::LogMessage::Fail()
@ 0x11013fc88 google::LogMessage::SendToLog()
@ 0x1101406fa google::LogMessage::Flush()
@ 0x1101441e8 google::LogMessageFatal::~LogMessageFatal()
@ 0x110140f05 google::LogMessageFatal::~LogMessageFatal()
@ 0x10b66e467 caffe::ImageDataLayer<>::DataLayerSetUp()
@ 0x10b65373e caffe::BaseDataLayer<>::LayerSetUp()
@ 0x10b654e7e caffe::BasePrefetchingDataLayer<>::LayerSetUp()
@ 0x10b68defe caffe::Net<>::Init()
@ 0x10b68d18b caffe::Net<>::Net()
@ 0x10b6a4038 caffe::Solver<>::InitTrainNet()
@ 0x10b6a3a9f caffe::Solver<>::Init()
@ 0x10b6a394c caffe::Solver<>::Solver()
@ 0x10b5fb9f0 caffe::GetSolver<>()
@ 0x10b5f93ce train()
@ 0x10b5fb401 main
@ 0x7fff874d05c9 start
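Caffe's ImageData layer reads a text file with one "path label" pair per line, so a missing index can be regenerated from the extracted images. The sketch below assumes that the label is the leading digit of the file name, as the path 3_Princess_220.wav.png in the log suggests; that naming rule, the function name build_index and its parameters are assumptions, not something the project documents.

```python
import os


def build_index(image_dir, index_path, suffix=".png"):
    """Write 'path label' lines for every image whose name starts with a digit label.

    Returns the number of lines written.
    """
    lines = []
    for name in sorted(os.listdir(image_dir)):
        if not name.endswith(suffix):
            continue
        # Assumed naming scheme: <label>_<speaker>_<rate>.wav.png
        label = name.split("_", 1)[0]
        if not label.isdigit():
            continue
        lines.append("%s %s" % (os.path.join(image_dir, name), label))
    with open(index_path, "w") as handle:
        handle.write("\n".join(lines) + ("\n" if lines else ""))
    return len(lines)
```

Because the generated paths are built from image_dir, pointing it at the directory where the tar file was actually extracted also avoids the "Could not open or find file /data/spoken_numbers/..." error shown in the log.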

File not found: numbers_iter.solverstate ...?

Hey pannous, I feel like I'm finally getting close to getting this to work, but the train.sh script is failing at the very end because it can't find the numbers_iter.solverstate file. I can't find a reference to it anywhere on my computer or online. Would you be able to describe what I'm doing wrong? Thanks.

Check failed: fd != -1 (-1 vs. -1) File not found: numbers_iter.solverstate
*** Check failure stack trace: ***
@ 0x10efb4448 google::LogMessage::Fail()
@ 0x10efb3bf4 google::LogMessage::SendToLog()
@ 0x10efb40a2 google::LogMessage::Flush()
@ 0x10efb74ed google::LogMessageFatal::~LogMessageFatal()
@ 0x10efb4753 google::LogMessageFatal::~LogMessageFatal()
@ 0x10a2af648 caffe::ReadProtoFromBinaryFile()
@ 0x10a29b1b5 caffe::Solver<>::Restore()
@ 0x10a29afa6 caffe::Solver<>::Solve()
@ 0x10a17f59d train()
@ 0x10a18181f main
@ 0x7fff985f95c9 start
@ 0x7 (unknown)
./train.sh: line 1: 1760 Abort trap: 6 ../../caffe/build/tools/caffe train -solver numbers_solver.prototxt -gpu 0 --snapshot=numbers_iter.solverstate
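The command in the last log line passes --snapshot=numbers_iter.solverstate, which asks caffe to resume from a saved solver state; on a first run no such file exists yet, which matches the "File not found" check. One possible workaround, sketched here with a hypothetical helper named build_train_command, is to add the snapshot flag only when the file is present. The flags themselves are the ones visible in the log; everything else is an assumption.

```python
import os


def build_train_command(solver, snapshot=None, caffe_bin="caffe", gpu=None):
    """Return the argument list for 'caffe train', resuming only if possible."""
    cmd = [caffe_bin, "train", "-solver", solver]
    if gpu is not None:
        cmd += ["-gpu", str(gpu)]
    # Resume only when a solver state from an earlier run actually exists.
    if snapshot and os.path.exists(snapshot):
        cmd.append("--snapshot=" + snapshot)
    return cmd
```

The returned list can be handed to subprocess.call; on a fresh checkout it simply starts training from scratch.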

ValueError: could not broadcast input array from shape (256,256,3) into shape (512,512,3)

Hi pannous, I like your effort in putting caffe speech recognition examples here!

I managed to run the training and got the model. I ran python/classify.py to get the output but got the following error:
ValueError: could not broadcast input array from shape (256,256,3) into shape (512,512,3)
(I have tried with the ipynb and the broadcasting error occurs there too)

Any idea whether this is an issue with the training parameters (e.g. input images have to be 256x256) or with how I run classify.py?

Many thanks in advance!
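The error message says that a 256x256 image is being copied into a 512x512 input, so the image size used at classification time differs from the size the net was built for. Which of the two sizes is the intended one is not clear from the source. As an illustration only, the sketch below resizes an image held as nested lists with nearest-neighbour sampling; the function name resize_nearest is hypothetical, and real code would normally use an image library instead.

```python
def resize_nearest(image, height, width):
    """Resize a row-major nested-list image by nearest-neighbour sampling."""
    src_h, src_w = len(image), len(image[0])
    return [
        [image[r * src_h // height][c * src_w // width] for c in range(width)]
        for r in range(height)
    ]
```

Resizing at classification time only hides the mismatch; using the same image size for training and classification is likely the cleaner fix.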
