
speech_emotion_recognition_blstm's Introduction

Speech_emotion_recognition_BLSTM

Bidirectional LSTM network for speech emotion recognition.

Environment:

  • Python 2.7/3.6
  • NVIDIA GeForce GTX 1060 6GB
  • Conda version 4.5

Dependencies

Datasets

Usage

  • The function "stFeatureSpeed" in pyAudioAnalysis does not work out of the box, so you have to modify the code in audioFeatureExtraction.py: for the index-related issues, cast the offending values to integers; for the issue in stHarmonic, cast M to an integer (M = int(M)); and comment out the invocation of mfccInitFilterBanks in stFeatureSpeed. A sketch of these edits follows the option table below.
  • If you run the code in Python 3, please upgrade pyAudioAnalysis to the latest version that is compatible with Python 3.
  • You have to prepare at least two different sets of data: one for finding the best model and the other for testing.
Long option             Option   Description
--dataset               -d       dataset type
--dataset_path          -p       path of the dataset, or of the data to predict
--load_data             -l       load the dataset and dump the data stream to a .p file
--feature_extract       -e       extract features from the data and dump them to a .p file
--model_path            -m       path of the model you want to load
--nb_classes            -c       number of classes in your data
--speaker_indipendence  -s       use different actors for the train and test sets in cross validation
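
Below is a minimal sketch of the pyAudioAnalysis edits described in the first bullet above. These are fragments to change inside audioFeatureExtraction.py; the exact locations depend on your installed version:

# Inside stHarmonic: M is computed as a float but used as an index,
# so cast it to an integer.
M = int(M)

# Inside stFeatureSpeed: comment out the invocation of
# mfccInitFilterBanks, as described above.
# [fbank, freqs] = mfccInitFilterBanks(Fs, nfft, lowfreq, linsc,
#                                      logsc, nlinfil, nlogfil)

# For the remaining index-related TypeErrors (see the Issues below),
# cast the offending values to int before slicing, for example:
x = signal[int(cur_p):int(cur_p + window)]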

Example find_best_model.py:

python find_best_model.py -d "berlin" -p [berlin data path] -l -e -c 7
  • The first time you run the script, the -l and -e options are mandatory, since you need to load the data and extract the features.
  • Every time you change the training data and/or the feature-engineering method, specify -l and/or -e again so the .p files are regenerated.
  • You can also modify the code to tune other hyperparameters.

Example prediction.py:

python prediction.py -p [data path] -m [model path] -c 7

Example model_cross_validation.py:

python model_cross_validation.py -d "berlin" -p [berlin data path] -l -e -c 7
  • Use -s for speaker-independent k-fold cross validation, i.e. different actors in the train and test sets.

Experimental result

  • Hyperas is used to tune the optimizer, batch_size and epochs; the remaining hyperparameters take the values used in the paper referenced below. A sketch of the search space follows this list.
  • The average accuracy is about 68.60% (+/- 1.88%) with 10-fold cross validation on the Berlin dataset.
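
A minimal, self-contained sketch of how hyperas tunes these three values. The network, data shapes, and candidate value lists here are illustrative placeholders, not the repo's actual configuration, which lives in find_best_model.py:

from hyperas import optim
from hyperas.distributions import choice
from hyperopt import Trials, STATUS_OK, tpe


def data():
    # Hypothetical stand-in for the pickled features this repo dumps
    # with -l/-e: 100 utterances, 50 frames, 39-dim features, 7 classes.
    import numpy as np
    from keras.utils import to_categorical
    x = np.random.rand(100, 50, 39)
    y = to_categorical(np.random.randint(0, 7, 100), 7)
    return x[:80], y[:80], x[80:], y[80:]


def create_model(x_train, y_train, x_test, y_test):
    from keras.models import Sequential
    from keras.layers import Bidirectional, LSTM, Dense
    model = Sequential()
    model.add(Bidirectional(LSTM(64), input_shape=(50, 39)))
    model.add(Dense(7, activation='softmax'))
    # The double-brace templates are hyperas placeholders for the search space.
    model.compile(loss='categorical_crossentropy',
                  optimizer={{choice(['rmsprop', 'adam', 'sgd'])}},
                  metrics=['accuracy'])
    model.fit(x_train, y_train,
              batch_size={{choice([32, 64, 128])}},
              epochs={{choice([25, 50])}},
              validation_data=(x_test, y_test),
              verbose=0)
    _, acc = model.evaluate(x_test, y_test, verbose=0)
    return {'loss': -acc, 'status': STATUS_OK, 'model': model}


if __name__ == '__main__':
    best_run, best_model = optim.minimize(model=create_model, data=data,
                                          algo=tpe.suggest, max_evals=5,
                                          trials=Trials())
    print(best_run)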

References

  • S. Mirsamadi, E. Barsoum, and C. Zhang, “Automatic speech emotion recognition using recurrent neural networks with local attention,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, U.S.A., Mar. 2017, IEEE, pp. 2227–2231.

  • F. Tao and G. Liu, “Advanced LSTM: A Study about Better Time Dependency Modeling in Emotion Recognition,” submitted to 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

  • Video from Microsoft Research

Future work

  • The training data listed above (Berlin) may be insufficient: the validation accuracy and loss cannot be improved, and the training results are not good either.
  • Given sufficient training examples, the parameters of short-term characterization, long-term aggregation, and the attention model can be jointly optimized for best performance.
  • Update the current network architecture to improve the accuracy (already in progress).

speech_emotion_recognition_blstm's People

Contributors

  • rayanwang

speech_emotion_recognition_blstm's Issues

TypeError: 'float' object cannot be interpreted as an integer

Please help me solve this problem (environment: Python 3.5):

Traceback (most recent call last):
File "find_best_model.py", line 167, in
extract_dataset(ds.data, nb_samples=len(ds.targets), dataset=dataset)
File "E:\TensorFlow\GitHub\Speech_emotion_recognition_BLSTM-master\utility\audio.py", line 75, in extract_dataset
hr_pitch = ShortTermFeatures.speed_feature(x, Fs, globalvars.frame_size * Fs, globalvars.step * Fs)
File "C:\Users\asus\Anaconda3\envs\tensorflow\lib\site-packages\pyAudioAnalysis\ShortTermFeatures.py", line 473, in speed_feature
logsc, nlinfil, nlogfil)
File "C:\Users\asus\Anaconda3\envs\tensorflow\lib\site-packages\pyAudioAnalysis\ShortTermFeatures.py", line 199, in mfcc_filter_banks
fbank = np.zeros((num_filt_total, num_fft))
TypeError: 'float' object cannot be interpreted as an integer

Thanks!
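
A fix consistent with the Usage notes above (a sketch, not an official patch): in pyAudioAnalysis's ShortTermFeatures.py, which already imports numpy as np, cast the float arguments to int at the allocation the traceback points to:

# mfcc_filter_banks, line flagged in the traceback: np.zeros requires
# integer dimensions, but both values arrive as floats.
fbank = np.zeros((int(num_filt_total), int(num_fft)))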

Which method is this repo based on?

Hi RayanWang, I have gone through your repo. First of all, it's good work; thanks for sharing.
But I want to know which method you use. I see you have listed two references here, but which one is this repo based on?
I want more information to understand the algorithm better.
If you can tell me which one you used, that would be great.
Thanks.

Error about mfccInitFilterBanks()

Hello RayanWang, I read your code and want to run it, but I hit the error "TypeError: mfccInitFilterBanks() takes exactly 2 arguments (7 given)". I then deleted five of the arguments, but a new error appeared: "TypeError: 'float' object cannot be interpreted as an index". Can you tell me how to modify the code in audioFeatureExtraction.py? Should I just delete the stFeatureSpeed code?

accuracy

Sir, I applied this code to the IEMOCAP dataset and got 54.5 after running find_best_model.py, but I didn't get the 63.3 accuracy mentioned in the paper. Can you tell me how much you got, and why I got less? Can you also please explain the use of model_cross_validation.py?

Program just stops without any message

Hi,
For some reason, when I try to run the training of the model, the program starts and then just exits without any message. What could be the issue?

I'm using Python 3.6, so I did comment out the invocation of method 'mfccInitFilterBanks' in stFeatureSpeed and did cast M to integer.

Any suggestions?

Met an issue in Python 3

File "F:/123/find_best_model.py", line 160, in
ds = pickle.load(open(dataset + '_db.p', 'rb'))

FileNotFoundError: [Errno 2] No such file or directory: 'berlin_db.p'
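
Per the Usage section above, berlin_db.p is only created by a run that includes the -l option (and the feature .p file by -e), so the first run must load the data and extract the features:

python find_best_model.py -d "berlin" -p [berlin data path] -l -e -c 7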

A question about dataset.py

Could you please explain why you use this line of code?
"for speak_test in itertools.product(males, females): # test_couples:"
Shouldn't you just use one for loop that goes over all the audio files once?
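
For context, a sketch of what that line enumerates. The males and females lists here are hypothetical actor IDs; the real ones come from the dataset code:

import itertools

# Hypothetical actor IDs split by gender.
males = ['03', '10', '11']
females = ['08', '09', '13']

# itertools.product yields every (male, female) pairing; each pair is a
# candidate held-out test couple, so the files are iterated once per
# speaker-independent split rather than once overall.
for speak_test in itertools.product(males, females):
    print(speak_test)  # ('03', '08'), ('03', '09'), ..., ('11', '13')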

error about ipykernel

When I try to run the code, I get the following error:
Usage: ipykernel_launcher.py [options]

ipykernel_launcher.py: error: no such option: -f

An exception has occurred, use %tb to see the full traceback.

SystemExit: 2

Please respond. Thank you in advance.

TypeError: slice indices must be integers or None or have an __index__ method

Hi Rayan! Sorry to disturb you; I have an issue:

python find_best_model.py -d "berlin" -p E:\TensorFlow\Emo-DB\wav -l -e -c 7
Using TensorFlow backend.
Writing berlin data set to file...
Traceback (most recent call last):
File "find_best_model.py", line 167, in
extract_dataset(ds.data, nb_samples=len(ds.targets), dataset=dataset)
File "E:\TensorFlow\GitHub\Speech_emotion_recognition_BLSTM-master\utility\audio.py", line 75, in extract_dataset
hr_pitch = ShortTermFeatures.speed_feature(x, Fs, globalvars.frame_size * Fs, globalvars.step * Fs)
File "C:\Users\asus\Anaconda3\envs\tensorflow\lib\site-packages\pyAudioAnalysis\ShortTermFeatures.py", line 485, in speed_feature
x = signal[cur_p:cur_p + window]
TypeError: slice indices must be integers or None or have an __index__ method

Please help me with this! Thanks very much!

Other languages

Hi,

Does Speech_emotion_recognition_BLSTM work with other languages or only in German?

How do you deal with the variable length of the audio?

Hi RayanWang,

I have gone through your code these days; thank you so much for sharing, it is really nice work.

But I still have a question: can you tell me which part of your code deals with the length of the audio data? I also work on the Berlin dataset, but the audio clips all have different lengths. I used a padding method, but my results were not as good as yours.

I am looking forward to getting your reply.

Chason
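
For reference, a sketch of the padding approach mentioned above (whether this repo pads, truncates, or does something else is not stated in this thread; shapes and lengths are hypothetical):

import numpy as np
from keras.preprocessing.sequence import pad_sequences

# Hypothetical example: three utterances with 50, 80 and 65 frames of
# 39-dimensional features.
features = [np.random.rand(n, 39) for n in (50, 80, 65)]

# Zero-pad (or truncate) along the time axis to a common length so the
# batch can be fed to an LSTM.
padded = pad_sequences(features, maxlen=80, dtype='float32',
                       padding='post', truncating='post')
print(padded.shape)  # (3, 80, 39)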

Test Accuracy

Hi RayanWang,
I have started model training with find_best_model.py and achieved the validation accuracies below.
With the training set to 200 epochs, the highest validation accuracy was 0.5652 (early stopping at epoch 35).

With the training set to 100 epochs, the highest validation accuracy was 0.3354 (early stopping at epoch 19).
I want to increase the test accuracy further (to at least 70). What additional steps and modifications should I follow?
Thank you.

A statement I don't quite understand

Hi Rayan:
Sorry to ask for your help again, and thanks very much for sharing your code; it helps me a lot. While reading find_best_model.py I came across a statement I don't quite understand: in the create_model function, what does "globalvars.globalVar += 1" do? Please help explain this statement. Thanks very much!

Speech_emotion_recognition_BLSTM error

Loading data and features...
Number of samples: 535
Traceback (most recent call last):
File "/home/lwin/speech-emotion/Speech_emotion_recognition_BLSTM-master1/find_best_model.py", line 171, in
trials=trials)
File "/usr/local/lib/python3.5/dist-packages/hyperas/optim.py", line 67, in minimize
verbose=verbose)
File "/usr/local/lib/python3.5/dist-packages/hyperas/optim.py", line 115, in base_minimizer
space=get_space(),
File "./temp_model.py", line 203, in get_space
NameError: name 'sgd' is not defined
Please help solve this!
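
One plausible cause, offered as a guess since the modified source is not shown in this thread: in hyperas templates the optimizer names must be string literals, otherwise the generated get_space() evaluates them as undefined Python names:

# optimizer={{choice(['rmsprop', 'adam', sgd])}}    # NameError: name 'sgd' is not defined
# optimizer={{choice(['rmsprop', 'adam', 'sgd'])}}  # quoted strings: works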

about paper

Does this experiment have a corresponding paper?

Real Time Application

Thanks for sharing this awesome repo! I wonder whether the current model could handle real-time prediction on videos. Could you briefly outline how to do that?

Thanks in advance!

TypeError: mfccInitFilterBanks() takes 2 positional arguments but 7 were given

/usr/anaconda3/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
/usr/anaconda3/lib/python3.6/site-packages/h5py/__init__.py:34: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
from ._conv import register_converters as _register_converters
Using TensorFlow backend.
/usr/anaconda3/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
/usr/anaconda3/lib/python3.6/importlib/_bootstrap.py:219: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
/home/shakey/.local/lib/python3.6/site-packages/pydub/utils.py:165: RuntimeWarning: Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work
warn("Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work", RuntimeWarning)
Writing berlin data set to file...
Traceback (most recent call last):
File "find_best_model.py", line 167, in
extract_dataset(ds.data, nb_samples=len(ds.targets), dataset=dataset)
File "/home/shakey/speech_emotion_recongtion/Speech_emotion_recognition_BLSTM/utility/audio.py", line 73, in extract_dataset
hr_pitch = audioFeatureExtraction.stFeatureSpeed(x, Fs, globalvars.frame_size * Fs, globalvars.step * Fs)
File "/home/shakey/.local/lib/python3.6/site-packages/pyAudioAnalysis/audioFeatureExtraction.py", line 685, in stFeatureSpeed
[fbank, freqs] = mfccInitFilterBanks(fs, nfft, lowfreq, linsc, logsc, nlinfil, nlogfil)
TypeError: mfccInitFilterBanks() takes 2 positional arguments but 7 were given

How to modify the code in audioFeatureExtraction.py to fix this error

=========================================================
Writing berlin data set to file...
Traceback (most recent call last):
File "/home/lwin/speech-emotion/Speech_emotion_recognition_BLSTM-master/find_best_model.py", line 163, in
functions.feature_extract(ds.data, nb_samples=len(ds.targets), dataset=dataset)
File "/home/lwin/speech-emotion/Speech_emotion_recognition_BLSTM-master/utility/functions.py", line 20, in feature_extract
hr_pitch = audioFeatureExtraction.stFeatureSpeed(x, Fs, globalvars.frame_size * Fs, globalvars.step * Fs)
File "/usr/local/lib/python3.5/dist-packages/pyAudioAnalysis/audioFeatureExtraction.py", line 669, in stFeatureSpeed
[fbank, freqs] = mfccInitFilterBanks(Fs, nfft, lowfreq, linsc, logsc, nlinfil, nlogfil)
TypeError: mfccInitFilterBanks() takes 2 positional arguments but 7 were given

=================================================================
How can I solve this issue? Please help me. Thank you.

Data cannot be loaded

I downloaded the Berlin data but couldn't find the .p file. Is the training data format .wav?
