

keunwoochoi commented on July 21, 2024

How are they different? How big is the dataset? If you're in a rush, I'd try https://github.com/keunwoochoi/transfer_learning_music with ALL layer features concatenated + PCA or some fancy feature selection + SVM/random forest/etc. (as suggested in the paper).
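A minimal sketch of that pipeline with scikit-learn, using random stand-in data in place of the concatenated convnet features (the shapes, label count, and hyperparameters here are illustrative assumptions, not values from the thread):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(100, 160)          # stand-in for concatenated per-layer features
y = rng.randint(0, 5, size=100)  # 5 language labels (it, es, fr, de, en)

# scale -> PCA for dimensionality reduction -> SVM, as suggested above
clf = make_pipeline(StandardScaler(), PCA(n_components=50), SVC(kernel='rbf'))
clf.fit(X, y)
pred = clf.predict(X)
```

Swapping `SVC` for `RandomForestClassifier` covers the other suggestion with the same pipeline.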

I'd assume they differ somewhat in their sound, beyond the language itself, since the language of a song usually correlates with other cultural aspects, which in turn correlate with the sound. If they are almost identical except for the language, it sounds like quite a challenging task ;)

from music-auto_tagging-keras.

lvaleriu commented on July 21, 2024

5 languages: it, es, fr, de, en
500 mp3 files (44100 Hz, 30 sec duration) per language - 2500 files in total
5+ different genres (but mainly vocal songs - of course, all contain some instrumental parts that I should maybe filter out)


lvaleriu commented on July 21, 2024

Did you test extracting features from short durations when training on audio datasets? For example, MFCC features of 1, 2, or 5 seconds in length.


keunwoochoi commented on July 21, 2024

Yeah (also described in the paper), I repeat them to make 29 s signals so that the features don't get distorted by many zeros.
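The repetition trick (tiling a short clip instead of zero-padding it) can be sketched in a few lines of numpy; the sample rate below is an assumed placeholder, not a value from this thread:

```python
import numpy as np

SR = 12000            # assumed sample rate for illustration
TARGET = 29 * SR      # the 29 s signal length the network expects

def repeat_to_length(signal, target_len=TARGET):
    """Tile a short clip until it reaches target_len, instead of zero-padding."""
    n_rep = int(np.ceil(target_len / len(signal)))
    return np.tile(signal, n_rep)[:target_len]

clip = np.random.randn(2 * SR)   # a 2-second excerpt
out = repeat_to_length(clip)     # 29 s of the clip repeated end-to-end
```

Unlike zero-padding, every frame of the result has the same statistics as the original clip, so pooled features aren't diluted by silence.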


lvaleriu commented on July 21, 2024

Oh, I understand better now.


lvaleriu commented on July 21, 2024

One more question: I trained on the GTZAN music-vs-speech dataset with a shallower network (more like a LeNet for audio, with MFCC features) and got a pretty good score. But when I use this network to predict on other songs, I sometimes get good results and sometimes very bad ones. For example, I have many recordings of people talking (no music at all) where the prediction is bad: a score of 0.4, averaged over the song duration since I train on 1, 2, and 5 s features, when 0 should mean speech and 1 music.

Did you encounter such situations when using transfer learning: good accuracy on a dataset, but poor results when applying the classifier in a "real-world" setting? Do the layer features (low- or high-level) still perform well outside the dataset?
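The per-song averaging described above amounts to something like the following (the window scores are made-up illustrative values):

```python
import numpy as np

# Hypothetical per-window P(music) scores over one file,
# one score per 1/2/5 s analysis window.
window_probs = np.array([0.9, 0.85, 0.2, 0.95, 0.8])

song_score = window_probs.mean()   # the "mean over the song duration"
label = 'music' if song_score > 0.5 else 'speech'
```

One weakness of plain averaging is that a few confident outlier windows can drag a speech file's score toward 0.4-0.5; a median or a majority vote over window decisions is a common, slightly more robust alternative.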


keunwoochoi commented on July 21, 2024

Hm, I tested it on 6 different datasets in https://github.com/keunwoochoi/transfer_learning_music and most of them (if not all) are real songs. So yes, I think they do.


lvaleriu commented on July 21, 2024

I'm trying to train on the GTZAN speech-vs-music dataset using the method described in transfer_learning_music.

Can I use a simple network like the following and expect good results? I'm asking because I've reached a validation accuracy above 0.95, but when I predict on other songs I always get predicted_y = [0, 1] for any song. How far should the validation loss decrease? Right now it is smaller than 0.1.

from keras.models import Sequential
from keras.layers import InputLayer, Flatten, Dense, Dropout
from keras.optimizers import Adam

model = Sequential()
model.add(InputLayer(input_shape=(1, 160)))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))  # one-hot: speech vs music
model.compile(loss='categorical_crossentropy', metrics=['accuracy'],
              optimizer=Adam(lr=1e-4))

x.shape =(103, 1, 160)
y.shape =(103, 2)

vx.shape =(26, 1, 160)
vy.shape =(26, 2)

Layer (type)                     Output Shape          Param #     Connected to                     
input_1 (InputLayer)             (None, 1, 160)        0                                            
____________________________________________________________________________________________________
flatten_2 (Flatten)              (None, 160)           0           input_1[0][0]                    
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 256)           41216       flatten_2[0][0]                  
____________________________________________________________________________________________________
dropout_1 (Dropout)              (None, 256)           0           dense_1[0][0]                    
____________________________________________________________________________________________________
dense_2 (Dense)                  (None, 2)             514         dropout_1[0][0]                  

Total params: 41,730
Trainable params: 41,730
Non-trainable params: 0

Thanks again for your valuable advice.



lvaleriu commented on July 21, 2024

Here is a screenshot of the training process:

[screenshot: training curves]


keunwoochoi commented on July 21, 2024

"I have always the value predicted_y = [0, 1] for any song"

What does that mean?


keunwoochoi commented on July 21, 2024

The graphs seem alright overall. But GTZAN speech/music is the least interesting and most trivial of the 6 tasks in the paper. I mean, no one uses deep learning for 129 data samples.

But I also think it should work fine on random songs. If you're really into this problem, a t-SNE of all the training/validation/test features plus the out-of-dataset songs might tell you something.
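A rough sketch of that t-SNE check with scikit-learn, using random stand-in arrays for the features (the 160-dim shape matches the features quoted elsewhere in this thread; the counts are illustrative):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)
train_feats = rng.randn(103, 160)   # stand-in for training features
outside = rng.randn(5, 160)         # stand-in for out-of-dataset songs
all_feats = np.vstack([train_feats, outside])

# Embed everything jointly into 2-D for visual inspection
emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(all_feats)
# emb[:103] are in-dataset points, emb[103:] the outside songs;
# scatter-plot both and see whether the outside songs fall inside the clusters.
```

If the out-of-dataset points land far from both class clusters, the features themselves (not the classifier) are failing to generalize.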


lvaleriu commented on July 21, 2024

"I have always the value predicted_y = [0, 1] for any song
What does it mean?" -> A silly mistake: I was rounding the results when displaying them. Sorry about that.

But the problem I still have is that when predicting on 2 audio files (one instrumental, the other containing only speech - someone reading a story for children, no background music) I still obtain something like below. The 2 pictures shouldn't look the same.
(To obtain this graph I extract sliding windows of 29 seconds with a step of 3000 samples and generate a feature for each window. In the end I have a batch of features per file, so I obtain a batch of predictions, one per sliding window. The prediction is categorical: [0, 1] for music, [1, 0] for speech - or the opposite.)
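The windowing scheme described in that parenthesis can be sketched as follows (the sample rate is an assumed placeholder; the 29 s window and 3000-sample hop are from the description above):

```python
import numpy as np

SR = 12000          # assumed sample rate for illustration
WIN = 29 * SR       # 29-second window
HOP = 3000          # step of 3000 samples

def sliding_windows(signal, win=WIN, hop=HOP):
    """Yield successive win-sample windows, advancing hop samples each time."""
    for start in range(0, len(signal) - win + 1, hop):
        yield signal[start:start + win]

audio = np.random.randn(35 * SR)          # stand-in for a 35-second file
windows = list(sliding_windows(audio))
# one feature per window -> a batch of features -> a batch of predictions
```

Note that with a 3000-sample hop, consecutive 29 s windows overlap almost entirely, so neighboring predictions are expected to be highly correlated.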

[prediction plots for the two files]


lvaleriu commented on July 21, 2024

I have 2 tasks:
1. I need to separate the music parts from the speech parts of audio files. So I thought it would be a good idea to use this dataset, since it is already there. Otherwise, I have my own manually selected dataset with much more data.

2. I need to classify the language of songs containing voice, as I mentioned at the beginning of our discussion. I have a fairly large dataset that was chosen manually, and I can still augment it if needed. But before training on it, I thought I might filter out the non-singing parts of the songs (like the instrumental parts). So training on a dataset like Jamendo could help (at least in my mind).

I hope I made it clearer.


keunwoochoi commented on July 21, 2024

Okay.

  1. Could you try some classifier other than a 1-hidden-layer neural network? e.g. an SVM.
  2. That's a well-known task, and you can check out papers that cite the Jamendo dataset. As you see, my network's input is much longer than the decision resolution you'd like to have, so it's not well suited.
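For suggestion 1, a minimal SVM baseline with scikit-learn might look like this (random stand-in features; the shapes are borrowed from the earlier comment in this thread):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X_train = rng.randn(103, 160)           # stand-in for 160-dim training features
y_train = rng.randint(0, 2, size=103)   # 0 = speech, 1 = music (assumed coding)
X_val = rng.randn(26, 160)              # stand-in for validation features

svm = SVC(kernel='linear', C=1.0, probability=True)
svm.fit(X_train, y_train)
val_probs = svm.predict_proba(X_val)    # per-sample class probabilities
```

With only ~100 samples an SVM is a much better-matched model capacity than a neural network, which makes it a useful sanity check against overfitting.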

