

keunwoochoi commented on July 21, 2024

How are they different? How big is the dataset? If you're in a rush, I'd try https://github.com/keunwoochoi/transfer_learning_music with ALL layer features concatenated + PCA or some fancy feature selection + SVM/random forest/etc. (as suggested in the paper).
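A minimal sketch of that pipeline with scikit-learn, using random stand-in data in place of the concatenated convnet features (the shapes, label count, and hyperparameters here are illustrative assumptions, not values from the thread):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(100, 160)          # stand-in for concatenated per-layer features
y = rng.randint(0, 5, size=100)  # 5 language labels (it, es, fr, de, en)

# scale -> PCA for dimensionality reduction -> SVM, as suggested above
clf = make_pipeline(StandardScaler(), PCA(n_components=50), SVC(kernel='rbf'))
clf.fit(X, y)
pred = clf.predict(X)
```

Swapping `SVC` for `RandomForestClassifier` covers the other suggestion with the same pipeline.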

I'd assume they differ somewhat in their sound, beyond the language itself, since the language of a song usually correlates with other cultural aspects, which in turn correlate with the sound. If they are almost identical except for the language, it sounds like quite a challenging task ;)

from music-auto_tagging-keras.

lvaleriu commented on July 21, 2024

5 languages: it, es, fr, de, en
500 mp3 files (44100 Hz, 30 sec duration) per language - 2500 files in total
5+ different genres (but mainly vocal songs - of course, all contain some instrumental parts that I should maybe filter out)


lvaleriu commented on July 21, 2024

Did you test extracting features from short durations when training on audio datasets? For example, MFCC features of 1, 2, or 5 seconds in length.


keunwoochoi commented on July 21, 2024

Yeah (also described in the paper), I repeat them to make 29 s signals so that the features don't get distorted by many zeros.
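The repetition trick (tiling a short clip instead of zero-padding it) can be sketched in a few lines of numpy; the sample rate below is an assumed placeholder, not a value from this thread:

```python
import numpy as np

SR = 12000            # assumed sample rate for illustration
TARGET = 29 * SR      # the 29 s signal length the network expects

def repeat_to_length(signal, target_len=TARGET):
    """Tile a short clip until it reaches target_len, instead of zero-padding."""
    n_rep = int(np.ceil(target_len / len(signal)))
    return np.tile(signal, n_rep)[:target_len]

clip = np.random.randn(2 * SR)   # a 2-second excerpt
out = repeat_to_length(clip)     # 29 s of the clip repeated end-to-end
```

Unlike zero-padding, every frame of the result has the same statistics as the original clip, so pooled features aren't diluted by silence.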


lvaleriu commented on July 21, 2024

Oh, I understand better now.


lvaleriu commented on July 21, 2024

One more question: I trained on the GTZAN music-vs-speech dataset with a shallower network (more like a LeNet for audio, with MFCC features) and got a pretty good score. But when I use this network to predict on other songs, I sometimes get good results and sometimes very bad ones. For example, I have many recordings of people talking (no music at all) where the prediction is bad: a score of 0.4, averaged over the song duration since I train on 1, 2, and 5 s features, when 0 should mean speech and 1 music.

Did you encounter such situations when using transfer learning: good accuracy on a dataset, but poor results when applying the classifier in a "real-world" setting? Do the layer features (low- or high-level) still perform well outside the dataset?
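The per-song averaging described above amounts to something like the following (the window scores are made-up illustrative values):

```python
import numpy as np

# Hypothetical per-window P(music) scores over one file,
# one score per 1/2/5 s analysis window.
window_probs = np.array([0.9, 0.85, 0.2, 0.95, 0.8])

song_score = window_probs.mean()   # the "mean over the song duration"
label = 'music' if song_score > 0.5 else 'speech'
```

One weakness of plain averaging is that a few confident outlier windows can drag a speech file's score toward 0.4-0.5; a median or a majority vote over window decisions is a common, slightly more robust alternative.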


keunwoochoi commented on July 21, 2024

Hm, I tested it on 6 different datasets in https://github.com/keunwoochoi/transfer_learning_music and most of them (if not all) are real songs. So yes, I think they do.


lvaleriu commented on July 21, 2024

I'm trying to train on the GTZAN speech-vs-music dataset using the method described in transfer_learning_music.

Can I use a simple network like the following and expect good results? I'm asking because I've reached a validation accuracy above 0.95, but when I predict on other songs I always get predicted_y = [0, 1] for any song. How far should the validation loss decrease? Right now it is smaller than 0.1.

from keras.models import Sequential
from keras.layers import InputLayer, Flatten, Dense, Dropout
from keras.optimizers import Adam

model = Sequential()
model.add(InputLayer(input_shape=(1, 160)))
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(2, activation='softmax'))  # one-hot: speech vs music
model.compile(loss='categorical_crossentropy', metrics=['accuracy'],
              optimizer=Adam(lr=1e-4))

x.shape =(103, 1, 160)
y.shape =(103, 2)

vx.shape =(26, 1, 160)
vy.shape =(26, 2)

Layer (type)                     Output Shape          Param #     Connected to                     
input_1 (InputLayer)             (None, 1, 160)        0                                            
____________________________________________________________________________________________________
flatten_2 (Flatten)              (None, 160)           0           input_1[0][0]                    
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 256)           41216       flatten_2[0][0]                  
____________________________________________________________________________________________________
dropout_1 (Dropout)              (None, 256)           0           dense_1[0][0]                    
____________________________________________________________________________________________________
dense_2 (Dense)                  (None, 2)             514         dropout_1[0][0]                  

Total params: 41,730
Trainable params: 41,730
Non-trainable params: 0

Thanks again for your valuable advice.



lvaleriu commented on July 21, 2024

Here is a screenshot of the training process:

[screenshot: training curves]


keunwoochoi commented on July 21, 2024

"I have always the value predicted_y = [0, 1] for any song"

What does that mean?


keunwoochoi commented on July 21, 2024

The graphs seem alright overall. But GTZAN speech/music is the least interesting and most trivial of the 6 tasks in the paper. I mean, no one uses deep learning for 129 data samples.

But I also think it should work fine on random songs. If you're really into this problem, a t-SNE of all the training/validation/test features plus the out-of-dataset songs might tell you something.
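A rough sketch of that t-SNE check with scikit-learn, using random stand-in arrays for the features (the 160-dim shape matches the features quoted elsewhere in this thread; the counts are illustrative):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.RandomState(0)
train_feats = rng.randn(103, 160)   # stand-in for training features
outside = rng.randn(5, 160)         # stand-in for out-of-dataset songs
all_feats = np.vstack([train_feats, outside])

# Embed everything jointly into 2-D for visual inspection
emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(all_feats)
# emb[:103] are in-dataset points, emb[103:] the outside songs;
# scatter-plot both and see whether the outside songs fall inside the clusters.
```

If the out-of-dataset points land far from both class clusters, the features themselves (not the classifier) are failing to generalize.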


lvaleriu commented on July 21, 2024

"I have always the value predicted_y = [0, 1] for any song
What does it mean?" -> A silly mistake: I was rounding the results when displaying them. Sorry about that.

But the problem I still have is that when predicting on 2 audio files (one instrumental, the other containing only speech - someone reading a story for children, no background music) I still obtain something like below. The 2 pictures shouldn't look the same.
(To obtain this graph I extract sliding windows of 29 seconds with a step of 3000 samples and generate a feature for each window. In the end I have a batch of features per file, so I obtain a batch of predictions, one per sliding window. The prediction is categorical: [0, 1] for music, [1, 0] for speech - or the opposite.)
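The windowing scheme described in that parenthesis can be sketched as follows (the sample rate is an assumed placeholder; the 29 s window and 3000-sample hop are from the description above):

```python
import numpy as np

SR = 12000          # assumed sample rate for illustration
WIN = 29 * SR       # 29-second window
HOP = 3000          # step of 3000 samples

def sliding_windows(signal, win=WIN, hop=HOP):
    """Yield successive win-sample windows, advancing hop samples each time."""
    for start in range(0, len(signal) - win + 1, hop):
        yield signal[start:start + win]

audio = np.random.randn(35 * SR)          # stand-in for a 35-second file
windows = list(sliding_windows(audio))
# one feature per window -> a batch of features -> a batch of predictions
```

Note that with a 3000-sample hop, consecutive 29 s windows overlap almost entirely, so neighboring predictions are expected to be highly correlated.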

[prediction plots for the two files]


lvaleriu commented on July 21, 2024

I have 2 tasks:
1. I need to separate the music parts from the speech parts of audio files. So I thought it would be a good idea to use this dataset, since it is already there. Otherwise, I have my own manually selected dataset with much more data.

2. I need to classify the language of songs containing voice, as I mentioned at the beginning of our discussion. I have a fairly large dataset that was chosen manually, and I can still augment it if needed. But before training on it, I thought I might filter out the non-singing parts of the songs (like the instrumental parts). So training on a dataset like Jamendo could help (at least in my mind).

I hope I made it clearer.


keunwoochoi commented on July 21, 2024

Okay.

  1. Could you try some classifier other than a 1-hidden-layer neural network? e.g. an SVM.
  2. That's a well-known task, and you can check out papers that cite the Jamendo dataset. As you see, my network's input is much longer than the decision resolution you'd like to have, so it's not well suited.
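For suggestion 1, a minimal SVM baseline with scikit-learn might look like this (random stand-in features; the shapes are borrowed from the earlier comment in this thread):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X_train = rng.randn(103, 160)           # stand-in for 160-dim training features
y_train = rng.randint(0, 2, size=103)   # 0 = speech, 1 = music (assumed coding)
X_val = rng.randn(26, 160)              # stand-in for validation features

svm = SVC(kernel='linear', C=1.0, probability=True)
svm.fit(X_train, y_train)
val_probs = svm.predict_proba(X_val)    # per-sample class probabilities
```

With only ~100 samples an SVM is a much better-matched model capacity than a neural network, which makes it a useful sanity check against overfitting.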

