The CSV file contains these fields: SrNo., Utterance, Speaker, Emotion, Sentiment, Dialogue_ID, Utterance_ID, Season, Episode, StartTime, EndTime
For our task we consider only Utterance, Dialogue_ID, and Utterance_ID.
Using Dialogue_ID and Utterance_ID, we can construct the name of the video (mp4) file corresponding to each utterance.
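As a rough sketch of this lookup, the snippet below reads the three relevant columns and builds a clip name per utterance. The naming pattern `dia<Dialogue_ID>_utt<Utterance_ID>` and the file extension are assumptions, as are the helper names `clip_filename` and `utterance_clips` and the sample rows; adjust them to match the actual files.

```python
import csv
import io

# Hypothetical sample rows; the real CSV has more columns than these three.
SAMPLE_CSV = """Utterance,Dialogue_ID,Utterance_ID
"Hello there",0,0
"How are you?",0,1
"""


def clip_filename(dialogue_id, utterance_id, ext="mp4"):
    # Assumed naming pattern: dia<Dialogue_ID>_utt<Utterance_ID>.<ext>
    return f"dia{int(dialogue_id)}_utt{int(utterance_id)}.{ext}"


def utterance_clips(csv_text):
    # Pair each utterance text with the clip file name derived from its IDs.
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        (row["Utterance"], clip_filename(row["Dialogue_ID"], row["Utterance_ID"]))
        for row in reader
    ]


clips = utterance_clips(SAMPLE_CSV)
```

For a file on disk, the same `csv.DictReader` can be given an open file handle instead of the in-memory string.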
We assume the same audio recording properties as the SEMAINE database, i.e., audio recorded at 48 kHz with 24 bits per sample.
We take this into account when extracting audio features from the mp4 file.
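One possible way to apply these properties is to pull the audio track out of each mp4 with ffmpeg before feature extraction. This is only a sketch: it assumes ffmpeg is installed, and the helper name `build_ffmpeg_command` and the file names are hypothetical. The function only builds the argument list and does not run anything.

```python
def build_ffmpeg_command(src, dst, sample_rate=48000, codec="pcm_s24le"):
    # -vn drops the video stream, -ar sets the sample rate in Hz,
    # and pcm_s24le writes 24-bit little-endian PCM (suitable for a WAV file).
    return [
        "ffmpeg", "-i", src,
        "-vn",
        "-ar", str(sample_rate),
        "-acodec", codec,
        dst,
    ]


# Hypothetical input and output names, for illustration only.
cmd = build_ffmpeg_command("dia0_utt0.mp4", "dia0_utt0.wav")
# To actually run it: subprocess.run(cmd, check=True)
```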
- Speech to Text: Speech Recognition Model
We extract audio features (frequency and amplitude at each time step) from the input mp4 file and use an existing model (transfer learning) to predict the utterance. Once the model reaches good accuracy, we proceed to the next step.
- Text to Image: We use the COCO dataset for this task.
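The audio feature extraction in the Speech to Text step (frequency and amplitude per time step) can be illustrated with a minimal sketch. The notes do not say which extractor is used, so this example swaps in two simple stand-ins: RMS amplitude and a crude zero-crossing frequency estimate per frame. The function name `frame_features`, the frame length, and the synthetic test tone are all assumptions; a real pipeline would more likely feed spectrogram-style features to the pretrained model.

```python
import math


def frame_features(samples, sample_rate, frame_len=1024):
    """Return (start_time_s, rms_amplitude, zero_cross_freq_hz) per frame."""
    features = []
    # Non-overlapping frames; a trailing partial frame is ignored.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        # Count sign changes between neighbouring samples.
        crossings = sum(
            1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)
        )
        # A pure tone crosses zero twice per period, so halve the crossing rate.
        freq = crossings * sample_rate / (2 * (frame_len - 1))
        features.append((start / sample_rate, rms, freq))
    return features


# Synthetic 440 Hz tone, 0.1 s long, sampled at the assumed 48 kHz.
SAMPLE_RATE = 48000
tone = [
    0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
    for n in range(SAMPLE_RATE // 10)
]
feats = frame_features(tone, SAMPLE_RATE)
```

The zero-crossing estimate is only meaningful for clean, roughly single-tone signals; for real speech it serves as a rough indicator rather than a pitch measurement.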