NOTE: For lipsynced video results, scroll below
1. Specify the file paths, and make a new conda environment with Python = 3.6
2. Now Install the necessary Liabraries from requirements.txt, and make a media folder, inside of which add the video and audio
3. Simply Run "Python main.py" and your Final video will be ready
Objectives Achieved:
- Visual and Audio Quality Lip Sync: The project successfully lip-syncs videos with improved visual and audio quality, ensuring that the lip movements accurately match the spoken words.
- Robustness for Any Video: Unlike the original Wav2Lip model, the developed AI can handle videos with or without a face in each frame, making it more versatile and error-free.
- Support for Longer Videos: The model overcomes the limitations of the original Wav2Lip GAN model, now effectively lip-syncing longer videos exceeding 1 minute in duration.
- Particular Segments Can be extracted easily, unlike the original Model, any part with or without face can be extracted, with te desired audio combined.
Metrics: I haven't trained the model of any further dataset, so the Metrics is the same for the model
- Average Mean Squared Error = 5.050382572478908
- Average Peak Signal to Noise Ratio = 40.32044758489997
Challenges:
1. Since, my aim was to extract any particulkar segment, I had to be very concious around timestamps
2. Wav2Lip also doesn't have a mechanism to make a distinction between the target speaker and other faces that appears in the video.
3. The Model does not perform good for high resolution videos
4. Long runtimes for longer videos.
Results include LipSync Videos for:
- Hindi Voice-Over on English Video.
- Long Videos with some No-face or Other than Target speaker face with a lot of head and hands movement & Telugu to Hindi Translation Voice-Over synced
results can be reproduced using the colab notebook or can be accessed at this google drive for reference: https://drive.google.com/file/d/1zjxMi1p3S9SL9UuatoC-2RgWbepyZWUY/view?usp=sharing
NOTE:
- Instead of rendering the whole video at once, my approach breaks it into small pieces, which makes the process faster.
- I had a different approach of this, rather than just skipping the frames, with no face(Recommende), I wrote the code for extracting the videos with faces and without faces, and individually rendering them. Through this, I didn't made any changes int the internal filing of the model.
- All my work is done in main.py