
Sign language recognition

Hi! This repository is a blog about a sign language recognition project I did. The final program was able to identify 100 sign language words in real-time. Here are two videos demonstrating the final program.

tall.woman.decide.change.pink.shirt.mov
black.bird.eat.apple.before.brown.cow.mov

Pretty cool, don't you think? As I mentioned, this repository is structured as a blog and includes 15 posts detailing my approach, the issues I encountered, and my thoughts along the way.

Project description

American Sign Language (ASL) is commonly used by the deaf community in North America. The language is entirely visual and involves making complex gestures with the hands.

My goal for this project was to create a sign language interpretation program that could recognize American sign language letters and words. I wanted to use various deep learning models and methods such as convolutional and recurrent networks. I also wanted to practice with libraries such as TensorFlow, Keras and OpenCV.

The project is separated into multiple phases.

Phase 1 - ASL alphabet recognition


For this first phase, I had two main goals.

  1. Train a model that can classify still images of ASL letters.
  2. Run the model in real time using a live video from my webcam.

Blog posts 1 - 4 are related to Phase 1.

  1. Introduction
  2. Building a basic CNN model
  3. Testing the model
  4. Improving the model

The following code files contain the code for this phase:

"Code files/Phase 1 development files/sign_alphabet_train_model.py"
"Code files/Phase 1 development files/run_sign_language_alphabet_detector.py"

Here is a video of the final result of Phase 1.

Sign.language.recognition.final.mp4
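
The real-time half of the phase boils down to a capture-preprocess-predict loop. Here is a rough sketch using OpenCV; the model file name and the 64x64 grayscale preprocessing are assumptions carried over from the sketch above.

```python
# Sketch of a real-time webcam classification loop (model file name and
# preprocessing are assumptions, not the repository's actual values).
import cv2
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model("sign_alphabet_model.h5")  # hypothetical path
labels = [chr(c) for c in range(ord("A"), ord("Z") + 1)]

cap = cv2.VideoCapture(0)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # Preprocess the frame to match the assumed training input.
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    small = cv2.resize(gray, (64, 64)).astype("float32") / 255.0
    probs = model.predict(small[None, ..., None], verbose=0)[0]
    letter = labels[int(np.argmax(probs))]
    cv2.putText(frame, letter, (10, 40),
                cv2.FONT_HERSHEY_SIMPLEX, 1.2, (0, 255, 0), 2)
    cv2.imshow("ASL alphabet detector", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```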

Phase 1 was a success (mostly). I created a custom neural network that could classify images from my webcam and identify letters of the alphabet. The model accuracy wasn't perfect, but it was enough of a proof of concept to justify moving to phase 2.

Phase 2 - Word level recognition

In sign language, words are more complex than the alphabet. Here is a video of someone signing the word 'book'.

07068.mov

As you can see, there is movement involved. In sign language, most words cannot be identified from a single image. Phase 2 of the project had essentially the same goals as Phase 1, but using videos instead of still images.

For Phase 2, I had two main goals.

  1. Train a model that can classify video clips of ASL words.
  2. Run the model in real time using video from my webcam.

Blog posts 5 - 15 are related to Phase 2.

  5. Graduating from the alphabet to words
  6. Training a model for video identification
  7. Reorganizing and inspecting the dataset
  8. Training a model for video identification
  9. Setting up the real-time video classification
  10. Scaling up the model
  11. Switching to a pose estimation approach
  12. Increasing the holistic feature model vocabulary
  13. Refactoring to improve program speed
  14. Implementing the new holistic cropping approach
  15. Testing the final model

The code for these posts is available in 'Code files/Phase 2 development and test files'. However, these files are not well organized and may be difficult to follow. I recommend reading through the blog posts, which include the important sections of code along with accompanying explanations. The final program is available in 'Code files/ASL_word_detector_main.py'.

Several different methods were used in Phase 2. A generic CNN-based feature extractor was tested, as was the YOLOv5 object detection model. In the end, both of these approaches were abandoned in favour of the MediaPipe Holistic model, which tracks body landmarks. A custom cropping function then draws a bounding box around the points generated by the Holistic model. Below is an example of the holistic landmark tracking. The coordinates of these landmarks were used as features and passed to a custom classification model.

holistic.and.cropping.shown.mov
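
The full holistic pipeline is walked through in the blog posts, but the core idea, extracting landmark coordinates with MediaPipe Holistic and cropping to a box drawn around them, can be sketched roughly as below. The helper names and the margin value are hypothetical.

```python
# Sketch of landmark extraction with MediaPipe Holistic plus a bounding-box
# crop around the detected points (helper names and margin are illustrative).
import cv2
import numpy as np
import mediapipe as mp

holistic = mp.solutions.holistic.Holistic(min_detection_confidence=0.5,
                                          min_tracking_confidence=0.5)

def extract_landmarks(frame_bgr):
    """Return an (N, 2) array of normalized landmark coordinates for one frame."""
    results = holistic.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    points = []
    for group in (results.pose_landmarks,
                  results.left_hand_landmarks,
                  results.right_hand_landmarks):
        if group is not None:
            points.extend((lm.x, lm.y) for lm in group.landmark)
    return np.array(points, dtype=np.float32)

def crop_to_landmarks(frame_bgr, points, margin=0.05):
    """Crop the frame to a bounding box drawn around the landmarks."""
    if points.size == 0:
        return frame_bgr
    h, w = frame_bgr.shape[:2]
    x0, y0 = np.clip(points.min(axis=0) - margin, 0.0, 1.0)
    x1, y1 = np.clip(points.max(axis=0) + margin, 0.0, 1.0)
    return frame_bgr[int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)]
```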

The custom classification model used GRU layers followed by several dense layers and was trained to identify 100 different words. Below is an example of the program in action: the program recorded 2 seconds of video, then classified that 2-second clip. In later versions of the code (such as the examples at the top of the page), this interval was reduced to 1 second.

mother.want.son.study.but.son.decide.play.basketball.mov
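
The classifier itself isn't reproduced here, but a minimal sketch of a GRU-plus-dense model of this shape might look like the following. Only the 100-word vocabulary comes from the project; the sequence length, feature count, and layer sizes are assumptions.

```python
# Minimal sketch of the GRU-based word classifier. Only NUM_WORDS comes from
# the post; sequence length, feature count, and layer sizes are assumptions.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_WORDS = 100      # 100-word vocabulary, per the post
SEQ_LEN = 30         # assumption: landmark frames per recorded clip
NUM_FEATURES = 150   # assumption: flattened landmark coordinates per frame

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN, NUM_FEATURES)),
    layers.GRU(128, return_sequences=True),   # GRU layers over the sequence
    layers.GRU(64),
    layers.Dense(64, activation="relu"),      # followed by dense layers
    layers.Dense(NUM_WORDS, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# At inference time, landmark features for one recorded clip are buffered into
# a (1, SEQ_LEN, NUM_FEATURES) array and classified in a single call:
clip = np.zeros((1, SEQ_LEN, NUM_FEATURES), dtype=np.float32)  # placeholder
word_index = int(np.argmax(model.predict(clip, verbose=0)))
```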

The final classification model reached a validation accuracy of 91.74%. When tested on myself performing all 100 sign language words, the model correctly identified 78 words on the first try and 94 of the words in 3 tries or fewer. Overall I am pleased with the model performance.

What I learned

The project provided an opportunity to practice with a variety of Python libraries and machine learning techniques.

In this project I worked with the TensorFlow, Keras, OpenCV, pandas, and NumPy libraries, among others. I used transfer learning for feature extraction as well as object detection with the YOLOv5 model. I identified problems with the model and dataset and came up with solutions to improve classification performance, including abandoning the aforementioned YOLOv5 model and refactoring the code so it could run in real time. The final program uses the MediaPipe Holistic model for feature extraction and a custom GRU neural network for classification.

Thank you for taking the time to check out my project!
