Classification of ISL using Pose and Object Detection based Techniques

According to the 2018-19 Annual Report of the Indian Sign Language Research and Training Centre (ISLRTC), there are only 325 certified sign language interpreters in India. This is far fewer than required to serve the 50,71,007 hearing-impaired and 19,98,535 speech-impaired persons in India (Census 2011).

During the pandemic, many organizations required students and employees to work from home; this arrangement also saves organizations money on various expenditures and is more time-efficient. The change has affected deaf and hard-of-hearing students and job candidates negatively. Hiring any candidate remotely requires a good internet connection and a quiet environment at home, but for hard-of-hearing candidates, additional requirements such as the availability of an interpreter become a further hurdle. Hard-of-hearing students face similar issues in online lectures.

Thus, we present two approaches for the classification of Indian Sign Language:

  • The pose-based approach uses an LSTM model that takes the skeletal pose landmarks extracted by MediaPipe over a sequence of frames as input to infer and predict the action.
  • The object detection-based approach uses a model built on the Scaled-YOLOv4 architecture, which performs frame-by-frame inference.

Pose-based Estimation using LSTM

In this initial approach, Sign-to-Text translation is performed by acquiring the pose skeleton of the signer from the input video, which is achieved with the help of the Holistic model from the MediaPipe Python library. This serves as a preprocessing step: the extracted landmarks are passed to the LSTM model, which then provides the meaning of the sign in the form of a classification result.
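
A minimal sketch of this preprocessing step is given below. The feature layout (pose plus both hands, omitting face landmarks) and the input file name are illustrative assumptions rather than details taken from the experiments; the MediaPipe Holistic API itself provides the landmark sets shown.

    import cv2
    import mediapipe as mp
    import numpy as np

    mp_holistic = mp.solutions.holistic

    def extract_landmarks(results):
        # 33 pose landmarks with (x, y, z, visibility) -> 132 values
        pose = (np.array([[p.x, p.y, p.z, p.visibility]
                          for p in results.pose_landmarks.landmark]).flatten()
                if results.pose_landmarks else np.zeros(33 * 4))
        # 21 landmarks with (x, y, z) per hand -> 63 values each
        lh = (np.array([[p.x, p.y, p.z]
                        for p in results.left_hand_landmarks.landmark]).flatten()
              if results.left_hand_landmarks else np.zeros(21 * 3))
        rh = (np.array([[p.x, p.y, p.z]
                        for p in results.right_hand_landmarks.landmark]).flatten()
              if results.right_hand_landmarks else np.zeros(21 * 3))
        return np.concatenate([pose, lh, rh])

    cap = cv2.VideoCapture("sign_clip.mp4")  # hypothetical input file
    sequence = []
    with mp_holistic.Holistic(min_detection_confidence=0.5,
                              min_tracking_confidence=0.5) as holistic:
        while cap.isOpened():
            ok, frame = cap.read()
            if not ok:
                break
            # MediaPipe expects RGB input; OpenCV reads frames as BGR
            results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            sequence.append(extract_landmarks(results))
    cap.release()  # `sequence` now holds one feature vector per frame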


The input takes the form of a video or a stream of images. Once the landmarks have been extracted by the MediaPipe Holistic model, the resulting sequence of frames is given to the LSTM model, which infers the appropriate text for the given action.
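
The classifier itself can be sketched as follows. The layer sizes, sequence length, and feature dimension are illustrative assumptions; the class list matches the eight signs reported in the tables below.

    import numpy as np
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import LSTM, Dense

    SEQ_LEN = 30      # frames per clip (assumed)
    N_FEATURES = 258  # 33*4 pose + 2*21*3 hand values (assumed layout)
    CLASSES = ["Hello", "I", "Thank You", "Deaf",
               "Work", "Study", "Good", "Bad"]

    model = Sequential([
        LSTM(64, return_sequences=True, input_shape=(SEQ_LEN, N_FEATURES)),
        LSTM(128),
        Dense(64, activation="relu"),
        Dense(len(CLASSES), activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["categorical_accuracy"])

    # Inference on one preprocessed clip (shape: SEQ_LEN x N_FEATURES)
    clip = np.random.rand(SEQ_LEN, N_FEATURES)  # stand-in for real landmarks
    probs = model.predict(clip[np.newaxis, ...])[0]
    print(CLASSES[int(np.argmax(probs))])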


Classification Report for LSTM

    Class          Precision   Recall   F1-score   Support
    Hello          1.00        1.00     1.00       17
    I              1.00        0.91     0.95       11
    Thank You      0.95        1.00     0.97       18
    Deaf           0.94        1.00     0.97       15
    Work           1.00        0.93     0.97       15
    Study          1.00        0.95     0.97       20
    Good           0.96        0.98     0.97       16
    Bad            1.00        1.00     0.95       16
    Accuracy                            0.98       128
    Macro avg      0.98        0.98     0.98       128
    Weighted avg   0.98        0.97     0.97       128

Confusion Matrix for LSTM

    Class        True Positive   False Positive   True Negative   False Negative
    Hello        17              0                111             0
    I            10              0                117             1
    Thank You    18              1                109             0
    Deaf         15              1                112             0
    Work         14              0                113             1
    Study        20              0                108             0
    Good         16              0                112             0
    Bad          15              0                112             1
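
These per-class counts are one-vs-rest tallies derived from the multiclass confusion matrix over the 128 test clips. A small sketch of that derivation using scikit-learn is shown below; the label arrays here are hypothetical stand-ins for the actual test data.

    import numpy as np
    from sklearn.metrics import confusion_matrix

    # Hypothetical stand-ins for the true and predicted class indices
    y_true = np.array([0, 0, 1, 2, 2, 1])
    y_pred = np.array([0, 0, 1, 2, 1, 1])

    cm = confusion_matrix(y_true, y_pred)
    tp = np.diag(cm)                # correctly detected instances per class
    fp = cm.sum(axis=0) - tp        # other classes predicted as this class
    fn = cm.sum(axis=1) - tp        # instances of this class that were missed
    tn = cm.sum() - (tp + fp + fn)  # everything else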

Accuracy Graph for LSTM


Object detection-based Classification using Scaled YOLO-v4

The initial approach, i.e., the skeleton-based approach, has a set of drawbacks when it comes to real-time deployment in a dynamic environment. The performance of the LSTM model developed for the pose-based approach degraded rapidly as the number of classes increased. With a clear need for a newer approach, the idea of treating the hand-signs as objects was brought up, and Scaled-YOLOv4 was utilized for the detection of these hand-sign objects.


The input takes the form of a video or images. The input video frames are given to the YOLO model once each frame has been converted to grayscale; output is then generated for each frame in the form of bounding-box coordinates and a class label.
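
A minimal frame-by-frame inference loop in this style is sketched below using OpenCV's DNN module rather than the original training repository; the weight and config file names, input size, and thresholds are placeholder assumptions. The grayscale frame is replicated back to three channels so the network still receives its expected input shape.

    import cv2

    # Placeholder artifact names for the trained Scaled-YOLOv4 sign model
    net = cv2.dnn.readNet("signs-scaled-yolov4.weights",
                          "signs-scaled-yolov4.cfg")
    model = cv2.dnn_DetectionModel(net)
    model.setInputParams(size=(416, 416), scale=1 / 255.0, swapRB=True)

    cap = cv2.VideoCapture("sign_clip.mp4")  # hypothetical input file
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        # Grayscale preprocessing as described above, replicated to 3 channels
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        frame3 = cv2.cvtColor(gray, cv2.COLOR_GRAY2BGR)
        class_ids, scores, boxes = model.detect(frame3, confThreshold=0.4,
                                                nmsThreshold=0.5)
        for cid, score, (x, y, w, h) in zip(class_ids, scores, boxes):
            print(f"class {cid}: {score:.2f} at x={x} y={y} w={w} h={h}")
    cap.release()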


Mean Average Precision (0.5:0.95) Graph for Scaled-YOLOv4


Precision for Scaled-YOLOv4


Recall for Scaled-YOLOv4