# SignNet: Vision Transformer-Based Isolated Sign Language Recognition with A Novel Head & Hands Tunnelling Preprocessing Method

Ganzorig Batnasan, Hanan Aldarmaki, Munkh-Erdene Otgonbold, Timothy K. Shih, Fady Alnajjar, Qurban Ali Memon, Gantumur Tsogtgerel, Enkhbat Tsedenbaljir, Tan-Hsu Tan, and Munkhjargal Gochoo.

*Figure: SignNet flowchart.*

Abstract: *Sign Language Recognition (SLR) is a scene- and subject-invariant fine-grained video classification task in which information is conveyed mainly through hand gestures and facial expressions. However, these attributes are not well represented in general-purpose pre-trained Video Transformers (ViTs), because a) features in the regions of interest deteriorate when raw frames are downsized before being fed to the model, and b) general-purpose ViTs are not domain-specific to SLR. This study presents an SLR method, namely SignNet, comprising a) a ViT model pretrained on a large domain-specific SLR dataset and b) a novel preprocessing technique, termed H&HT, which highlights the critical head and hand regions in raw sign language videos. SignNet achieves state-of-the-art accuracy of 62.82% (Top-1, 3-crop) and 61.81% (Top-1, 1-crop) on the WLASL2000 benchmark, and further excels on the revised versions of the WLASL2000 and ASL-Citizen datasets. For instance, on the revised WLASL2000, our proposed H&HT preprocessing method elevates the vanilla ViT Top-1 accuracy from 79.14% to 82.41%.*
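This README does not spell out the H&HT implementation, but the core idea (zooming the input into the head and hand regions before the frames are downsized for the ViT) can be sketched roughly as follows. This is a minimal illustration, assuming MediaPipe Holistic as the landmark detector; the detector choice, padding ratio, and output size are all assumptions, not the paper's exact method.

```python
# Rough H&HT-style sketch: crop each frame to the union bounding box of
# head and hand landmarks. NOT the paper's implementation -- the landmark
# detector (MediaPipe Holistic), padding, and output size are assumptions.
import cv2
import mediapipe as mp

holistic = mp.solutions.holistic.Holistic(static_image_mode=False)

def hht_crop(frame_bgr, out_size=224, pad=0.15):
    """Crop a BGR frame to the padded head-and-hands bounding box."""
    h, w = frame_bgr.shape[:2]
    results = holistic.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    xs, ys = [], []
    for lms in (results.face_landmarks,
                results.left_hand_landmarks,
                results.right_hand_landmarks):
        if lms is not None:
            xs += [lm.x for lm in lms.landmark]  # normalized [0, 1] coords
            ys += [lm.y for lm in lms.landmark]
    if not xs:  # nothing detected: fall back to the full frame
        return cv2.resize(frame_bgr, (out_size, out_size))
    x0, x1 = max(min(xs) - pad, 0.0), min(max(xs) + pad, 1.0)
    y0, y1 = max(min(ys) - pad, 0.0), min(max(ys) + pad, 1.0)
    crop = frame_bgr[int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)]
    return cv2.resize(crop, (out_size, out_size))
```

Cropping before the resize keeps the informative head and hand pixels at a usable resolution once the frame reaches the ViT input size.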




## Performance and Checkpoints

| Dataset | P-I Top-1 | P-I Top-5 | P-C Top-1 | P-C Top-5 | Ckpt | Training |
|------------|-----------|-----------|-----------|-----------|--------|----------|
| WLASL-2000 | 62.82 | 92.88 | 60.01 | 92.2 | GDrive | script |
| WLASL-1000 | 75.16 | 95.79 | 75.27 | 95.85 | GDrive | script |
| WLASL-300 | 87.28 | 97.75 | 87.74 | 97.81 | GDrive | script |
| WLASL-100 | 90.31 | 97.67 | 90.25 | 97.58 | GDrive | script |

P-I: per-instance accuracy; P-C: per-class accuracy (both in %).

More checkpoints are available on GDrive.
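As a quick sanity check after downloading, a checkpoint can be inspected with plain PyTorch as sketched below. The file name is hypothetical, and the `"model"` key nesting is an assumption (common in VideoMAEv2-style checkpoints, on which this repo builds); the authoritative loading path is in the linked training scripts.

```python
# Hypothetical sketch for inspecting a downloaded SignNet checkpoint.
# "signnet_wlasl2000.pth" is a made-up file name; the "model" nesting is
# an assumption -- see the linked training scripts for the real flow.
import torch

ckpt = torch.load("signnet_wlasl2000.pth", map_location="cpu")
state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt

print(f"{len(state_dict)} tensors; first keys: {list(state_dict)[:3]}")
# model.load_state_dict(state_dict)  # with the model built by the training script
```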

## Usage

### Environment

We recommend using Docker:

```bash
docker pull ganzobtn/slr_vit
```

Alternatively, install the required packages with:

```bash
pip install -r requirements.txt
```

### Data Preparation

We use two datasets: WLASL and ASL-Citizen.

An alternative labelling of the WLASL dataset can be downloaded from here.

Our newly proposed WLASL benchmark labellings, with 1518 and 1573 classes, can be downloaded from here.
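For orientation, the upstream WLASL release distributes its labels as a single JSON file mapping each gloss (class) to its video instances; the sketch below tallies the split sizes. The file name (WLASL_v0.3.json) and field names follow the original WLASL repo and are assumptions with respect to the relabelled files linked above.

```python
# Sketch for reading WLASL-style labels; assumes the upstream
# WLASL_v0.3.json layout (one entry per gloss, each with "instances"
# carrying "video_id" and "split" fields) -- an assumption here.
import json
from collections import Counter

with open("WLASL_v0.3.json") as f:
    entries = json.load(f)

splits = Counter()
for entry in entries:            # one entry per gloss (class)
    for inst in entry["instances"]:
        splits[inst["split"]] += 1

print(len(entries), "glosses;", dict(splits))
```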

### VideoMAEv2 Pretrained Model

Download the VideoMAEv2 checkpoints from here.


## Training

Training instructions are provided in FINETUNE.md.

## Citation

## Contact

Should you have any questions, please open an issue in this repository or contact us at ganzobtn@gmail.com, Omunkhuush01@gmail.com, or mgochoo@uaeu.ac.ae.


## References

Our code is based on VideoMAEv2; we thank its authors for releasing their code.

