SignNet: Vision Transformer-Based Isolated Sign Language Recognition with a Novel Head & Hands Tunnelling Preprocessing Method
Ganzorig Batnasan, Hanan Aldarmaki, Munkh-Erdene Otgonbold, Timothy K. Shih, Fady Alnajjar, Qurban Ali Memon, Gantumur Tsogtgerel, Enkhbat Tsedenbaljir, Tan-Hsu Tan and Munkhjargal Gochoo
Abstract: *Sign Language Recognition (SLR) is a scene- and subject-invariant fine-grained video classification task in which information is conveyed mainly through hand gestures and facial expressions. However, these attributes are not well represented in general-purpose pre-trained Video Transformers (ViTs) because a) features in the regions of interest deteriorate when raw frames are downsized before being fed to the model, and b) general-purpose ViTs are not domain-specific to SLR. This study presents an SLR method, SignNet, comprising a) a ViT model pre-trained on a large domain-specific SLR dataset and b) a novel preprocessing technique, termed Head & Hands Tunnelling (H&HT), which highlights the critical head and hand regions in raw sign language videos. SignNet achieves state-of-the-art accuracy of 62.82% (Top-1, 3-crop) and 61.81% (Top-1, 1-crop) on the WLASL2000 benchmark, and further excels on the revised versions of the WLASL2000 and ASL-Citizen datasets. For instance, on the revised WLASL2000, the proposed H&HT preprocessing method raises the vanilla ViT Top-1 accuracy from 79.14% to 82.41%.*
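To make the H&HT idea concrete, below is a minimal sketch (not the authors' implementation) of the masking step the abstract describes: assuming per-frame head and hand bounding boxes are already available from an off-the-shelf pose or keypoint detector, everything outside those boxes is zeroed out, so the subsequent downsizing spends its resolution on the informative regions. All names here are hypothetical.

```python
import numpy as np

def head_hands_tunnel(frame, boxes):
    """Keep only the head/hand regions of one RGB frame of shape (H, W, 3).

    `boxes` holds (x1, y1, x2, y2) pixel rectangles for the head and
    each visible hand; pixels outside every box are set to zero.
    """
    h, w = frame.shape[:2]
    mask = np.zeros((h, w), dtype=bool)
    for x1, y1, x2, y2 in boxes:
        x1, y1 = max(0, x1), max(0, y1)
        x2, y2 = min(w, x2), min(h, y2)
        mask[y1:y2, x1:x2] = True
    out = frame.copy()
    out[~mask] = 0  # black out background, torso, etc.
    return out

# Example: one dummy 720p frame with a hypothetical head box and two hand boxes.
frame = np.random.randint(0, 256, (720, 1280, 3), dtype=np.uint8)
boxes = [(560, 40, 720, 220), (300, 400, 460, 560), (820, 380, 980, 540)]
masked = head_hands_tunnel(frame, boxes)
```

The actual H&HT pipeline operates on whole videos and may crop or re-compose the regions rather than simply masking them; this sketch only illustrates why the ViT sees sharper hand and face detail after resizing.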
| Dataset | P-I Top-1 | P-I Top-5 | P-C Top-1 | P-C Top-5 | Checkpoint | Training |
|---|---|---|---|---|---|---|
| WLASL-2000 | 62.82 | 92.88 | 60.01 | 92.20 | GDrive | script |
| WLASL-1000 | 75.16 | 95.79 | 75.27 | 95.85 | GDrive | script |
| WLASL-300 | 87.28 | 97.75 | 87.74 | 97.81 | GDrive | script |
| WLASL-100 | 90.31 | 97.67 | 90.25 | 97.58 | GDrive | script |
All accuracies are in % (P-I: per-instance, P-C: per-class). More checkpoints are available on GDrive.
We recommend using Docker:

```bash
docker pull ganzobtn/slr_vit
```
Alternatively, you can install the required packages with:

```bash
pip install -r requirements.txt
```

We use two datasets: WLASL and ASL-Citizen.
Alternative labelling of the WLASL dataset can be downloaded from here.
Our newly proposed WLASL benchmarks, with 1518-class and 1573-class labellings, can be downloaded from here.
VideoMAEv2 pre-trained model
Download the VideoMAEv2 checkpoints from here.
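To sanity-check that a downloaded checkpoint is readable, a snippet like the one below can help. This is only a minimal sketch, not the repository's loading code: the filename is hypothetical, and the `model`/`module` state-dict keys are assumptions based on common VideoMAE-style checkpoint layouts.

```python
import torch

# Hypothetical filename; substitute the checkpoint you actually downloaded.
ckpt = torch.load("vit_b_wlasl2000.pth", map_location="cpu")

# Checkpoints in this family often nest the weights under "model" or
# "module"; fall back to the raw object otherwise (an assumption, not
# the repo's documented format).
state_dict = ckpt
for key in ("model", "module"):
    if isinstance(ckpt, dict) and key in ckpt:
        state_dict = ckpt[key]
        break

print(f"{len(state_dict)} entries; first key: {next(iter(state_dict))}")
```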
Training instructions are in FINETUNE.md.
Should you have any questions, please create an issue in this repository or contact us at ganzobtn@gmail.com, Omunkhuush01@gmail.com or mgochoo@uaeu.ac.ae.
Our code is based on VideoMAEv2. We thank its authors for releasing their code.

