CharBERT for IR, NER and sentiment analysis

This repository contains the code, datasets, and finetuned models developed by our team Poli2Vec for our project in the DNLP course at PoliTo.
Authors: Mingrino Davide, Pietropaolo Emanuele, Martini Martina, Lungo Vaschetti Jacopo

Base Models + Finetuned

We provide two models; the download links are below:

Base

Finetuned

Datasets

SST2

SST2 plain
SST2 Adv.

BioMed NER

BioMed plain
BioMed Adv.

Directory Guide

root_directory
    |- modeling    # source code of the CharBERT model
    |- data        # character-attack datasets and the dicts for CharBERT
    |- processors  # source code for processing the datasets
    |- IR_eval     # source code for testing the IR part of our RAG pipeline
    |- notebooks   # our notebooks
    |- shell       # example shell scripts for training and evaluation
    |- run_*.py    # scripts for pre-training or finetuning

Requirements

Python 3.6
PyTorch 1.2
Transformers 2.4.0
sentence_transformers 2.4.0
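
A minimal environment setup sketch, assuming pip and the version pins above; the PyPI package names and the exact PyTorch wheel for your CUDA version are assumptions to verify for your machine:

# setup sketch; pick the torch wheel that matches your CUDA version
pip install torch==1.2.0
pip install transformers==2.4.0
pip install sentence-transformers==2.4.0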

Performance

Sentence Classification (SST2)

Plain Train

Model      Plain test   Adv. test
BERT       92.09        89.56
CharBERT   91.28        89.22

Adv. Train

Model      Plain test   Adv. test
BERT       92.20        90.02
CharBERT   90.94        90.25

Token Classification (BioMed NER)

Plain Train

Model      Plain test   Adv. test
BERT       27.62        31.61
CharBERT   46.01        50.09

Adv. Train

Model      Plain test   Adv. test
BERT       31.11        27.60
CharBERT   54.09        47.06

Usage

You may adjust the hyper-parameters to fit your computing device, but this may require further tuning, especially of learning_rate and num_train_epochs.

SST2 finetuning

MODEL_DIR=YOUR_MODEL_PATH/charbert-bert-pretrain
SST2_DIR=YOUR_DATA_PATH/SST2
OUTPUT_DIR=YOUR_OUTPUT_PATH/SST2
python run_glue.py \
    --model_name_or_path ${MODEL_DIR} \
    --task_name sst-2 \
    --model_type bert \
    --do_train \
    --do_eval \
    --data_dir ${SST2_DIR} \
    --max_seq_length 128 \
    --per_gpu_eval_batch_size=16 \
    --per_gpu_train_batch_size=16 \
    --char_vocab ./data/dict/bert_char_vocab \
    --learning_rate 3e-5 \
    --save_steps 1000 \
    --num_train_epochs 8.0 \
    --overwrite_output_dir \
    --output_dir ${OUTPUT_DIR}
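
To score the finetuned checkpoint (saved in OUTPUT_DIR above) on the adversarial SST2 split alone, one option is to rerun run_glue.py in evaluation-only mode against the adv. data. This is a sketch, assuming the adv. partition follows the same file layout as the plain one; SST2_ADV_DIR is a hypothetical path.

SST2_ADV_DIR=YOUR_DATA_PATH/SST2_adv   # hypothetical path to the adv. partition
python run_glue.py \
    --model_name_or_path ${OUTPUT_DIR} \
    --task_name sst-2 \
    --model_type bert \
    --do_eval \
    --data_dir ${SST2_ADV_DIR} \
    --max_seq_length 128 \
    --per_gpu_eval_batch_size=16 \
    --char_vocab ./data/dict/bert_char_vocab \
    --output_dir ${OUTPUT_DIR}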

BioMed NER finetuning

You can point the BioMed_DIR variable at either the plain or the adv. partition, downloadable from the links above.

MODEL_DIR=YOUR_MODEL_PATH/charbert-bert-pretrain
BioMed_DIR=YOUR_DATA_PATH/BioMed_NER
OUTPUT_DIR=YOUR_OUTPUT_PATH/BioMed_NER
python run_ner.py \
    --model_type bert \
    --do_train \
    --do_eval \
    --model_name_or_path ${MODEL_DIR} \
    --char_vocab ./data/dict/bert_char_vocab \
    --data_dir ${BioMed_DIR} \
    --labels ${BioMed_DIR}/ner_tags.txt \
    --output_dir ${OUTPUT_DIR} \
    --learning_rate 3e-5 \
    --num_train_epochs 1 \
    --per_gpu_train_batch_size 4 \
    --per_gpu_eval_batch_size 4 \
    --save_steps 2000 \
    --max_seq_length 512 \
    --overwrite_output_dir
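
Analogously, a sketch for evaluating an already finetuned checkpoint (saved in OUTPUT_DIR above) on the adversarial BioMed partition; BioMed_ADV_DIR is a hypothetical path, and this assumes the adv. partition ships the same ner_tags.txt label file.

BioMed_ADV_DIR=YOUR_DATA_PATH/BioMed_NER_adv   # hypothetical path to the adv. partition
python run_ner.py \
    --model_type bert \
    --do_eval \
    --model_name_or_path ${OUTPUT_DIR} \
    --char_vocab ./data/dict/bert_char_vocab \
    --data_dir ${BioMed_ADV_DIR} \
    --labels ${BioMed_ADV_DIR}/ner_tags.txt \
    --per_gpu_eval_batch_size 4 \
    --max_seq_length 512 \
    --output_dir ${OUTPUT_DIR}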

About

This project builds on CharBERT: Character-aware Pre-trained Language Model (COLING 2020).
