CharBERT for IR, NER and sentiment analysis

This repository contains the code, datasets, and finetuned models developed by our team Poli2Vec for our project in the DNLP course at PoliTo.
Authors: Mingrino Davide, Pietropaolo Emanuele, Martini Martina, Lungo Vaschetti Jacopo

Base Models + Finetuned

We provide two models; the download links are below:

Base

Finetuned

Datasets

SST2

SST2 plain
SST2 Adv.

BioMed NER

BioMed plain
BioMed Adv.

Directory Guide

root_directory
    |- modeling    # source code of the CharBERT model
    |- data        # character-attack datasets and the dicts for CharBERT
    |- processors  # source code for processing the datasets
    |- IR_eval     # source code for testing the IR part of our RAG pipeline
    |- notebooks   # our notebooks
    |- shell       # example shell scripts for training and evaluation
    |- run_*.py    # scripts for pre-training or finetuning

Requirements

Python 3.6
PyTorch 1.2
Transformers 2.4.0
sentence_transformers 2.4.0
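
A minimal environment setup sketch, assuming pip and the version pins above; the PyPI package names and the exact PyTorch wheel for your CUDA version are assumptions to verify for your machine:

# setup sketch; pick the torch wheel that matches your CUDA version
pip install torch==1.2.0
pip install transformers==2.4.0
pip install sentence-transformers==2.4.0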

Performance

Sentence Classification (SST2)

Plain Train

Model      Plain test   Adv. test
BERT       92.09        89.56
CharBERT   91.28        89.22

Adv. Train

Model      Plain test   Adv. test
BERT       92.20        90.02
CharBERT   90.94        90.25

Token Classification (BioMed NER)

Plain Train

Model      Plain test   Adv. test
BERT       27.62        31.61
CharBERT   46.01        50.09

Adv. Train

Model      Plain test   Adv. test
BERT       31.11        27.60
CharBERT   54.09        47.06

Usage

You may adjust the hyper-parameters to fit your computing device, but this may require further tuning, especially of learning_rate and num_train_epochs.

SST2 finetuning

MODEL_DIR=YOUR_MODEL_PATH/charbert-bert-pretrain
SST2_DIR=YOUR_DATA_PATH/SST2
OUTPUT_DIR=YOUR_OUTPUT_PATH/SST2
python run_glue.py \
    --model_name_or_path ${MODEL_DIR} \
    --task_name sst-2 \
    --model_type bert \
    --do_train \
    --do_eval \
    --data_dir ${SST2_DIR} \
    --max_seq_length 128 \
    --per_gpu_eval_batch_size=16 \
    --per_gpu_train_batch_size=16 \
    --char_vocab ./data/dict/bert_char_vocab \
    --learning_rate 3e-5 \
    --save_steps 1000 \
    --num_train_epochs 8.0 \
    --overwrite_output_dir \
    --output_dir ${OUTPUT_DIR}
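
To score the finetuned checkpoint (saved in OUTPUT_DIR above) on the adversarial SST2 split alone, one option is to rerun run_glue.py in evaluation-only mode against the adv. data. This is a sketch, assuming the adv. partition follows the same file layout as the plain one; SST2_ADV_DIR is a hypothetical path.

SST2_ADV_DIR=YOUR_DATA_PATH/SST2_adv   # hypothetical path to the adv. partition
python run_glue.py \
    --model_name_or_path ${OUTPUT_DIR} \
    --task_name sst-2 \
    --model_type bert \
    --do_eval \
    --data_dir ${SST2_ADV_DIR} \
    --max_seq_length 128 \
    --per_gpu_eval_batch_size=16 \
    --char_vocab ./data/dict/bert_char_vocab \
    --output_dir ${OUTPUT_DIR}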

BioMed NER finetuning

You can point the BioMed_DIR variable at either the plain or the adv. partition, downloadable from the links above.

MODEL_DIR=YOUR_MODEL_PATH/charbert-bert-pretrain
BioMed_DIR=YOUR_DATA_PATH/BioMed_NER
OUTPUT_DIR=YOUR_OUTPUT_PATH/BioMed_NER
python run_ner.py \
    --model_type bert \
    --do_train \
    --do_eval \
    --model_name_or_path ${MODEL_DIR} \
    --char_vocab ./data/dict/bert_char_vocab \
    --data_dir ${BioMed_DIR} \
    --labels ${BioMed_DIR}/ner_tags.txt \
    --output_dir ${OUTPUT_DIR} \
    --learning_rate 3e-5 \
    --num_train_epochs 1 \
    --per_gpu_train_batch_size 4 \
    --per_gpu_eval_batch_size 4 \
    --save_steps 2000 \
    --max_seq_length 512 \
    --overwrite_output_dir
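
Analogously, a sketch for evaluating an already finetuned checkpoint (saved in OUTPUT_DIR above) on the adversarial BioMed partition; BioMed_ADV_DIR is a hypothetical path, and this assumes the adv. partition ships the same ner_tags.txt label file.

BioMed_ADV_DIR=YOUR_DATA_PATH/BioMed_NER_adv   # hypothetical path to the adv. partition
python run_ner.py \
    --model_type bert \
    --do_eval \
    --model_name_or_path ${OUTPUT_DIR} \
    --char_vocab ./data/dict/bert_char_vocab \
    --data_dir ${BioMed_ADV_DIR} \
    --labels ${BioMed_ADV_DIR}/ner_tags.txt \
    --per_gpu_eval_batch_size 4 \
    --max_seq_length 512 \
    --output_dir ${OUTPUT_DIR}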

About

This project builds on CharBERT: Character-aware Pre-trained Language Model (COLING 2020).
