This repository was archived by the owner on Aug 6, 2025. It is now read-only.
1 change: 0 additions & 1 deletion .gitignore
@@ -9,7 +9,6 @@ __pycache__/

# log files
*.log
*.txt

# data files
data/senteval_data*
151 changes: 110 additions & 41 deletions README.md
@@ -3,6 +3,8 @@
SentEval is a library for evaluating the quality of sentence embeddings. We assess their generalization power by using them as features on a broad and diverse set of "transfer" tasks. **SentEval currently includes 17 downstream tasks**. We also include a suite of **10 probing tasks** which evaluate what linguistic properties are encoded in sentence embeddings. Our goal is to ease the study and the development of general-purpose fixed-size sentence representations.


**(05/08) SentEval new tasks: Added Russian datasets for downstream tasks; added example scripts for Russian and English datasets for four sentence encoders: Word2Vec, ELMo, BERT, Multilingual USE**

**(04/22) SentEval new tasks: Added probing tasks for evaluating what linguistic properties are encoded in sentence embeddings**

**(10/04) SentEval example scripts for three sentence encoders: [SkipThought-LN](https://github.com/ryankiros/layer-norm#skip-thoughts)/[GenSen](https://github.com/Maluuba/gensen)/[Google-USE](https://tfhub.dev/google/universal-sentence-encoder/1)**
@@ -18,28 +20,35 @@ This code is written in python. The dependencies are:
## Transfer tasks

### Downstream tasks
SentEval allows you to evaluate your sentence embeddings as features for the following *downstream* tasks:

| Task | Type | #train | #test | needs_train | set_classifier |
|---------- |------------------------------ |-----------:|----------:|:-----------:|:----------:|
| [MR](https://nlp.stanford.edu/~sidaw/home/projects:nbsvm) | movie review | 11k | 11k | 1 | 1 |
| [CR](https://nlp.stanford.edu/~sidaw/home/projects:nbsvm) | product review | 4k | 4k | 1 | 1 |
| [SUBJ](https://nlp.stanford.edu/~sidaw/home/projects:nbsvm) | subjectivity status | 10k | 10k | 1 | 1 |
| [MPQA](https://nlp.stanford.edu/~sidaw/home/projects:nbsvm) | opinion-polarity | 11k | 11k | 1 | 1 |
| [SST](https://nlp.stanford.edu/sentiment/index.html) | binary sentiment analysis | 67k | 1.8k | 1 | 1 |
| **[SST](https://nlp.stanford.edu/sentiment/index.html)** | **fine-grained sentiment analysis** | 8.5k | 2.2k | 1 | 1 |
| [TREC](http://cogcomp.cs.illinois.edu/Data/QA/QC/) | question-type classification | 6k | 0.5k | 1 | 1 |
| [SICK-E](http://clic.cimec.unitn.it/composes/sick.html) | natural language inference | 4.5k | 4.9k | 1 | 1 |
| [SNLI](https://nlp.stanford.edu/projects/snli/) | natural language inference | 550k | 9.8k | 1 | 1 |
| [MRPC](https://aclweb.org/aclwiki/Paraphrase_Identification_(State_of_the_art)) | paraphrase detection | 4.1k | 1.7k | 1 | 1 |
| [STS 2012](https://www.cs.york.ac.uk/semeval-2012/task6/) | semantic textual similarity | N/A | 3.1k | 0 | 0 |
| [STS 2013](http://ixa2.si.ehu.es/sts/) | semantic textual similarity | N/A | 1.5k | 0 | 0 |
| [STS 2014](http://alt.qcri.org/semeval2014/task10/) | semantic textual similarity | N/A | 3.7k | 0 | 0 |
| [STS 2015](http://alt.qcri.org/semeval2015/task2/) | semantic textual similarity | N/A | 8.5k | 0 | 0 |
| [STS 2016](http://alt.qcri.org/semeval2016/task1/) | semantic textual similarity | N/A | 9.2k | 0 | 0 |
| [STS B](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark#Results) | semantic textual similarity | 5.7k | 1.4k | 1 | 0 |
| [SICK-R](http://clic.cimec.unitn.it/composes/sick.html) | semantic textual similarity | 4.5k | 4.9k | 1 | 0 |
| [COCO](http://mscoco.org/) | image-caption retrieval | 567k | 5*1k | 1 | 0 |
SentEval allows you to evaluate your sentence embeddings as features for the following *downstream* tasks in English and Russian:

| Task | Type | Language | #train | #test | needs_train | set_classifier |
|---------- |------------------------------ |----------|-----------:|----------:|:-----------:|:----------:|
| [MR](https://nlp.stanford.edu/~sidaw/home/projects:nbsvm) | movie review | _English_ | 11k | 11k | 1 | 1 |
| [CR](https://nlp.stanford.edu/~sidaw/home/projects:nbsvm) | product review | _English_ | 4k | 4k | 1 | 1 |
| [SUBJ](https://nlp.stanford.edu/~sidaw/home/projects:nbsvm) | subjectivity status | _English_ | 10k | 10k | 1 | 1 |
| [MPQA](https://nlp.stanford.edu/~sidaw/home/projects:nbsvm) | opinion-polarity | _English_ | 11k | 11k | 1 | 1 |
| [SST](https://nlp.stanford.edu/sentiment/index.html) | binary sentiment analysis | _English_ | 67k | 1.8k | 1 | 1 |
| **[SST](https://nlp.stanford.edu/sentiment/index.html)** | **fine-grained sentiment analysis** | _English_ | 8.5k | 2.2k | 1 | 1 |
| [SST_RU](http://study.mokoron.com) | binary sentiment analysis | _Russian_ | 170k | 37.8k | 1 | 1 |
| **[SST_RU](http://www.dialog-21.ru/evaluation/2016/sentiment/)** | **fine-grained sentiment analysis** | _Russian_ | 43.5k | 9.5k | 1 | 1 |
| [TREC](http://cogcomp.cs.illinois.edu/Data/QA/QC/) | question-type classification | _English_ | 6k | 0.5k | 1 | 1 |
| [TREC_RU (translation)](http://cogcomp.cs.illinois.edu/Data/QA/QC/) | question-type classification | _Russian_ | 5.5k | 0.5k | 1 | 1 |
| [SICK-E](http://clic.cimec.unitn.it/composes/sick.html) | natural language inference | _English_ | 4.5k | 4.9k | 1 | 1 |
| [SICK-E_RU (translation)](http://clic.cimec.unitn.it/composes/sick.html) | natural language inference | _Russian_ | 4.5k | 4.9k | 1 | 1 |
| [SNLI](https://nlp.stanford.edu/projects/snli/) | natural language inference | _English_ | 550k | 9.8k | 1 | 1 |
| [MRPC](https://aclweb.org/aclwiki/Paraphrase_Identification_(State_of_the_art)) | paraphrase detection | _English_ | 4.1k | 1.7k | 1 | 1 |
| [MRPC_RU](https://github.com/Koziev/NLP_Datasets/tree/master/ParaphraseDetection/Data) | paraphrase detection | _Russian_ | 35.5k | 4.5k | 1 | 1 |
| [STS 2012](https://www.cs.york.ac.uk/semeval-2012/task6/) | semantic textual similarity | _English_ | N/A | 3.1k | 0 | 0 |
| [STS 2013](http://ixa2.si.ehu.es/sts/) | semantic textual similarity | _English_ | N/A | 1.5k | 0 | 0 |
| [STS 2014](http://alt.qcri.org/semeval2014/task10/) | semantic textual similarity | _English_ | N/A | 3.7k | 0 | 0 |
| [STS 2015](http://alt.qcri.org/semeval2015/task2/) | semantic textual similarity | _English_ | N/A | 8.5k | 0 | 0 |
| [STS 2016](http://alt.qcri.org/semeval2016/task1/) | semantic textual similarity | _English_ | N/A | 9.2k | 0 | 0 |
| [STS B](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark#Results) | semantic textual similarity | _English_ | 5.7k | 1.4k | 1 | 0 |
| [STS B RU (translation)](http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark#Results) | semantic textual similarity | _Russian_ | 5.7k | 1.4k | 1 | 0 |
| [SICK-R](http://clic.cimec.unitn.it/composes/sick.html) | semantic textual similarity | _English_ | 4.5k | 4.9k | 1 | 0 |
| [SICK-R_RU (translation)](http://clic.cimec.unitn.it/composes/sick.html) | semantic textual similarity | _Russian_ | 4.5k | 4.9k | 1 | 0 |
| [COCO](http://mscoco.org/) | image-caption retrieval | _English_ | 567k | 5*1k | 1 | 0 |

where **needs_train** means a model with parameters is learned on top of the sentence embeddings, and **set_classifier** means you can define the parameters of the classifier in the case of a classification task (see below).
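
As a minimal sketch of what defining the classifier parameters can look like, the dictionary below mirrors the settings used by the example scripts in this change (`nhid`, `optim`, `tenacity`, `epoch_size`). The helper name `make_params` is hypothetical and is not part of SentEval; it only illustrates how the `classifier` entry sits inside the parameters.

```python
def make_params(task_path, nhid=0, optim='rmsprop', tenacity=3, epoch_size=2):
    """Build a SentEval-style params dict (hypothetical helper, not a SentEval API)."""
    return {
        'task_path': task_path,
        'usepytorch': False,
        'kfold': 5,
        'batch_size': 128,
        # Per the table above, these settings matter for tasks
        # whose set_classifier column is 1.
        'classifier': {'nhid': nhid, 'optim': optim,
                       'tenacity': tenacity, 'epoch_size': epoch_size},
    }


params = make_params('../data')
```

The resulting dictionary has the same shape as `params_senteval` in the example scripts, so it could be passed to the engine in the same way.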

@@ -48,25 +57,26 @@ Note: COCO comes with ResNet-101 2048d image embeddings. [More details on the ta
### Probing tasks
SentEval also includes a series of [*probing* tasks](https://github.com/facebookresearch/SentEval/tree/master/data/probing) to evaluate what linguistic properties are encoded in your sentence embeddings:

| Task | Type | #train | #test | needs_train | set_classifier |
|---------- |------------------------------ |-----------:|----------:|:-----------:|:----------:|
| [SentLen](https://github.com/facebookresearch/SentEval/tree/master/data/probing) | Length prediction | 100k | 10k | 1 | 1 |
| [WC](https://github.com/facebookresearch/SentEval/tree/master/data/probing) | Word Content analysis | 100k | 10k | 1 | 1 |
| [TreeDepth](https://github.com/facebookresearch/SentEval/tree/master/data/probing) | Tree depth prediction | 100k | 10k | 1 | 1 |
| [TopConst](https://github.com/facebookresearch/SentEval/tree/master/data/probing) | Top Constituents prediction | 100k | 10k | 1 | 1 |
| [BShift](https://github.com/facebookresearch/SentEval/tree/master/data/probing) | Word order analysis | 100k | 10k | 1 | 1 |
| [Tense](https://github.com/facebookresearch/SentEval/tree/master/data/probing) | Verb tense prediction | 100k | 10k | 1 | 1 |
| [SubjNum](https://github.com/facebookresearch/SentEval/tree/master/data/probing) | Subject number prediction | 100k | 10k | 1 | 1 |
| [ObjNum](https://github.com/facebookresearch/SentEval/tree/master/data/probing) | Object number prediction | 100k | 10k | 1 | 1 |
| [SOMO](https://github.com/facebookresearch/SentEval/tree/master/data/probing) | Semantic odd man out | 100k | 10k | 1 | 1 |
| [CoordInv](https://github.com/facebookresearch/SentEval/tree/master/data/probing) | Coordination Inversion | 100k | 10k | 1 | 1 |
| Task | Type | Language | #train | #test | needs_train | set_classifier |
|---------- |------------------------------ |----------|-----------:|----------:|:-----------:|:----------:|
| [SentLen](https://github.com/facebookresearch/SentEval/tree/master/data/probing) | Length prediction | _English_ | 100k | 10k | 1 | 1 |
| [WC](https://github.com/facebookresearch/SentEval/tree/master/data/probing) | Word Content analysis | _English_ | 100k | 10k | 1 | 1 |
| [TreeDepth](https://github.com/facebookresearch/SentEval/tree/master/data/probing) | Tree depth prediction | _English_ | 100k | 10k | 1 | 1 |
| [TopConst](https://github.com/facebookresearch/SentEval/tree/master/data/probing) | Top Constituents prediction | _English_ | 100k | 10k | 1 | 1 |
| [BShift](https://github.com/facebookresearch/SentEval/tree/master/data/probing) | Word order analysis | _English_ | 100k | 10k | 1 | 1 |
| [Tense](https://github.com/facebookresearch/SentEval/tree/master/data/probing) | Verb tense prediction | _English_ | 100k | 10k | 1 | 1 |
| [SubjNum](https://github.com/facebookresearch/SentEval/tree/master/data/probing) | Subject number prediction | _English_ | 100k | 10k | 1 | 1 |
| [ObjNum](https://github.com/facebookresearch/SentEval/tree/master/data/probing) | Object number prediction | _English_ | 100k | 10k | 1 | 1 |
| [SOMO](https://github.com/facebookresearch/SentEval/tree/master/data/probing) | Semantic odd man out | _English_ | 100k | 10k | 1 | 1 |
| [CoordInv](https://github.com/facebookresearch/SentEval/tree/master/data/probing) | Coordination Inversion | _English_ | 100k | 10k | 1 | 1 |

## Download datasets
To get all the transfer tasks datasets, run (in data/downstream/):
To get all English transfer task datasets, run (in data/En/downstream/):
```bash
./get_transfer_data.bash
```
This will automatically download and preprocess the downstream datasets, and store them in data/downstream (warning: for MacOS users, you may have to use p7zip instead of unzip). The probing tasks are already in data/probing by default.
This will automatically download and preprocess the English downstream datasets, and store them in data/En/downstream (warning: for MacOS users, you may have to use p7zip instead of unzip). The probing tasks are already in data/probing by default.
All Russian datasets are already stored in data/Ru/.

## How to use SentEval: examples

@@ -107,6 +117,64 @@ We also provide example scripts for three other encoders:
Note that for SkipThought and GenSen, following the steps of the associated githubs is necessary.
The Google encoder script should work as-is.

### RU_EN_examples

We provide example scripts for comparing models on the Russian and English datasets.

### RU_EN_examples/bert.py

* [Multilingual BERT encoder](https://github.com/google-research/bert)

In RU_EN_examples/bert.py, we evaluate the quality of the average of Multilingual BERT word embeddings.

To reproduce the results, run (in RU_EN_examples/):
```bash
python bert.py
```

### RU_EN_examples/muse.py

* [Multilingual Universal Sentence Encoder](https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3)

In RU_EN_examples/muse.py, we evaluate the quality of the Multilingual Universal Sentence Encoder.

To reproduce the results, run (in RU_EN_examples/):
```bash
python muse.py
```

### RU_EN_examples/elmo.py & RU_EN_examples/elmo_ru.py

* [English ELMo encoder](https://tfhub.dev/google/elmo/3)
* [Russian ELMo encoder](http://docs.deeppavlov.ai/en/master/features/pretrained_vectors.html)

In RU_EN_examples/elmo.py and RU_EN_examples/elmo_ru.py, we evaluate the quality of the average of ELMo word embeddings.

To reproduce the results, run (in RU_EN_examples/):
```bash
python elmo.py
python elmo_ru.py
```

### RU_EN_examples/word2vec.py & RU_EN_examples/word2vec_ru.py

* [English Word2Vec encoder](https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit)
* [Russian Word2Vec encoder](https://wikipedia2vec.github.io/wikipedia2vec/pretrained/)

In RU_EN_examples/word2vec.py and RU_EN_examples/word2vec_ru.py, we evaluate the quality of the average of Word2Vec word embeddings.

To get the **Word2Vec** models and reproduce our results, download the pretrained models and run the Python scripts (in RU_EN_examples/):
```bash
mkdir word2vec
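# Quote the URL so the shell does not treat '&' as a background operator
curl -Lo word2vec/GoogleNews-vectors-negative300.bin "https://drive.google.com/u/0/uc?export=download&confirm=WQpT&id=0B7XkCwpI5KDYNlNUTTlSS21pQmM"
mkdir word2vec_ru
# Download the Russian (ru) model, not the Spanish (es) one
curl -Lo word2vec_ru/ruwiki_20180420_300d.pkl.bz2 http://wikipedia2vec.s3.amazonaws.com/models/ru/2018-04-20/ruwiki_20180420_300d.pkl.bz2
# bzip2 decompresses in place, so no target directory argument is needed
bzip2 -d word2vec_ru/ruwiki_20180420_300d.pkl.bz2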

python word2vec.py
python word2vec_ru.py
```
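
The averaging used by these scripts can be sketched as follows. This is a simplified illustration rather than the code in `word2vec.py`: the function name `average_embedding`, the toy two-dimensional vectors, and the choice to return a zero vector for sentences with no known words are all assumptions made for the example.

```python
def average_embedding(tokens, vectors, dim):
    """Average the vectors of in-vocabulary tokens; zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) groups the values of each dimension across all known tokens.
    return [sum(component) / len(known) for component in zip(*known)]


def batcher(params, batch):
    # SentEval passes each sentence as a list of tokens.
    return [average_embedding(sent, params['vectors'], params['dim'])
            for sent in batch]


# Toy vocabulary standing in for a real Word2Vec model (illustrative only).
toy_params = {'vectors': {'cat': [1.0, 0.0], 'dog': [0.0, 1.0]}, 'dim': 2}
embeddings = batcher(toy_params, [['cat', 'dog'], ['unknown']])
```

A real batcher would typically look vectors up in the loaded model and return a NumPy array; plain lists are used here only to keep the sketch dependency-free.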

## How to use SentEval

To evaluate your sentence embeddings, SentEval requires that you implement two functions:
@@ -168,11 +236,12 @@ results = se.eval(transfer_tasks)
```
The current list of available tasks is:
```python
['CR', 'MR', 'MPQA', 'SUBJ', 'SST2', 'SST5', 'TREC', 'MRPC', 'SNLI',
'SICKEntailment', 'SICKRelatedness', 'STSBenchmark', 'ImageCaptionRetrieval',
'STS12', 'STS13', 'STS14', 'STS15', 'STS16',
'Length', 'WordContent', 'Depth', 'TopConstituents','BigramShift', 'Tense',
'SubjNumber', 'ObjNumber', 'OddManOut', 'CoordinationInversion']
['SST2', 'SST5', 'TREC', 'MRPC', 'SICKRelatedness', 'SICKEntailment', 'STSBenchmark',
'SST2_RU', 'SST3_RU', 'TREC_RU', 'MRPC_RU', 'SICKRelatedness_RU', 'SICKEntailment_RU',
'STSBenchmark_RU',
'CR', 'MR', 'MPQA', 'SUBJ', 'SNLI', 'STS12', 'STS13', 'STS14', 'STS15', 'STS16',
'ImageCaptionRetrieval', 'Length', 'WordContent', 'Depth', 'TopConstituents', 'BigramShift',
'Tense', 'SubjNumber', 'ObjNumber', 'OddManOut', 'CoordinationInversion']
```
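
Since the Russian tasks in the list above share the `_RU` suffix, a small helper can split a task list by language before calling `se.eval`. This is only a sketch: `split_by_language` is a hypothetical name, not a SentEval function, and it relies on the naming convention rather than on any task metadata.

```python
TASKS = ['SST2', 'SST5', 'TREC', 'MRPC', 'SICKRelatedness', 'SICKEntailment',
         'STSBenchmark', 'SST2_RU', 'SST3_RU', 'TREC_RU', 'MRPC_RU',
         'SICKRelatedness_RU', 'SICKEntailment_RU', 'STSBenchmark_RU']


def split_by_language(tasks):
    """Return (english_tasks, russian_tasks) based on the '_RU' suffix."""
    russian = [t for t in tasks if t.endswith('_RU')]
    english = [t for t in tasks if not t.endswith('_RU')]
    return english, russian


english_tasks, russian_tasks = split_by_language(TASKS)
```

This is convenient when an encoder supports only one language, for example when running a Russian-only model on `russian_tasks`.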

## SentEval parameters
70 changes: 70 additions & 0 deletions RU_EN_examples/bert.py
@@ -0,0 +1,70 @@
from __future__ import absolute_import, division

import sys
import logging
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

# Set PATHs
PATH_TO_SENTEVAL = '../'
PATH_TO_DATA = '../data'

# import SentEval
sys.path.insert(0, PATH_TO_SENTEVAL)
import senteval


# SentEval prepare and batcher
def prepare(params, samples):
    # Initialize Multilingual BERT model
    params.tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
    params.model = BertModel.from_pretrained('bert-base-multilingual-cased')
    params.model.eval()
    return


def get_sentence_embedding(text, params):
    # Keep only the first 500 characters of the input
    text = text[:500]
    marked_text = "[CLS] " + text + " [SEP]"
    tokenized_text = params.tokenizer.tokenize(marked_text)
    indexed_tokens = params.tokenizer.convert_tokens_to_ids(tokenized_text)
    segments_ids = [1] * len(tokenized_text)
    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segments_ids])

    with torch.no_grad():
        encoded_layers, _ = params.model(tokens_tensor, segments_tensors)
    # Stack the layers and reorder to (tokens, layers, hidden)
    token_embeddings = torch.stack(encoded_layers, dim=0)
    token_embeddings = torch.squeeze(token_embeddings, dim=1)
    token_embeddings = token_embeddings.permute(1, 0, 2)
    # Sum of the last four layers per token
    token_vecs = []
    for token in token_embeddings:
        sum_vec = torch.sum(token[-4:], dim=0)
        token_vecs.append(sum_vec)
    # NOTE: this assignment overwrites the summed vectors above, so the
    # embedding is the mean of the layer at index 11 only
    token_vecs = encoded_layers[11][0]
    sentence_embedding = torch.mean(token_vecs, dim=0)
    return sentence_embedding.numpy()


def batcher(params, batch):
    batch = [' '.join(sent) if sent != [] else '.' for sent in batch]
    embeddings = [get_sentence_embedding(sentence, params) for sentence in batch]
    return embeddings


# Set params for SentEval
params_senteval = {'task_path': PATH_TO_DATA, 'usepytorch': False, 'kfold': 5, 'batch_size': 128,
'classifier': {'nhid': 0, 'optim': 'rmsprop', 'tenacity': 3, 'epoch_size': 2}}

# Set up logger
logging.basicConfig(format='%(asctime)s : %(message)s', level=logging.DEBUG)

if __name__ == "__main__":
    se = senteval.engine.SE(params_senteval, batcher, prepare)
    # A comma after 'MRPC_RU' is required; without it Python concatenates
    # the adjacent string literals into 'MRPC_RUSTSBenchmark'
    transfer_tasks = ['SICKEntailment', 'SST2', 'SST5', 'TREC', 'MRPC',
                      'SICKEntailment_RU', 'SST2_RU', 'SST3_RU', 'TREC_RU', 'MRPC_RU',
                      'STSBenchmark', 'SICKRelatedness',
                      'STSBenchmark_RU', 'SICKRelatedness_RU'
                      ]
    results = se.eval(transfer_tasks)
    print(results)
50 changes: 50 additions & 0 deletions RU_EN_examples/elmo.py
@@ -0,0 +1,50 @@
from __future__ import absolute_import, division

import sys
import logging
import tensorflow_hub as hub
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Set PATHs
PATH_TO_SENTEVAL = '../'
PATH_TO_DATA = '../data'

# import SentEval
sys.path.insert(0, PATH_TO_SENTEVAL)
import senteval


# SentEval prepare and batcher
def prepare(params, samples):
    # Load model
    params.model = hub.Module("https://tfhub.dev/google/elmo/3", trainable=False)
    return


def batcher(params, batch):
    batch = [' '.join(sent) if sent != [] else '.' for sent in batch]
    embeddings = params.model(batch, signature="default", as_dict=True)["elmo"]

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(tf.tables_initializer())

        embeddings_array = sess.run(tf.reduce_mean(embeddings, 1))
    return embeddings_array


# Set params for SentEval
params_senteval = {'task_path': PATH_TO_DATA, 'usepytorch': False, 'kfold': 5, 'batch_size': 128,
'classifier': {'nhid': 0, 'optim': 'rmsprop', 'tenacity': 3, 'epoch_size': 2}}

# Set up logger
logging.basicConfig(format='%(asctime)s : %(message)s', level=logging.DEBUG)

if __name__ == "__main__":
    se = senteval.engine.SE(params_senteval, batcher, prepare)
    transfer_tasks = ['SICKEntailment', 'SST2', 'SST5', 'TREC', 'MRPC',
                      'STSBenchmark', 'SICKRelatedness'
                      ]
    results = se.eval(transfer_tasks)
    print(results)