SMaTD (Surrogate Machine Translation Detection) is a tool implemented in python that leverages the internal representations of a machine translation (MT) system to determine, given a source sentence, whether a translation is human- or machine-generated.
In this repository, you will find the code, along with the dataset and the models trained during our research, available in the releases section.
To install SMaTD, first clone the code from the repository:
git clone https://github.com/transducens/SMaTD.gitCreate a conda environment to isolate the python dependencies and install pytorch:
conda create -n smatd -c conda-forge python==3.11.9
conda activate smatd
# Follow https://pytorch.org/get-started/locally/ to install pytorch 2.4.0Install SMaTD:
cd SMaTD
pip3 install .Check out the installation:
# Usage
smatd --help
smatd-lm-baseline --helpSome scripts require pickle files, but generating sentence representations on the fly is also supported. Use smatd/nllb_get_log_prob.py to create these files. This consumes more disk space but greatly reduces time when training for multiple epochs.
Check the --help flag for smatd and smatd-lm-baseline to see the available configuration options. The experiment scripts may also serve as a good starting point for training. For inference, you may check out the file inference_example.sh.
If you use SMaTD or the resources provided in this repository, please cite our work as follows:
@misc{garcíaromero2025automaticmachinetranslationdetection,
title={Automatic Machine Translation Detection Using a Surrogate Multilingual Translation Model},
author={Cristian Garc\'ia-Romero, Miquel Espl\`a-Gomis, Felipe S\'anchez-Mart\'inez},
year={2025},
eprint={2511.02958},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2511.02958},
}
This work was co-funded by the Conselleria de Innovación, Universidades, Ciencia y Sociedad Digital (Generalitat Valenciana) and the European Social Fund through grant CIACIF/2021/365. It was also part of the work conducted in R+D+i projects PID2021-27999NB-I00 and PID2024-158157OB-C31 funded by the Spanish Ministry of Science and Innovation (MCIN), the Spanish Research Agency (AEI/10.13039/501100011033) and the European Regional Development Fund A way to make Europe. Some of the computational resources used were funded by the Valencia Government and the European Regional Development Fund (ERDF) through project IDIFEDER/2020/003.

