This repository provides Polish translations of the Spider, CoSQL, SParC, Spider-DK, and Spider-Syn datasets, along with code for several experiments.
📄 Associated master's thesis: download link.
Polish translations are ready to download from Hugging Face Datasets 🤗
The datasets directory contains scripts for dataset synthesis.
# clone repository
git clone https://github.com/klima7/Polish-Spider
# create environment
conda create -n pol-spider python=3.10
conda activate pol-spider
pip install -r requirements.txt
# download spacy model
python -m spacy download xx_sent_ud_sm

Then download the original English databases from here and place them inside datasets/components/database.
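Before running synthesis, it can help to confirm that the databases landed in the right place. The following is a minimal sketch, not part of the repository: the helper name `list_sqlite_databases` is hypothetical, and it assumes the layout used by the original Spider release, where each database lives at `<db_id>/<db_id>.sqlite`.

```python
from pathlib import Path


def list_sqlite_databases(root: str) -> list[str]:
    """Return sorted database ids found as <root>/<db_id>/<db_id>.sqlite.

    Hypothetical helper; the directory layout is an assumption based on
    the original Spider distribution.
    """
    root_path = Path(root)
    if not root_path.is_dir():
        return []
    found = []
    for db_dir in sorted(p for p in root_path.iterdir() if p.is_dir()):
        # A database counts as present only if its .sqlite file exists.
        if (db_dir / f"{db_dir.name}.sqlite").is_file():
            found.append(db_dir.name)
    return found


if __name__ == "__main__":
    dbs = list_sqlite_databases("datasets/components/database")
    print(f"found {len(dbs)} databases")
```

If the count printed is zero, the archive was probably extracted one directory level too deep or too shallow.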
Synthesize a dataset named pol-spider-en, based on samples from spider: translate the questions to Polish, apply context-curated translation to schema names, and translate strings in SQL queries to Polish:
python datasets/scripts/synthesize.py spider pol-spider-en \
    --question-lang pl \
    --schema-translation context-curated \
    --query-lang pl \
    --with-db

Create the pol-spider dataset by joining pol-spider-en and pol-spider-pl:
python datasets/scripts/join.py pol-spider pol-spider-en pol-spider-pl

The app directory contains a Streamlit app that makes it easy to use the C3SQL and RESDSQL models.
To use the RESDSQL model, download the weights from Hugging Face 🤗 and place them inside app/models.
cd app
docker compose up --build

The experiments directory contains dockerized code for experiments with RAT-SQL, BRIDGE, RESDSQL, and C3.
The evaluation directory contains code for calculating metrics.
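To give a feel for what such a metric looks like, here is a minimal sketch of execution accuracy, a metric commonly reported for Spider-style text-to-SQL benchmarks. It is an illustration only, not the code in the evaluation directory: the function names are hypothetical, and comparing result rows as unordered multisets is a simplifying assumption (the official Spider evaluation handles ordering and value matching in more detail).

```python
import sqlite3
from collections import Counter


def run_query(db_path: str, sql: str):
    """Execute sql against a SQLite database; return rows, or None on error."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    finally:
        conn.close()


def execution_match(db_path: str, predicted_sql: str, gold_sql: str) -> bool:
    """True if both queries run and return the same rows, ignoring row order."""
    pred = run_query(db_path, predicted_sql)
    gold = run_query(db_path, gold_sql)
    if pred is None or gold is None:
        return False
    # Counter compares the results as multisets of row tuples.
    return Counter(pred) == Counter(gold)


def execution_accuracy(samples) -> float:
    """samples: iterable of (db_path, predicted_sql, gold_sql) triples."""
    samples = list(samples)
    if not samples:
        return 0.0
    hits = sum(execution_match(*sample) for sample in samples)
    return hits / len(samples)
```

Ignoring row order keeps the sketch short, but it also means a prediction that drops a required ORDER BY still counts as correct, so treat the number as an upper bound on a stricter score.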
