Set up performance metrics eval on SQuAD

... comparing current implementation to fine-tuned models on HF

Research metrics from HF

### Notes

SQuAD "validation" examples, in contrast to "train" examples, can have more than one reference answer.
* HF evaluation appears to count a prediction as correct if it matches any one of the reference answers
* TBD: confirm this and implement it
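
A minimal sketch of the "any reference answer counts" rule, to make the TBD concrete. It assumes the usual SQuAD v1.1 style normalization (lowercase, strip punctuation and the articles a/an/the, collapse whitespace) and takes the best score over all reference answers. The function names `normalize_answer`, `exact_match`, and `f1_score` are hypothetical helpers for this note, not HF API; the actual HF behavior still needs to be confirmed against it.

```python
import re
import string
from collections import Counter

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: list[str]) -> bool:
    """True if the normalized prediction equals ANY normalized reference."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(ref) for ref in references)


def f1_score(prediction: str, references: list[str]) -> float:
    """Token-level F1, taking the maximum over all reference answers."""
    pred_tokens = normalize_answer(prediction).split()
    best = 0.0
    for ref in references:
        ref_tokens = normalize_answer(ref).split()
        # Multiset intersection counts tokens shared by both answers.
        overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
        if overlap == 0:
            continue
        precision = overlap / len(pred_tokens)
        recall = overlap / len(ref_tokens)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


if __name__ == "__main__":
    refs = ["Denver Broncos", "Broncos"]
    print(exact_match("The Denver Broncos", refs))
    print(f1_score("Denver", refs))
```

Averaging `exact_match` and `f1_score` over the validation set would give the two headline numbers to compare against the fine-tuned models.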

See https://huggingface.co/learn/nlp-course/chapter7/7?fw=pt#processing-the-training-data