Implement corpus-based evaluation

Strategy:

1) export/extract the codes for the targets from set A
2) classify, then export/extract codes for the predictions from set B
3) use standard chunk evaluation code (Hugging Face?) to calculate metrics
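
A minimal sketch of step 3, assuming the exported codes from steps 1 and 2 can be represented as (document id, start offset, end offset, label) spans and that a prediction counts as correct only on an exact match. The names `Chunk` and `chunk_scores` are hypothetical and exist only for illustration. If the codes are instead converted to BIO tag sequences, an existing library such as seqeval (loadable through the Hugging Face `evaluate` package) could likely replace this function.

```python
from typing import Iterable, NamedTuple


class Chunk(NamedTuple):
    """One exported code (hypothetical representation, not from the source)."""
    doc_id: str
    start: int
    end: int
    label: str


def chunk_scores(targets: Iterable[Chunk], predictions: Iterable[Chunk]) -> dict:
    """Exact-match, micro-averaged precision, recall, and F1 over chunks."""
    gold = set(targets)
    pred = set(predictions)
    # A prediction is a true positive only if document, offsets, and label all match.
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "true_positives": tp}


# Toy data standing in for the codes extracted from set A (targets)
# and set B (predictions).
targets = [
    Chunk("doc1", 0, 5, "A"),
    Chunk("doc1", 10, 20, "B"),
    Chunk("doc2", 3, 8, "A"),
]
predictions = [
    Chunk("doc1", 0, 5, "A"),
    Chunk("doc1", 10, 18, "B"),   # boundary mismatch, counted as an error
    Chunk("doc2", 3, 8, "A"),
    Chunk("doc2", 9, 12, "B"),    # spurious prediction
]

scores = chunk_scores(targets, predictions)
print(scores)
```

Exact matching is the strictest choice; partial-overlap or per-label scoring would need a different matching rule.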