Code for the paper "Attend, Copy, Parse - End-to-end information extraction from documents" (https://arxiv.org/abs/1812.07248) by Rasmus Berg Palm, Ole Winther and Florian Laws.

To train a `dates` or `amounts` parser:
- Put data files in `tasks/parsing/data/{amounts,dates}/{train,valid}.tsv`, following the format in the sample files.
- Modify `tasks/parsing/parser.py`: set the `type` variable to train either a `dates` or an `amounts` parser.
- Execute `PYTHONPATH="$PWD" python tasks/parsing/train.py` from the root of the repository.

To train the Attend, Copy, Parse model:
- Put data files in `tasks/acp/data`. One document per file, following the format in the sample file.
- Modify the split files in `tasks/acp/splits`. One document per line.
- Modify `field` in `AttendCopyParse` to train on different fields. Valid values are `[number, order_id, date, total, tla, tta, tp]`.
- Execute `PYTHONPATH="$PWD" python tasks/acp/train.py` from the root of the repository.
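
The split files list one document per line, so they can be generated rather than edited by hand. The following is a minimal sketch, not code from this repository: the helper `write_splits`, the split file names `train.txt` and `valid.txt`, and the idea that each line holds a document file name are all assumptions. Check the existing files in `tasks/acp/splits` for the real names and line format before using it.

```python
import random
from pathlib import Path


def write_splits(doc_names, out_dir, valid_fraction=0.1, seed=0):
    """Shuffle document names and write one name per line to two split files.

    Hypothetical helper: the file names train.txt and valid.txt are
    assumptions, not taken from the repository.
    """
    names = sorted(doc_names)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(names)
    n_valid = max(1, int(len(names) * valid_fraction))
    splits = {"valid.txt": names[:n_valid], "train.txt": names[n_valid:]}

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for file_name, split in splits.items():
        # One document per line, as the split files require.
        (out_dir / file_name).write_text("".join(name + "\n" for name in split))
    return splits
```

It could be called with the names of the files in `tasks/acp/data`, for example `write_splits([p.name for p in Path("tasks/acp/data").iterdir()], "tasks/acp/splits")`, assuming the split lines are plain file names.
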

To test a trained model:
- Modify `restore_all_path` in `tasks/acp/acp.py` to point to the saved model to restore weights from, e.g. `./snapshots/acp/best`.
- Execute `PYTHONPATH="$PWD" python tasks/acp/test.py` from the root of the repository.

During training:
- Every 20 training batches the eval split is evaluated. If the eval loss is better than the best seen so far, the model is saved under `./snapshots`.
- TensorBoard summaries are logged to `/tmp/tensorboard`.
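
The evaluate-and-save behaviour described above can be sketched as a plain loop. This is an illustration of the general pattern only, not the repository's training code: `train_with_checkpoints` and its callback parameters `train_step`, `eval_loss` and `save` are hypothetical names introduced for the example.

```python
def train_with_checkpoints(train_step, eval_loss, save, n_batches, eval_every=20):
    """Run n_batches training steps with periodic evaluation.

    Every eval_every batches, eval_loss() is called; save() is called only
    when the returned loss is lower than the best loss seen so far.
    All three callbacks are hypothetical stand-ins for real model code.
    """
    best = float("inf")
    for batch in range(1, n_batches + 1):
        train_step()
        if batch % eval_every == 0:
            loss = eval_loss()
            if loss < best:
                # New best eval loss: remember it and snapshot the model.
                best = loss
                save()
    return best
```

Because only improvements trigger a save, the snapshot on disk always corresponds to the lowest eval loss observed so far, which is what makes it a sensible checkpoint to restore for testing.
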

Future work, in order of difficulty:
- Apply to more domains
- Better non-Latin support by using a better character set (maybe byte-pair encoding)
- Handle multiple pages
- Remove the need for N-grams
- Take field dependencies into account, e.g. total fields should add up.
- Output invoice lines



