@polyfloyd

I'm trying to set up a workflow with multiple importers, one of which is a fall-back that matches any kind of PDF document.

There is currently no real way to achieve this, since beangulp does not permit more than one importer to match a given file.

To work around this, I can create multiple distinct importer scripts. But these should then be able to tell whether they are able to process a given file. I think the most common way to signal this is to set an exit status, which is what this patch proposes.

The status being set is 100, which is distinct from any other error. The exit status is only set if no importer is available for any of the specified files.

@dnicolodi
Collaborator

I'm not sure I understand how your cascaded import scripts are supposed to work and how a specific exit status may help there. Can you provide more details? The way I would do it is to have the more specific importers run first and archive the processed files (to a temporary location that gets removed at the end of the import process, if you don't want to keep them around), to leave only the unmatched ones to be processed by the generic importer. Running the generic importer when no files to be handled are left is not an error. On the other hand, the proposed change cannot tell you if the more specific importer matched some files, but others are left to be handled by the generic importer.

Making the case where there are no source documents to be imported into an error is a breaking change that I would like to avoid. The choice of 100 as exit status is particularly bad as conventionally high exit statuses are reserved for system error conditions.

@polyfloyd
Author

I have importers in distinct scripts. For example, I have importers that know the formats of invoices for deposit returns and my payment provider that can quickly and reliably produce the correct transactions.

Then, I have an importer script that spins up an LLM to look for the information in the document to fill out a transaction. But it should not attempt to process the documents for well-known formats.

With exit codes, I could string these scripts together using shell:

./scripts/import-known ... || ./scripts/import-with-llm ...

@dnicolodi
Collaborator

dnicolodi commented Jan 13, 2026

What you propose is exactly the scheme that does not work. Let's assume that there is a FooImporter that handles reports in a well-known format and a GenericImporter that handles everything else. Assume that you have these documents:

foo1.pdf
foo2.pdf
report3.pdf

where the first two documents can be handled by FooImporter and the third needs to be handled by GenericImporter.

You propose to have two import scripts, import-known that runs FooImporter and import-with-llm that runs GenericImporter.

If you run them in sequence

import-known || import-with-llm

you get that FooImporter recognizes foo1.pdf and foo2.pdf, processes them, and terminates with a zero exit status. With ||, import-with-llm then never runs and report3.pdf is left unprocessed; run plainly one after the other, the same reports would instead get processed again by GenericImporter. You propose to change the import process to return a non-zero exit status if import-known does not match any document. However, in this case import-known would match two documents and still terminate with a zero exit status. What you propose would work only if you are certain that there are no documents matched by import-known in the imported documents folder. But if you already know that, you can simply run import-with-llm only and be happy.
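
The short-circuit behaviour of || can be checked with a small simulation. This is only a minimal sketch: import_known and import_with_llm are hypothetical stand-in functions (not beangulp commands) that match files by name, and import_known returns 100 only when nothing matches, as in the proposed patch.

```shell
#!/bin/sh
# Hypothetical stand-ins for the two import scripts; matching is simulated
# by file name rather than by real beangulp importers.
docs="foo1.pdf foo2.pdf report3.pdf"
handled=""

import_known() {
    matched=0
    for f in $docs; do
        case "$f" in
            foo*) handled="$handled known:$f"; matched=1 ;;
        esac
    done
    # Proposed behaviour: non-zero status only when nothing matched at all.
    [ "$matched" -eq 1 ] || return 100
}

import_with_llm() {
    # The fallback accepts every document it is given.
    for f in $docs; do
        handled="$handled llm:$f"
    done
}

# The right-hand side runs only if the left-hand side exits non-zero.
import_known || import_with_llm
echo "handled:$handled"
```

This prints "handled: known:foo1.pdf known:foo2.pdf". Because import_known matched something and returned zero, the fallback never ran and report3.pdf was not handled by either function.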

If you want to make the process independent of what kind of documents are in the import folder, you can simply execute something along the lines of:

docs="Downloads/"
temp=$(mktemp -d)
import-known extract "$docs" && \
import-known archive "$docs" -o "$temp" && \
import-with-llm extract "$docs" && \
import-with-llm archive "$docs" -o "$temp" && \
rm -r "$temp"

In this case, import-known processes foo1.pdf and foo2.pdf and moves them out of the way, and then import-with-llm processes the remaining report3.pdf and moves it out of the way, for good measure.
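
The same staged flow can be tried without any real importers. The following is a simulation under stated assumptions: import_known and import_with_llm are hypothetical stand-in functions, each one merging the extract and archive steps, and archiving is modelled as moving a file into a temporary directory.

```shell
#!/bin/sh
# Simulation only: the two functions are hypothetical stand-ins, and
# "archiving" is modelled as moving the file into a temporary directory.
docs=$(mktemp -d)
temp=$(mktemp -d)
trap 'rm -rf "$docs" "$temp"' EXIT
touch "$docs/foo1.pdf" "$docs/foo2.pdf" "$docs/report3.pdf"
log=""

import_known() {
    # Specific importer: handles only the well-known format.
    for f in "$docs"/foo*.pdf; do
        [ -e "$f" ] || continue
        log="$log known:$(basename "$f")"
        mv "$f" "$temp/"
    done
}

import_with_llm() {
    # Generic importer: handles whatever is still in the folder.
    for f in "$docs"/*.pdf; do
        [ -e "$f" ] || continue
        log="$log llm:$(basename "$f")"
        mv "$f" "$temp/"
    done
}

import_known && import_with_llm
remaining=$(ls "$docs" | wc -l | tr -d ' ')
echo "log:$log remaining: $remaining"
```

Each document shows up exactly once in the log and nothing remains in the source folder, whatever mix of documents it held at the start.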
