Utility scripts for integrating with Google Cloud Document AI API and processing document data.
Built for medtax-ocr-prototype on GCP. Let us know if we missed anything.
- Extracts document data from PDFs/images using Document AI.
- Processes and finalizes structured JSON output.
- Supports local testing or running in Google Cloud Shell.
- Includes a Firestore writing process to store data.
Note: If you are using Google Cloud Shell, you can skip the setup section.
Use the following link to auto clone the repository to the Google Cloud Shell
Note: Google Cloud Shell has a weekly usage quota of 50 hours.
- Install Google Cloud SDK:
Download GoogleCloudSDKInstaller.exe
- In your project folder, run:
gcloud init
gcloud auth application-default login
This opens a browser. Sign in with the Google account that has project access. If a browser does not open, copy the link shown in the terminal and paste it into your browser.
- Set the project:
gcloud config set project medtax-ocr-prototype
- Install Dependencies:
pip install -r requirements.txt
- In extractor_caller.py, uncomment and update:
gcs_output_uri = "gs://practice_sample_training/docai/"
gcs_input_uri = "gs://run-sources-medtax-ocr-prototype-us-central1/4 form 2307 pictures.pdf"
input_mime_type = "application/pdf"
gcs_input_uri: path to the document you want to process.
gcs_output_uri: path where processed files will be saved.
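As a rough illustration of how these three settings could feed a Document AI batch request, here is a minimal sketch. It is not necessarily what extractor_caller.py does: the helper names (split_gcs_uri, validate_settings, build_batch_request) and the processor_name argument are hypothetical, and the MIME type set is only an assumed subset of what Document AI accepts. The google-cloud-documentai import is deferred so the validation helpers can be tried without the library installed.

```python
from __future__ import annotations

# Assumption: a small subset of input types; check the Document AI docs
# for the full list before relying on it.
SUPPORTED_MIME_TYPES = {"application/pdf", "image/jpeg", "image/png", "image/tiff"}


def split_gcs_uri(uri: str) -> tuple[str, str]:
    """Split 'gs://bucket/path' into (bucket, path). Hypothetical helper."""
    if not uri.startswith("gs://"):
        raise ValueError(f"not a gs:// URI: {uri!r}")
    bucket, _, path = uri[len("gs://"):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name in {uri!r}")
    return bucket, path


def validate_settings(gcs_input_uri: str, gcs_output_uri: str, input_mime_type: str) -> None:
    """Catch obvious mistakes before any API call is made."""
    _, input_path = split_gcs_uri(gcs_input_uri)
    if not input_path or input_path.endswith("/"):
        raise ValueError("gcs_input_uri must point at a file, not a bucket or folder")
    split_gcs_uri(gcs_output_uri)  # only checks the gs:// form
    if input_mime_type not in SUPPORTED_MIME_TYPES:
        raise ValueError(f"unexpected MIME type: {input_mime_type!r}")


def build_batch_request(processor_name, gcs_input_uri, gcs_output_uri, input_mime_type):
    """Build a BatchProcessRequest for one input document.

    processor_name is assumed to look like
    'projects/PROJECT/locations/LOCATION/processors/PROCESSOR_ID'.
    """
    from google.cloud import documentai  # pip install google-cloud-documentai

    validate_settings(gcs_input_uri, gcs_output_uri, input_mime_type)
    gcs_document = documentai.GcsDocument(gcs_uri=gcs_input_uri, mime_type=input_mime_type)
    input_config = documentai.BatchDocumentsInputConfig(
        gcs_documents=documentai.GcsDocuments(documents=[gcs_document])
    )
    output_config = documentai.DocumentOutputConfig(
        gcs_output_config=documentai.DocumentOutputConfig.GcsOutputConfig(
            gcs_uri=gcs_output_uri
        )
    )
    return documentai.BatchProcessRequest(
        name=processor_name,
        input_documents=input_config,
        document_output_config=output_config,
    )
```

The returned request could then be passed to documentai.DocumentProcessorServiceClient().batch_process_documents(request), which returns a long-running operation to wait on.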
To see the results:
Uncomment the print lines in handle_data.py to see the results in your terminal.
Or check the output files in your GCS bucket. Files ending with _finalized.json contain the extracted values.
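If you would rather pull the finalized files from a script than browse the bucket, something like the following sketch may help. It assumes the google-cloud-storage package and working Application Default Credentials. The function names (finalized_names, load_finalized) are hypothetical and not part of this repository.

```python
import json

FINALIZED_SUFFIX = "_finalized.json"


def finalized_names(blob_names):
    """Return, sorted, only the object names that end with _finalized.json."""
    return sorted(name for name in blob_names if name.endswith(FINALIZED_SUFFIX))


def load_finalized(bucket_name, prefix):
    """Download and parse every finalized JSON file under a bucket prefix.

    Example arguments (taken from the sample gcs_output_uri):
    bucket_name="practice_sample_training", prefix="docai/".
    """
    from google.cloud import storage  # pip install google-cloud-storage

    client = storage.Client()
    blobs = {blob.name: blob for blob in client.list_blobs(bucket_name, prefix=prefix)}
    return {
        name: json.loads(blobs[name].download_as_text())
        for name in finalized_names(blobs)
    }
```

Only finalized_names runs without network access. load_finalized needs read permission on the bucket.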