A Python tool that detects and extracts tabular data from PDF bank statements using deep learning models and OCR. Outputs structured data in CSV format.
- Table detection using Microsoft's Table Transformer
- Table structure recognition (rows/columns)
- Dual OCR validation with EasyOCR and Tesseract
- Automatic CSV generation with data cleaning
- PDF-to-image conversion support

1. PDF to Image Conversion: The input PDF is converted to an image using pdf2image and poppler-utils.
2. Table Detection: The Table Transformer model detects table boundaries in the image. Bounding boxes are drawn around tables using coordinates from model predictions.
3. Table Cropping: The detected table region is cropped from the image for further processing.
4. Table Structure Recognition: A second Transformer model identifies internal structures (rows, columns, cells) within the cropped table image.
5. Cell Coordinate Calculation: Rows and columns are sorted, and cell coordinates are mapped using their bounding boxes.
6. OCR Text Extraction: Each cell is processed with Tesseract and EasyOCR. Results are combined to improve accuracy (e.g., selecting longer text to avoid truncation). Empty cells are filtered using pixel density checks.
7. Post-Processing: Text is cleaned (e.g., removing invalid characters, fixing date formats like "AOR" → "Apr"). Data is aligned into a 5-column structure (Date, Description, Withdrawal, Deposit, Balance).
8. CSV Output: Final data is written to output.csv, with headers validated and numeric fields standardized.
9. Result Verification: The CSV is read using Pandas and displayed to the user.
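
The cell-coordinate, OCR-merging, and post-processing steps above can be sketched in plain Python. This is a minimal illustration, not the tool's actual code: the function names (`build_cell_grid`, `merge_ocr`, `clean_text`, `align_row`, `rows_to_csv`), the `(xmin, ymin, xmax, ymax)` box layout, and the character whitelist used to define "invalid characters" are all assumptions made for the example. Only the "AOR" → "Apr" fix and the five column names come from the description above.

```python
import csv
import io
import re

# Column layout named in the post-processing step.
COLUMNS = ["Date", "Description", "Withdrawal", "Deposit", "Balance"]

# Known OCR misreads; "AOR" -> "Apr" is the only one the text names.
MONTH_FIXES = {"AOR": "Apr"}


def build_cell_grid(row_boxes, col_boxes):
    """Intersect row and column boxes into a grid of cell boxes.

    Boxes are assumed to be (xmin, ymin, xmax, ymax) tuples.
    """
    rows = sorted(row_boxes, key=lambda b: b[1])  # top to bottom
    cols = sorted(col_boxes, key=lambda b: b[0])  # left to right
    # A cell takes its x-extent from the column and its y-extent from the row.
    return [[(c[0], r[1], c[2], r[3]) for c in cols] for r in rows]


def merge_ocr(tesseract_text, easyocr_text):
    """Keep the longer reading, assuming the shorter one was truncated."""
    a, b = tesseract_text.strip(), easyocr_text.strip()
    return a if len(a) >= len(b) else b


def clean_text(text):
    """Apply known misread fixes, then drop characters outside a whitelist."""
    for wrong, right in MONTH_FIXES.items():
        text = re.sub(rf"\b{wrong}\b", right, text)
    # Assumed definition of "invalid": anything but letters, digits, and " .,/-"
    return re.sub(r"[^A-Za-z0-9 .,/\-]", "", text).strip()


def align_row(cells):
    """Clean each cell, then truncate or pad the row to five columns."""
    cleaned = [clean_text(c) for c in cells][: len(COLUMNS)]
    return cleaned + [""] * (len(COLUMNS) - len(cleaned))


def rows_to_csv(rows):
    """Serialize aligned rows to CSV text with the expected header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(COLUMNS)
    writer.writerows(align_row(r) for r in rows)
    return buffer.getvalue()
```

In a real run, the row and column boxes would come from the structure recognition model and the two strings passed to `merge_ocr` from Tesseract and EasyOCR. Writing the result of `rows_to_csv` to `output.csv` would correspond to the CSV output step.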