This notebook reads invoice images and extracts records such as the invoice date and item descriptions, using the Tesseract OCR engine.

Remon128/Invoice_OCR_Detection

Invoice_OCR_Detection

In this project I address the problem of extracting data from images. The project focuses on invoices: it reads invoice images and extracts records such as the invoice date and item descriptions, using the Tesseract OCR engine.

Table of Contents:

1- Importing Libraries

2- Loading Invoices Images

3- Method #1 Extracting data using Data Frames

4- Method #2 Extracting Data through Image to Strings output

5- String Output Preprocessing

6- Printing Extracted Values

Importing Libraries

from PIL import Image
import os
import pandas as pd
import numpy as np
import re
import string
import unicodedata

# Tesseract library
import pytesseract
from pytesseract import Output

# Warnings
import warnings
warnings.filterwarnings("ignore")

# Garbage collection
import gc

# Gensim library for text processing
import gensim.parsing.preprocessing as gsp
from gensim import utils

Loading Invoices

[Image: ex1]

Method #1 Extracting data using Data Frames

pytesseract.image_to_data("../input/invoice-ocr-data/invoice_2.jpg", output_type=Output.DATAFRAME)

Extracted Data

[Image: pytessract_toDF]
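
The DataFrame returned by `image_to_data` holds one row per detected word, with positional columns such as `block_num`, `par_num`, `line_num` and `word_num`, a `conf` score and the recognised `text`. A minimal sketch of how those rows might be regrouped into text lines follows. The helper name `words_to_lines` and the `min_conf` threshold are hypothetical and not part of the notebook.

```python
import pandas as pd


def words_to_lines(ocr_df: pd.DataFrame, min_conf: float = 0.0) -> list:
    """Rebuild text lines from a word-level OCR table (hypothetical helper)."""
    # Rows without text describe pages, blocks or lines rather than words.
    words = ocr_df.dropna(subset=["text"]).copy()
    # Structural rows carry conf == -1; low-confidence words can be dropped too.
    words = words[words["conf"].astype(float) >= min_conf]
    words["text"] = words["text"].astype(str).str.strip()
    words = words[words["text"] != ""]
    # Words that share page, block, paragraph and line numbers sit on one line.
    keys = ["page_num", "block_num", "par_num", "line_num"]
    ordered = words.sort_values(keys + ["word_num"])
    return [" ".join(parts) for _, parts in ordered.groupby(keys)["text"]]
```

Called as `words_to_lines(df, min_conf=50)`, where `df` is the DataFrame from the call above, this would keep only words Tesseract scored at 50 or higher. The right threshold depends on image quality and is an assumption here.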

Method #2 Extracting Data through Image to Strings output

pytesseract.image_to_string(Image.open(filepath), timeout=5)
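
When `timeout` is exceeded, pytesseract raises a `RuntimeError`, so a loop over many invoices can stop on a single slow image. The sketch below shows one way to guard against that. The helper `ocr_many` and its `ocr` parameter are hypothetical, introduced only so the loop can be tried without Tesseract installed.

```python
from typing import Callable, Dict, Iterable, Optional


def ocr_many(
    paths: Iterable[str],
    ocr: Optional[Callable[[str], str]] = None,
    timeout: float = 5,
) -> Dict[str, Optional[str]]:
    """Run OCR on each path; a timed-out image maps to None (hypothetical helper)."""
    if ocr is None:
        # Imported lazily so the helper can be exercised without Tesseract.
        import pytesseract
        from PIL import Image

        def ocr(path: str) -> str:
            with Image.open(path) as img:
                return pytesseract.image_to_string(img, timeout=timeout)

    results: Dict[str, Optional[str]] = {}
    for path in paths:
        try:
            results[path] = ocr(path)
        except RuntimeError:
            # pytesseract signals an exceeded timeout with RuntimeError.
            results[path] = None
    return results
```

Recording `None` instead of re-raising keeps the batch running. Whether skipped images should be retried with a longer timeout is a design choice left open here.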

String Output Preprocessing

# Create a list of pre-processing functions (gensim)
processes = [
    gsp.strip_tags,
    gsp.strip_multiple_whitespaces,
    gsp.remove_stopwords,
]

[Image: preprocessing_output]
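
Each entry in `processes` takes a string and returns a string, so the list can be applied by feeding the output of one filter into the next. The sketch below illustrates that chaining. The helper `apply_filters` is hypothetical, and the two stand-in filters are plain-Python approximations used so the example runs without gensim. They are not gensim's implementations.

```python
import re
from functools import reduce
from typing import Callable, Iterable


def apply_filters(text: str, filters: Iterable[Callable[[str], str]]) -> str:
    """Pass text through each filter in order (hypothetical helper)."""
    return reduce(lambda current, f: f(current), filters, text)


# Stand-in filters (assumptions), loosely mirroring two of the gensim functions.
def strip_angle_tags(text: str) -> str:
    return re.sub(r"<[^>]*>", " ", text)


def collapse_whitespace(text: str) -> str:
    return " ".join(text.split())


raw = "<b>Invoice</b>   Date:\n 01/02/2020"
cleaned = apply_filters(raw, [strip_angle_tags, collapse_whitespace])
print(cleaned)  # Invoice Date: 01/02/2020
```

With gensim installed, the same helper could be called as `apply_filters(ocr_text, processes)` using the list defined above.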

Extracted Data

[Image: String_Output]
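
Once the OCR output is a cleaned string, a record such as the invoice date can be pulled out with a regular expression. The sketch below assumes the invoice prints a label like "Invoice Date:" followed by a numeric date. The pattern and the helper name `find_invoice_date` are illustrative assumptions, not the notebook's own code, and real invoices may need other formats.

```python
import re
from typing import Optional

# Assumed layout: an optional "Invoice" prefix, the word "Date", an optional
# separator, then a numeric date such as 12/03/2021, 12-03-21 or 12.03.2021.
DATE_PATTERN = re.compile(
    r"(?:invoice\s+)?date\s*[:\-]?\s*(\d{1,2}[/.\-]\d{1,2}[/.\-]\d{2,4})",
    re.IGNORECASE,
)


def find_invoice_date(ocr_text: str) -> Optional[str]:
    """Return the first date that follows a 'date' label, or None."""
    match = DATE_PATTERN.search(ocr_text)
    return match.group(1) if match else None


sample = "ACME Ltd\nInvoice Date: 12/03/2021\nTotal 40.00"
print(find_invoice_date(sample))  # 12/03/2021
```

Because the pattern matches any "date" label, a "Due Date" line that appears first would be returned instead. Tightening the prefix is one way to avoid that.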
