TRABALHOSPython - Python Data Engineering Projects

Professional data processing, analysis, and OCR extraction system.

Python 3.10+ | MIT License | Code style: black


Related Repositories - See My Progress

Refined projects that demonstrate technical growth and professional architecture:


English

About the Project

This repository contains Python modules for data engineering, data analysis, and OCR-based information extraction. Together they form an integrated system for processing Brazilian fiscal invoices, managing databases, and analyzing structured data.

Features

  • CSV Handler (src/csv_handler.py) - Safe CSV file processing with path validation
  • Database Manager (src/database.py) - SQLAlchemy manager with parameterized queries (SQL injection safe)
  • Invoice OCR (src/invoice_ocr.py) - Consolidated PDF invoice data extractor with Tesseract + OpenCV
  • Invoice Analysis (src/invoice_analysis.py) - Invoice data analysis and Excel export

Structure

TRABALHOSPython/
├── src/
│   ├── csv_handler.py          # CSV manager
│   ├── database.py             # Database manager
│   ├── invoice_ocr.py          # OCR extractor (consolidated)
│   └── invoice_analysis.py     # Invoice analysis
├── tests/                      # Unit tests
├── examples/                   # Example scripts
├── requirements.txt            # Project dependencies
├── .gitignore                  # Professional gitignore
└── README.md                   # This file

Installation

# Clone the repository
git clone https://github.com/bellDataSc/TRABALHOSPython.git
cd TRABALHOSPython

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
# or
venv\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt

# Set up Tesseract (only needed for OCR)
# Windows: https://github.com/UB-Mannheim/tesseract/wiki
# Linux: sudo apt-get install tesseract-ocr
# Mac: brew install tesseract

Usage Examples

CSV Handler

from src.csv_handler import read_csv, save_csv

# Read CSV
df = read_csv('data.csv')

# Save CSV
save_csv(df, 'output.csv')
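
The README does not show how the CSV handler validates paths. The following is a minimal sketch of one way such a check might work, using the standard library `csv` module instead of pandas so it runs without extra dependencies. `validate_csv_path` and `read_rows` are hypothetical names, not the API of `src/csv_handler.py`.

```python
import csv
from pathlib import Path


def validate_csv_path(path: str, base_dir: str = ".") -> Path:
    """Return the resolved path, or raise ValueError if it looks unsafe.

    Hypothetical helper: rejects paths that resolve outside ``base_dir``
    and files whose extension is not ``.csv``.
    """
    base = Path(base_dir).resolve()
    candidate = (base / path).resolve()
    if not candidate.is_relative_to(base):
        raise ValueError(f"path escapes base directory: {path}")
    if candidate.suffix.lower() != ".csv":
        raise ValueError(f"not a CSV file: {path}")
    return candidate


def read_rows(path: str, base_dir: str = ".") -> list[dict[str, str]]:
    """Read a validated CSV file into a list of row dictionaries."""
    safe_path = validate_csv_path(path, base_dir)
    with safe_path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

Resolving the path before comparing it with the base directory means that inputs such as `../secret.csv` are rejected even though they look relative.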

Database Manager

from src.database import Database

db = Database("sqlite:///database.db")

# Safe query (parameterized)
df = db.query(
    "SELECT * FROM users WHERE name LIKE :pattern",
    {"pattern": "%John%"}
)

db.close()
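
To illustrate why the `:pattern` placeholder matters, here is a small sketch using the standard library `sqlite3` module rather than SQLAlchemy or the repository's `Database` class. The `users` table, its rows, and the `find_users` helper are invented for the example.

```python
import sqlite3
from contextlib import closing


def find_users(connection: sqlite3.Connection, pattern: str) -> list[str]:
    """Return user names matching a LIKE pattern passed as a bound parameter."""
    cursor = connection.execute(
        "SELECT name FROM users WHERE name LIKE :pattern ORDER BY id",
        {"pattern": pattern},
    )
    return [row[0] for row in cursor.fetchall()]


# In-memory database with made-up rows, closed automatically at the end.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO users (name) VALUES (:name)",
        [{"name": "John Smith"}, {"name": "Maria Silva"}],
    )
    matches = find_users(conn, "%John%")
    # The hostile string is compared as plain text, never executed as SQL.
    hostile = find_users(conn, "x' OR '1'='1")
    remaining = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]

print(matches, hostile, remaining)
```

Because the value travels separately from the SQL text, the injection attempt matches no rows and the table is left intact.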

Invoice OCR

from pathlib import Path
from src.invoice_ocr import InvoiceOCR

ocr = InvoiceOCR(dpi=300)
df = ocr.process_directory(
    directory=Path("pdfs"),
    output_excel=Path("result.xlsx")
)
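
The README does not describe how fields are pulled out of the recognized text. As a rough sketch of the kind of post-processing an invoice extractor might apply after OCR, the block below runs regular expressions over text that has already been recognized. It does not call Tesseract or OpenCV. The sample text, the field names, and the `extract_fields` and `parse_brl` helpers are all assumptions made for illustration.

```python
import re
from decimal import Decimal

# CNPJ (Brazilian company ID) in its usual printed form: 00.000.000/0000-00
CNPJ_PATTERN = re.compile(r"\b\d{2}\.\d{3}\.\d{3}/\d{4}-\d{2}\b")
# The word TOTAL followed by an amount such as 1.234,56
TOTAL_PATTERN = re.compile(r"TOTAL[^\d]*(\d{1,3}(?:\.\d{3})*,\d{2})", re.IGNORECASE)


def parse_brl(amount: str) -> Decimal:
    """Convert a Brazilian-formatted amount such as '1.234,56' to Decimal."""
    return Decimal(amount.replace(".", "").replace(",", "."))


def extract_fields(ocr_text: str) -> dict[str, object]:
    """Return the first CNPJ and total found in the text, or None for each."""
    cnpj = CNPJ_PATTERN.search(ocr_text)
    total = TOTAL_PATTERN.search(ocr_text)
    return {
        "cnpj": cnpj.group(0) if cnpj else None,
        "total": parse_brl(total.group(1)) if total else None,
    }


# Invented sample standing in for OCR output.
sample_text = """
EMPRESA EXEMPLO LTDA
CNPJ: 12.345.678/0001-90
VALOR TOTAL DA NOTA R$ 1.234,56
"""
fields = extract_fields(sample_text)
print(fields)
```

Using `Decimal` rather than `float` avoids rounding surprises when amounts are later summed or exported.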

Code Standards

  • Type hints on all functions
  • Google-style docstrings
  • Context managers for resource management
  • Parameterized queries (SQL injection safe)
  • Dataclasses for data structures
  • Structured logging
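
The standards listed above can be sketched together in one short example. `InvoiceRecord`, `timed_step`, and `summarize` are hypothetical names and are not taken from the repository. The key=value log messages are a simple stand-in for whatever structured logging setup the project actually uses.

```python
import logging
import time
from collections.abc import Iterator
from contextlib import contextmanager
from dataclasses import dataclass
from decimal import Decimal

logger = logging.getLogger("invoice_demo")


@dataclass(frozen=True)
class InvoiceRecord:
    """One extracted invoice (hypothetical fields)."""

    number: str
    issuer: str
    total: Decimal


@contextmanager
def timed_step(name: str) -> Iterator[None]:
    """Log the start and end of a processing step.

    Args:
        name: Label written to the log.

    Yields:
        None. The body of the ``with`` block runs while the step is open.
    """
    start = time.perf_counter()
    logger.info("step=%s status=started", name)
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        logger.info("step=%s status=finished elapsed=%.3fs", name, elapsed)


def summarize(records: list[InvoiceRecord]) -> Decimal:
    """Return the sum of all invoice totals.

    Args:
        records: Invoices to add up.

    Returns:
        The combined total as a Decimal.
    """
    return sum((record.total for record in records), Decimal("0"))


records = [
    InvoiceRecord(number="001", issuer="Empresa A", total=Decimal("100.50")),
    InvoiceRecord(number="002", issuer="Empresa B", total=Decimal("250.25")),
]
with timed_step("summarize"):
    grand_total = summarize(records)
print(grand_total)
```

The `finally` clause ensures the closing log line is written even if the step raises, which is the main reason to prefer a context manager over paired log calls.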

Improvements Made

| Original | Refactored | Benefit |
|---|---|---|
| Pandas.py | csv_handler.py | Type hints + error handling |
| Query.py | database.py | SQLAlchemy + security |
| dadosdog.py + extracdog.py | invoice_ocr.py | Consolidated into 1 file |
| script.py | invoice_analysis.py | Dataclasses + type hints |

Testing

# Run tests
pytest tests/

# With coverage
pytest --cov=src tests/

# Type checking
mypy src/

# Linting
flake8 src/
black --check src/

Contributing

  1. Fork the project
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

This project is licensed under the MIT License. See the LICENSE file for details.


Author

Isabel Cruz - @bellDataSc

LinkedIn: belcruz | Medium: @belgon | GitHub: bellDataSc
