
PyImport2Pkg

🐍 Reverse mapping from Python import statements to pip package names

Python 3.10+ | License: MIT | Latest Release

Language: English | 中文

Introduction

PyImport2Pkg solves a core problem in the AI-assisted coding era:

Given Python import statements in code, how do we quickly and accurately know which pip packages need to be installed?

Problem Statement

In traditional development, pip package names usually match import module names. However, in practice, many popular libraries have package name ≠ module name:

  • import cv2 → pip install opencv-python
  • from PIL import Image → pip install Pillow
  • import sklearn → pip install scikit-learn
  • import google.cloud.storage → pip install google-cloud-storage

When AI generates code with dozens of imports, manually looking up each mapping is time-consuming and error-prone. PyImport2Pkg automates this.


Why This Tool?

The Challenge

When using AI code generators (like GitHub Copilot, Claude, or ChatGPT), you often get code like:

import cv2
import numpy as np
from sklearn.model_selection import train_test_split
from google.cloud import storage
import requests

Question: Which packages do you need to pip install?

Without PyImport2Pkg

  • ❌ Manually Google each module name
  • ❌ Check PyPI documentation
  • ❌ Risk installing wrong packages
  • ❌ Takes 5-10 minutes for 10 imports

With PyImport2Pkg

$ pyimport2pkg analyze ./my_ai_generated_code

Dependencies:
  opencv-python
  numpy
  scikit-learn
  google-cloud-storage
  requests

Done in seconds! ✅


Core Features

🎯 Key Capabilities

  • Project Analysis: Recursively scan Python projects, extract all imports, and generate requirements.txt
  • Smart Mapping: Multi-tier priority system for accurate module→package mapping
  • Namespace Support: Correctly handle google.*, azure.*, and zope.* namespace packages
  • Optional Deps: Distinguish required vs. optional imports (try-except, platform-specific)
  • Version-Aware: Auto-detect the target Python version and handle backport packages
  • High-Performance DB: Smart incremental updates, true parallel processing, batch writes
  • Interrupt Recovery: Resume an interrupted build from a checkpoint without data loss

Mapping Priority

PyImport2Pkg uses a multi-tier priority system:

  1. Namespace packages - When submodules are detected (e.g., google.cloud.storage → google-cloud-storage)
  2. Hardcoded mappings - Known special cases (e.g., cv2 → opencv-python)
  3. PyPI database - From top_level.txt in wheel files
  4. Smart guess - Assume module name equals package name
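
As a rough illustration of how these tiers chain together, the sketch below walks the same four steps. It is illustrative only; the names NAMESPACE_PREFIXES, HARDCODED, and query_pypi_db are hypothetical placeholders, not the package's actual internals.

# Illustrative sketch of the multi-tier lookup order (not the actual implementation).
NAMESPACE_PREFIXES = {"google.cloud.storage": "google-cloud-storage"}
HARDCODED = {"cv2": "opencv-python", "PIL": "Pillow", "sklearn": "scikit-learn"}

def query_pypi_db(module: str) -> str | None:
    """Placeholder for a lookup in the local PyPI mapping database."""
    return None

def resolve(module: str) -> str:
    parts = module.split(".")
    # 1. Namespace packages: try the longest dotted prefix first.
    for i in range(len(parts), 0, -1):
        prefix = ".".join(parts[:i])
        if prefix in NAMESPACE_PREFIXES:
            return NAMESPACE_PREFIXES[prefix]
    top = parts[0]
    # 2. Hardcoded special cases.
    if top in HARDCODED:
        return HARDCODED[top]
    # 3. PyPI database built from top_level.txt metadata.
    if (pkg := query_pypi_db(top)) is not None:
        return pkg
    # 4. Smart guess: assume the module name equals the package name.
    return top

print(resolve("google.cloud.storage"))  # google-cloud-storage
print(resolve("cv2"))                   # opencv-python
print(resolve("requests"))              # requests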

Installation

Requirements

  • Python 3.10+
  • Minimal dependencies (only httpx>=0.25.0)

Install via pip

pip install pyimport2pkg

Install in development mode

git clone https://github.com/buptanswer/pyimport2pkg.git
cd pyimport2pkg
pip install -e ".[dev]"

Verify Installation

pyimport2pkg --version
# pyimport2pkg 1.0.0

Quick Start

Analyze a Project

# Analyze current directory
pyimport2pkg analyze .

# Output:
# Analyzing: .
# Found imports from 24 files
#
# Dependencies:
#   numpy
#   pandas
#   requests
#   scikit-learn
#   matplotlib

Query a Single Module

pyimport2pkg query cv2

# Output:
# Module: cv2
# Source: hardcoded
# Candidates:
#   1. opencv-python (recommended)
#   2. opencv-contrib-python
#   3. opencv-python-headless

Save Results

# Save as requirements.txt
pyimport2pkg analyze . -o requirements.txt

# Save as JSON
pyimport2pkg analyze . -o dependencies.json -f json

Commands

analyze - Analyze Project

Scan a Python project for imports and identify the required packages.

pyimport2pkg analyze <path> [options]

Options:

  • -o, --output: Output file path (default: stdout)
  • -f, --format: Output format: requirements | json | simple (default: requirements)
  • --python-version: Target Python version (default: current interpreter)

Examples:

# Basic analysis
pyimport2pkg analyze /path/to/project

# Specify target Python version
pyimport2pkg analyze . --python-version 3.11

# Save as JSON
pyimport2pkg analyze . -o deps.json -f json

# Simple package list
pyimport2pkg analyze . -f simple

query - Query Module Mapping

Look up which pip package provides a specific module.

pyimport2pkg query <module_name>

Examples:

pyimport2pkg query numpy       # → numpy
pyimport2pkg query cv2         # → opencv-python (+ alternatives)
pyimport2pkg query PIL         # → Pillow
pyimport2pkg query google.cloud.storage  # → google-cloud-storage

build-db - Build Mapping Database

Build the PyPI package mapping database. This downloads metadata for the top PyPI packages and builds the module→package mapping.

pyimport2pkg build-db [options]

Options:

  • --max-packages: Target number of PyPI packages (default: 5000)
  • --concurrency: Number of parallel workers (default: 50)
  • --resume: Resume an interrupted build (flag)
  • --retry-failed: Retry failed packages only (flag)
  • --rebuild: Force rebuild, deleting the old database (flag)
  • --db-path: Custom database path (default: data/mapping.db)

Examples:

# Build database with top 5000 packages
pyimport2pkg build-db --max-packages 5000

# Resume interrupted build
pyimport2pkg build-db --resume

# Retry only failed packages
pyimport2pkg build-db --retry-failed

# Expand existing database
pyimport2pkg build-db --max-packages 10000

# Force rebuild
pyimport2pkg build-db --rebuild --max-packages 5000

Features:

  • ✅ Smart incremental updates (no reprocessing)
  • ✅ Interrupt recovery with progress tracking
  • ✅ Parallel processing (50 workers by default)
  • ✅ Batch database writes (see the sketch below)
  • ✅ Rate limit detection & auto-recovery
  • ✅ Memory-optimized chunked processing
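
The batch-write idea can be pictured with plain sqlite3: buffer mappings in memory and commit them one chunk at a time instead of row by row. This is a generic sketch assuming an SQLite backing store and a hypothetical module_map table and file name; it is not the project's actual schema or code.

import sqlite3

# Generic illustration of batched writes: one transaction per chunk of rows,
# rather than one commit per row. Table name, schema, and file are hypothetical.
rows = [("cv2", "opencv-python"), ("PIL", "Pillow"), ("sklearn", "scikit-learn")]

conn = sqlite3.connect("example_mapping.db")
with conn:
    conn.execute("CREATE TABLE IF NOT EXISTS module_map (module TEXT, package TEXT)")

BATCH_SIZE = 100
for start in range(0, len(rows), BATCH_SIZE):
    chunk = rows[start:start + BATCH_SIZE]
    with conn:  # commits the whole chunk as a single transaction
        conn.executemany("INSERT INTO module_map (module, package) VALUES (?, ?)", chunk)
conn.close()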

build-status - Check Build Status

View current or last build status.

pyimport2pkg build-status

# Output:
# Build Status: completed
# Total: 5000
# Processed: 5000
# Failed: 8
# Success Rate: 99.8%
# Last Updated: 2025-12-06 10:30:45

db-info - Database Information

Show database statistics.

pyimport2pkg db-info

# Output:
# Database Information
# ===================
# Database: data/mapping.db
# Packages: 5000
# Modules: 25000
# Last Updated: 2025-12-06 08:00:00

Advanced Features

v0.3.0 Highlights

1. Smart Incremental Updates

Extend your database without reprocessing:

# Database has 500 packages, expand to 1000
pyimport2pkg build-db --max-packages 1000
# Automatically processes only 500 new packages

2. Interrupt & Resume

Resume an interrupted build from where it left off:

# Start build
pyimport2pkg build-db --max-packages 5000

# Later, resume
pyimport2pkg build-db --resume

3. Failed Package Retry

Retry only failed packages:

# First run: 860 failed
pyimport2pkg build-db --retry-failed

# Second run: only remaining failures
pyimport2pkg build-db --retry-failed

4. Performance Improvements

  • 10-50x faster database writes (batch processing)
  • 50x parallel concurrency (vs 20x in v0.2.0)
  • Memory-optimized chunked processing for 15000+ packages
  • Batch progress saves (every 100 packages)

5. Rate Limit Detection

Automatic PyPI rate limit handling:

Detected 20 consecutive failures - possible rate limiting.
Pausing 30 seconds before retry (pause 1/5)...
Resuming...
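
The pattern behind this is simply counting consecutive failures and backing off once a threshold is reached. The sketch below is a generic illustration of that pattern, not the project's code; fetch_metadata is a hypothetical stand-in, and the constants mirror the log output above rather than the tool's actual settings.

import time

CONSECUTIVE_FAILURE_LIMIT = 20
PAUSE_SECONDS = 30
MAX_PAUSES = 5

def fetch_metadata(name: str) -> dict:
    """Hypothetical stand-in for a PyPI metadata request."""
    raise NotImplementedError

def fetch_all(package_names: list[str]) -> None:
    consecutive_failures = 0
    pauses = 0
    for name in package_names:
        try:
            fetch_metadata(name)
            consecutive_failures = 0  # any success resets the counter
        except Exception:
            consecutive_failures += 1
            if consecutive_failures >= CONSECUTIVE_FAILURE_LIMIT and pauses < MAX_PAUSES:
                pauses += 1
                print(f"Possible rate limiting, pausing {PAUSE_SECONDS}s (pause {pauses}/{MAX_PAUSES})...")
                time.sleep(PAUSE_SECONDS)
                consecutive_failures = 0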

6. Graceful Interruption (Ctrl+C)

^C
Saving progress, please wait... (Ctrl+C again to force quit)

Build interrupted. Processed 2500/5000 packages.
Use --resume to continue.
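
Behavior like this is usually implemented with a SIGINT handler: the first Ctrl+C sets a flag so the loop can stop cleanly and persist progress, and a second Ctrl+C exits immediately. The sketch below is a generic illustration of that pattern, not the project's code; save_progress is a hypothetical stand-in.

import signal
import sys

stop_requested = False

def handle_sigint(signum, frame):
    global stop_requested
    if stop_requested:       # second Ctrl+C: force quit
        sys.exit(1)
    stop_requested = True    # first Ctrl+C: finish the current item, then save
    print("Saving progress, please wait... (Ctrl+C again to force quit)")

signal.signal(signal.SIGINT, handle_sigint)

def save_progress(done: int, total: int) -> None:
    """Hypothetical stand-in for persisting build progress to disk."""
    print(f"Build interrupted. Processed {done}/{total} packages.")

def build(packages: list[str]) -> None:
    for i, pkg in enumerate(packages, start=1):
        if stop_requested:
            save_progress(i - 1, len(packages))
            return
        ...  # process pkg here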

Python API

Use PyImport2Pkg programmatically:

Basic Usage

from pyimport2pkg import Scanner, Parser, Filter, Mapper, Exporter
from pathlib import Path

# 1. Scan project
scanner = Scanner()
files = scanner.scan(Path("./my_project"))

# 2. Parse imports
parser = Parser()
imports = []
for file_path in files:
    imports.extend(parser.parse_file(file_path))

# 3. Filter stdlib & local modules
filter = Filter(project_root=Path("./my_project"))
third_party, _ = filter.filter_imports(imports)

# 4. Map to packages
mapper = Mapper()
results = mapper.map_imports(third_party)

# 5. Export results
exporter = Exporter()
exporter.export_requirements_txt(results, output=Path("requirements.txt"))

Query Single Module

from pyimport2pkg import Mapper, ImportInfo

mapper = Mapper()
imp = ImportInfo.from_module_name("cv2")
result = mapper.map_import(imp)
for candidate in result.candidates:
    print(f"{candidate.package_name}: {candidate.download_count} downloads")

Check Build Status

from pyimport2pkg.database import get_build_progress

progress = get_build_progress()
status = progress.get_status()
print(f"Processed: {status['processed']}/{status['total']}")
print(f"Failed: {status['failed']}")
print(f"Success Rate: {status['success_rate']:.1%}")

Architecture

Pipeline Design

Python Project
    ↓
Scanner (scan for .py files)
    ↓
Parser (extract imports via AST)
    ↓
Filter (remove stdlib, local modules)
    ↓
Mapper (map to pip packages)
    ↓
Resolver (handle conflicts)
    ↓
Exporter (generate output)
    ↓
requirements.txt / JSON / list

Core Modules

  • scanner.py: Recursively find Python files
  • parser.py: Extract imports with context (AST-based; see the sketch below)
  • filter.py: Filter stdlib, local modules, and backports
  • mapper.py: Multi-tier package mapping
  • resolver.py: Handle one-to-many conflicts
  • exporter.py: Multi-format output
  • database.py: PyPI mapping database
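
To make the parse and filter stages concrete, the simplified sketch below extracts top-level imported module names with the standard ast module and drops standard-library modules via sys.stdlib_module_names (available on Python 3.10+, which the tool requires). It is an illustration only, not the actual parser.py or filter.py.

import ast
import sys
from pathlib import Path

def extract_imports(path: Path) -> set[str]:
    """Collect imported module names from a Python source file."""
    tree = ast.parse(path.read_text(encoding="utf-8"))
    modules: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module)  # skip relative imports (node.level > 0)
    return modules

def drop_stdlib(modules: set[str]) -> set[str]:
    """Keep only modules whose top-level name is not in the standard library."""
    return {m for m in modules if m.split(".")[0] not in sys.stdlib_module_names}

# Usage, assuming example.py exists in the current directory:
imports = extract_imports(Path("example.py"))
print(sorted(drop_stdlib(imports)))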

Performance

Analysis Speed

  • Small (<100 files, ~50 typical): < 1 s
  • Medium (100-1000 files, ~500 typical): 1-5 s
  • Large (1000+ files, ~2000 typical): 5-30 s

Database Build

  • 5000 packages: 10-20 min, ~200 MB memory
  • 10000 packages: 20-40 min, ~400 MB memory
  • 15000 packages: 40-80 min, ~600 MB memory

FAQ

Q: How do I exclude certain directories?

A: Scanner auto-excludes: .git, .venv, venv, env, __pycache__, etc.

For custom exclusions, use Python API:

scanner = Scanner(exclude_dirs=["tests", "docs"])

Q: Does it support relative imports?

A: Yes. Relative imports are marked as local modules and filtered out.
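
For example, inside a package module:

from .utils import helper   # relative import: treated as a local module, not a dependency
import requests             # absolute third-party import: kept as a dependency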

Q: What about conditional imports?

A: Conditional imports (inside if/try blocks) are marked as optional=True.
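
For instance, an import wrapped in try/except like the one below would be reported as optional rather than as a hard requirement:

# ujson is optional here: the code falls back to the stdlib json module if it is missing.
try:
    import ujson as json
except ImportError:
    import json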

Q: How long does database build take?

A: Depends on package count and network:

  • 5000 packages: ~10-20 min
  • 10000 packages: ~20-40 min
  • Supports pause/resume

Q: Database not found error?

A: Either:

  1. Build database: pyimport2pkg build-db
  2. Or use online mode without local database

Q: Missing some imports?

Possible reasons:

  1. Package not in top 5000 PyPI
  2. Package metadata incomplete
  3. Non-standard package structure

Troubleshooting

No Python found

# Use explicit Python
python -m pyimport2pkg analyze .

Permission denied

# Ensure read access to project directory
chmod -R +r ./my_project

Out of memory

# Build database in chunks
pyimport2pkg build-db --max-packages 5000  # start small
pyimport2pkg build-db --max-packages 10000 # expand later

Contributing

Report Bugs

File issues at: https://github.com/buptanswer/pyimport2pkg/issues

Include:

  • Python version
  • PyImport2Pkg version
  • Full error traceback
  • Minimal reproduction example

Contribute Code

# Fork repository
git clone https://github.com/YOUR_USERNAME/pyimport2pkg.git
cd pyimport2pkg

# Create feature branch
git checkout -b feature/your-feature

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
pytest tests/ -v

# Make changes & commit
git add .
git commit -m "feat: your feature description"

# Push & create pull request
git push origin feature/your-feature

Development

Setup

pip install -e ".[dev]"

Run Tests

pytest tests/ -v
pytest tests/ --cov=pyimport2pkg  # with coverage

Test Specific Module

pytest tests/test_parser.py -v
pytest tests/test_parser.py::TestParser::test_simple_import -v

License

MIT License - See LICENSE for details


Changelog

See CHANGELOG for detailed version history.

  • v1.0.0 - First stable release (Dec 2025)
  • v0.3.0 - Performance & reliability improvements
  • v0.2.0 - Initial feature release
  • v0.1.0 - Beta version


Acknowledgments

Built for the AI-assisted coding era. Special thanks to users who provided feedback and testing!


Made with ❤️ for developers using AI code generators

PyImport2Pkg v1.0.0 - December 2025