AI-Powered Documentation Generator with Continuous Learning
Moxi is an intelligent documentation generator that analyzes GitHub repositories and produces high-quality, comprehensive documentation with a fine-tuned Large Language Model. It implements a complete ML pipeline, from data collection to model deployment, and automatically updates documentation on every code push.
Moxi automates the tedious process of writing documentation by:
- Analyzing your codebase structure and key files
- Understanding your project's architecture and purpose
- Generating professional README files, architecture docs, and API documentation
- Auto-updating documentation when code changes
- Custom Fine-Tuned Model - Train your own Llama-3.1-8B specialized for documentation
- Automated Dataset Generation - Build 10,000+ training samples from high-quality GitHub repositories
- Intelligent Repository Analysis - Parse and understand project structures
- Multi-Format Documentation - Generate README, ARCHITECTURE, API docs
- A/B Testing - Compare your model against GPT-4
- Docker Ready - Easy deployment with Docker Compose
- CLI Tool - Simple command-line interface
- Auto-Update on Push - Automatically detect code changes and update documentation
┌─────────────────────────────┐
│        MOXI PIPELINE        │
└─────────────────────────────┘

1. REPO ANALYZER
   ├── GitHub Crawler
   ├── Structure Parser
   └── File Analyzer

2. DATASET GENERATOR
   ├── GitHub Trending Crawler
   ├── Instruction Generator (GPT-4)
   └── Quality Control

3. TRAINING PIPELINE
   ├── SFT Trainer (LoRA/QLoRA)
   ├── Experiment Tracking (W&B)
   └── Model Evaluation

4. DOC GENERATOR
   ├── Custom Model Inference
   ├── Format & Validate
   └── Export Markdown

5. CLI & EVALUATION
   ├── Command-Line Interface
   └── A/B Testing Framework
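
In code, these stages correspond to the packages under src/ (see the project structure below); the sketch that follows shows roughly how they might chain together for a single run. Class and method names here are illustrative assumptions, not Moxi's actual API.

# Illustrative only: class and method names are assumptions, not the actual API.
from repo_analyzer import RepoAnalyzer      # stage 1: crawl and parse the repository
from doc_generator import DocGenerator      # stage 4: run the fine-tuned model

def generate_docs(repo_url: str, output_dir: str = "docs") -> None:
    # Stage 1: analyze structure, key files, and dependencies
    analysis = RepoAnalyzer().analyze(repo_url)

    # Stages 2-3 happen offline: the dataset generator and training pipeline
    # produce the fine-tuned model that DocGenerator loads here.
    generator = DocGenerator(model_path="models/moxi-llama-3.1-8b")

    # Stage 4: generate, validate, and export Markdown documents
    for doc_type in ("README", "ARCHITECTURE", "API"):
        markdown = generator.generate(analysis, doc_type=doc_type)
        with open(f"{output_dir}/{doc_type}.md", "w", encoding="utf-8") as fh:
            fh.write(markdown)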
- Python 3.11+
- OpenAI API Key (for dataset generation)
- Hugging Face Token (for model training)
- GitHub Token (for repo crawling)
# Clone the repository
git clone https://github.com/LC0229/moxi.git
cd moxi
# Create virtual environment
python -m venv moxi
source moxi/bin/activate # On Windows: moxi\Scripts\activate
# Install dependencies
make install
# Copy and configure environment variables
cp .env.example .env
# Edit .env with your API keys

# Test configuration
make test-config
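
make test-config verifies that the required keys are present; below is a minimal sketch of how they might be loaded at runtime. The variable names are assumptions — check .env.example for the names the project actually uses.

# Minimal sketch; variable names are assumptions, see .env.example for the real ones.
import os
from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # copies key=value pairs from .env into os.environ

required = ("OPENAI_API_KEY", "HF_TOKEN", "GITHUB_TOKEN")
missing = [name for name in required if not os.getenv(name)]
if missing:
    raise SystemExit(f"Missing keys in .env: {', '.join(missing)}")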
# Analyze a repository
make local-analyze-repo REPO=https://github.com/pytorch/pytorch
# Generate documentation (using pre-trained model)
make local-generate-docs REPO=https://github.com/pytorch/pytorch

# Crawl 100+ high-quality GitHub repositories
make crawl-github-repos
# Generate 10,000 training samples using GPT-4
make generate-training-dataset
# Validate dataset quality
make validate-dataset

Cost: ~$20-30 in OpenAI API credits
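
Each training sample pairs a repository summary with documentation written by GPT-4. A rough sketch of what one generation call could look like is shown below; the prompt wording and helper names are assumptions, not the project's actual code.

# Rough sketch of one dataset-generation call; prompt wording is an assumption.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

def make_sample(repo_summary: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You write concise, professional README files."},
            {"role": "user",
             "content": f"Write a README for this project:\n\n{repo_summary}"},
        ],
    )
    return {
        "instruction": "Generate a README for the following repository.",
        "input": repo_summary,
        "output": response.choices[0].message.content,
    }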
# Download base Llama-3.1-8B model
make download-base-model
# Train with Supervised Fine-Tuning (SFT)
make train-sft
# Evaluate model performance
make evaluate-model

Time: 4-8 hours on GPU (AWS g5.2xlarge recommended)
# Use your trained model
make local-generate-docs REPO=https://github.com/user/repo
# Compare with GPT-4 baseline
make compare-models
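
A minimal sketch of what the A/B comparison boils down to is shown below; the function names and judging scheme are illustrative assumptions, not the framework's actual interface.

# Illustrative A/B comparison loop; names and judging scheme are assumptions.
def compare_models(repos, custom_generate, gpt4_generate, judge):
    """Generate docs for each repo with both models and tally which one wins."""
    wins = {"custom": 0, "gpt4": 0, "tie": 0}
    for repo in repos:
        custom_doc = custom_generate(repo)   # fine-tuned Llama-3.1-8B output
        gpt4_doc = gpt4_generate(repo)       # GPT-4 baseline output
        wins[judge(custom_doc, gpt4_doc)] += 1  # judge returns "custom", "gpt4", or "tie"
    return wins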
moxi/
├── src/
│   ├── core/                 # Configuration, logging, utilities
│   ├── repo_analyzer/        # GitHub crawler & parser
│   ├── dataset_generator/    # Training data creation
│   ├── training_pipeline/    # Model training & evaluation
│   ├── doc_generator/        # Documentation generation
│   └── cli/                  # Command-line interface
├── tests/
│   ├── unit/
│   └── integration/
├── data/                     # Downloaded repositories
├── training_data/            # Generated datasets
├── models/                   # Trained models
└── Makefile                  # All commands
# Run all tests
make test
# Run specific test suites
make test-unit
make test-integration

# Lint code
make lint
# Format code
make format

- Source: GitHub repositories (100+ stars)
- Size: 10,000+ instruction-output pairs
- Format: JSON with instruction, input, and output fields (see the sample record below)
- Quality: Validated by GPT-4 + human review
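
As noted above, each record follows the instruction/input/output schema. A hand-written illustration of one sample is shown below; the content is made up, not taken from the actual dataset.

# Hand-written illustration of one record; the content is made up.
import json

record = {
    "instruction": "Generate a README for the following repository.",
    "input": (
        "Project: fastapi-todo (Python)\n"
        "Key files: main.py, models.py, requirements.txt\n"
        "Description: REST API for managing todo items with FastAPI and SQLite."
    ),
    "output": "# fastapi-todo\n\nA small REST API for managing todo items...",
}
print(json.dumps(record, indent=2))  # one JSON object per training sample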
- Base Model: Meta Llama-3.1-8B-Instruct
- Method: Supervised Fine-Tuning (SFT) with LoRA
- Hyperparameters (see the configuration sketch below):
- Learning Rate: 2e-4
- Batch Size: 4 (per device)
- Gradient Accumulation: 4 steps
- Epochs: 3
- Max Sequence Length: 2048 tokens
- Training: GPU with 16GB+ VRAM (NVIDIA A10G, V100, or better)
- Inference: GPU with 8GB+ VRAM or CPU (slower)
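
The hyperparameters listed above map onto a TRL + PEFT training setup roughly like the sketch below. The file names, LoRA rank/alpha, and prompt template are assumptions, and exact argument names vary a little between trl versions.

# Sketch of the SFT step with TRL + PEFT. Hyperparameter values follow the list
# above; file names, LoRA rank/alpha, and the prompt template are assumptions.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

def to_text(example):
    # Render one instruction/input/output record into a single training string.
    return {
        "text": f"### Instruction:\n{example['instruction']}\n\n"
                f"### Input:\n{example['input']}\n\n"
                f"### Response:\n{example['output']}"
    }

dataset = load_dataset("json", data_files="training_data/train.json", split="train")
dataset = dataset.map(to_text)

trainer = SFTTrainer(
    model="meta-llama/Llama-3.1-8B-Instruct",
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                           task_type="CAUSAL_LM"),  # typical LoRA settings
    args=SFTConfig(
        output_dir="models/moxi-sft",
        dataset_text_field="text",
        max_seq_length=2048,
        learning_rate=2e-4,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        report_to="wandb",                           # experiment tracking (W&B)
    ),
)
trainer.train()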
# Build containers
make docker-build
# Start services
make docker-up
# View logs
make docker-logs
# Stop services
make docker-down

# Generate documentation
python -m cli.main generate https://github.com/user/repo
# Analyze repository structure
python -m cli.main analyze https://github.com/user/repo
# Evaluate model performance
python -m cli.main evaluate --compare custom vs gpt4

- Phase 0: Project Setup
- Phase 1: Core Infrastructure
- Phase 2: Repository Analyzer
- Phase 3: Dataset Generator
- Phase 4: Training Pipeline
- Phase 5: Documentation Generator
- Phase 6: CLI & Evaluation
- Phase 7: Docker & Deployment
- Phase 8: Documentation & Demo
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Built with Llama-3.1-8B
- Powered by Hugging Face Transformers, PEFT, and TRL
- Author: Shengrui Chen
- Email: chenleon572@gmail.com
- GitHub: @LC0229