
Doc2Vec Code Embeddings Pipeline

A GitHub Actions-powered pipeline for training Doc2Vec models on source code. It supports both training a base model on popular repositories and fine-tuning that model on a specific codebase.

Features

  • Two-Stage Training: Train a base model on popular repos, then fine-tune on your specific codebase
  • Base Model Training: Build robust representations from top GitHub repositories
  • Fine-Tuning Pipeline: Adapt pre-trained models to your specific repository
  • Popular Repos Discovery: Automatically fetch the most popular repositories by language
  • Language Agnostic: Support for any programming language (Java, Python, JavaScript, etc.)
  • Smart Embeddings: Generate embeddings using models trained on relevant codebases

Quick Start

Two-Stage Approach

Step 1: Train Base Model on Popular Repositories

  1. Go to Actions → "Train Base Doc2Vec Model"
  2. Configure:
    • Language: Select the programming language (java, python, etc.)
    • Repository count: Number of top repos to use (default: 100, max: 1000)
    • File extensions: Extensions to analyze (e.g., .java)
    • Vector size: Embedding dimensions (default: 200)
  3. Run workflow and wait for completion
  4. Note the artifact name (e.g., base-model-java)
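
Under the hood, this step amounts to fitting a gensim Doc2Vec model over the tokenized source files of the cloned repositories. A minimal sketch of that idea (the whitespace tokenizer and "cloned_repos" path are illustrative assumptions; the actual logic lives in src/train_base_model.py):

# Minimal sketch of base-model training with gensim.
# The naive tokenizer and "cloned_repos" path are assumptions;
# see src/train_base_model.py for the real implementation.
from pathlib import Path
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def load_corpus(root, ext=".java"):
    # Tag each source file with its path so embeddings map back to files.
    for path in Path(root).rglob(f"*{ext}"):
        tokens = path.read_text(errors="ignore").split()  # naive tokenizer
        yield TaggedDocument(words=tokens, tags=[str(path)])

corpus = list(load_corpus("cloned_repos"))
model = Doc2Vec(vector_size=200, window=5, min_count=3, epochs=20, dm=1)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
model.save("base_model.d2v")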

Step 2: Fine-tune on Your Repository

  1. Go to Actions → "Fine-tune Doc2Vec Model"
  2. Configure:
    • Repository URL: Your target repository
    • Repository version (optional): Specific commit SHA, tag, or branch to analyze
      • Leave empty for latest code
      • Examples: v1.0.0, main, abc123def
    • Run ID: Copy the run ID from the base model workflow URL
      • Example: If base model URL is .../actions/runs/1234567890, use 1234567890
    • Base model artifact: Name from Step 1 (e.g., base-model-java-100repos)
    • File extensions: Extensions to analyze
    • Fine-tune epochs: Training iterations (default: 10)
    • Update vocabulary: Whether to add new words from the target repo (default: true)
  3. Run workflow to generate embeddings for your repository
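
Conceptually, fine-tuning loads the saved base model, optionally extends its vocabulary with words from the target repository, and continues training. A rough sketch under those assumptions (the real script is src/finetune_and_embed.py):

# Sketch of fine-tuning a saved base model on a target repository.
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("base_model.d2v")
target_docs = list(load_corpus("your_repo"))  # load_corpus as sketched above

# update=True adds new words from the target repo to the vocabulary,
# corresponding to the "Update vocabulary" workflow input.
model.build_vocab(target_docs, update=True)
model.train(target_docs, total_examples=len(target_docs), epochs=10)
model.save("your_repo_finetuned.d2v")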

Local Usage

# Clone the repo
git clone https://github.com/yourusername/doc2vec-experiments.git
cd doc2vec-experiments

# Install dependencies
pip install -r requirements.txt

# Step 1: Get popular repositories (max 1000 due to GitHub API limits)
python src/get_popular_repos.py \
    --language java \
    --count 100 \
    --min-stars 1000 \
    --output popular_repos.txt

# Step 2: Train base model on popular repos
python src/train_base_model.py \
    --repos popular_repos.txt \
    --ext .java \
    --output base_model.d2v \
    --vector-size 200 \
    --epochs 20

# Step 3: Fine-tune on your repository
python src/finetune_and_embed.py \
    --repo https://github.com/your-org/your-repo.git \
    --base-model base_model.d2v \
    --ext .java \
    --output your_repo \
    --epochs 10 \
    --version v1.0.0  # Optional: specific version to analyze
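
After fine-tuning, the saved model can be loaded and queried directly with the standard gensim 4.x API. A small assumed example (the snippet and whitespace tokenization are illustrative; match the pipeline's real tokenizer for consistent results):

# Load the fine-tuned model and query it.
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("your_repo_finetuned.d2v")

# Infer an embedding for an unseen snippet (naive whitespace tokens).
vec = model.infer_vector("public static void main ( String args )".split())

# Files most similar to the snippet, ranked by cosine similarity.
print(model.dv.most_similar([vec], topn=5))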

Workflows Available

1. Train Base Model (train-base-model.yaml)

  • Trains a Doc2Vec model on multiple popular repositories
  • Creates a robust base representation for code
  • Outputs: base model, metadata, sample embeddings

2. Fine-tune Model (finetune-model.yaml)

  • Takes a pre-trained base model
  • Fine-tunes it on your specific repository
  • Generates embeddings optimized for your codebase
  • Outputs: embeddings CSV, fine-tuned model, metadata
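
Both workflows are triggered from the Actions UI, but they can also be dispatched through the GitHub REST API. A hedged sketch (the token placeholder and input names are assumptions; check the workflow files for the exact input keys):

# Trigger the base-model workflow via the REST API instead of the UI.
import requests

resp = requests.post(
    "https://api.github.com/repos/ioanachelaru/doc2vec-experiments"
    "/actions/workflows/train-base-model.yaml/dispatches",
    headers={"Authorization": "token <YOUR_TOKEN>"},
    json={"ref": "main", "inputs": {"language": "java"}},  # input names assumed
)
resp.raise_for_status()  # 204 No Content on success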

Output Files

Base Model Training

  • base_model_{language}_{count}repos.d2v: Trained Doc2Vec model
  • base_model_{language}_{count}repos.json: Training metadata (repos used, parameters)
  • base_model_{language}_{count}repos.sample.csv: Sample embeddings for validation
  • popular_repos.txt: List of repository URLs used for training

Fine-tuning

  • {repo_name}_embeddings.csv: Document vectors for each source file
    • Column 1: Repository-relative file path
    • Columns 2-201: components of the embedding vector (200 dimensions by default, or as configured)
  • {repo_name}_finetuned.d2v: Fine-tuned Doc2Vec model
  • {repo_name}_metadata.json: Fine-tuning metadata including:
    • Base model used
    • Repository URL and version (if specified)
    • Total documents processed
    • Vector size and vocabulary size
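
The embeddings file is plain CSV, so it loads directly into pandas for downstream analysis. For example, a sketch of pairwise file similarity (assuming the first column holds the file path, as described above):

# Load {repo_name}_embeddings.csv and compute file-to-file similarity.
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

df = pd.read_csv("your_repo_embeddings.csv", index_col=0)  # path column as index
sims = cosine_similarity(df.values)                        # pairwise cosine matrix
sim_df = pd.DataFrame(sims, index=df.index, columns=df.index)
print(sim_df.iloc[:5, :5])                                 # peek at a corner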

Doc2Vec Configuration

Default training parameters:

  • Vector size: 200 dimensions
  • Window size: 5 tokens
  • Min count: 3 (minimum word frequency)
  • Training epochs: 20
  • Model: PV-DM (Distributed Memory)
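
These defaults map one-to-one onto gensim's Doc2Vec constructor (dm=1 is gensim's flag for the Distributed Memory architecture):

# The defaults above, expressed as a gensim constructor call.
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(
    vector_size=200,  # embedding dimensions
    window=5,         # context window, in tokens
    min_count=3,      # ignore words seen fewer than 3 times
    epochs=20,        # passes over the training corpus
    dm=1,             # PV-DM (Distributed Memory)
)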

Use Cases

  • Code similarity: Find similar code files across projects
  • Code search: Semantic search through codebases
  • Code classification: Classify code by functionality or quality
  • Technical debt analysis: Identify problematic code patterns
  • Cross-project analysis: Compare coding styles across repositories

Project Structure

doc2vec-experiments/
├── .github/
│   └── workflows/
│       ├── train-base-model.yaml    # Train base model on popular repos
│       └── finetune-model.yaml      # Fine-tune model on specific repo
├── src/
│   ├── train_base_model.py          # Train single model on multiple repos
│   ├── finetune_and_embed.py        # Fine-tune model and generate embeddings
│   ├── get_popular_repos.py         # Fetch popular repos from GitHub API
│   └── utils.py                     # Shared utilities for all scripts
├── requirements.txt                 # Python dependencies
└── README.md

Requirements

  • Python 3.10+
  • Dependencies: gensim, pandas, tqdm, scikit-learn, requests
  • Git (for repository cloning)

Limitations

  • GitHub API Rate Limits:
    • Maximum 1000 repositories per search query due to GitHub Search API limitations
    • To train on more repos, lower the --min-stars threshold or split queries across creation-date ranges (see the sketch after this list)
  • Processing Time:
    • Training on hundreds of repositories can take several hours
    • GitHub Actions enforces a 6-hour execution limit per job
  • Memory Usage:
    • Large models with many repositories require significant RAM
    • The workflow automatically sets up swap space for large training jobs
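
The 1000-result cap is a hard limit of GitHub's Search API (at most 10 pages of 100 results per query). One common workaround, sketched below, is to split the search across creation-date windows; the token handling and date ranges are illustrative assumptions, not the exact logic of src/get_popular_repos.py:

# Page through the GitHub Search API, splitting by creation date
# to collect more than 1000 repositories overall.
import requests

def search_repos(language, min_stars, created_range, token=None):
    headers = {"Authorization": f"token {token}"} if token else {}
    urls = []
    for page in range(1, 11):  # 10 pages x 100 results = the per-query cap
        resp = requests.get(
            "https://api.github.com/search/repositories",
            params={
                "q": f"language:{language} stars:>={min_stars} created:{created_range}",
                "sort": "stars",
                "order": "desc",
                "per_page": 100,
                "page": page,
            },
            headers=headers,
        )
        resp.raise_for_status()
        items = resp.json()["items"]
        urls += [item["clone_url"] for item in items]
        if len(items) < 100:  # no more results in this window
            break
    return urls

# Two date windows sidestep the per-query cap.
repos = search_repos("java", 1000, "2015-01-01..2019-12-31")
repos += search_repos("java", 1000, "2020-01-01..*")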

Contributing

Contributions are welcome! Feel free to:

  • Add support for more languages
  • Improve tokenization strategies
  • Add visualization tools
  • Enhance the training pipeline

License

MIT License - see LICENSE file for details
