A GitHub Actions-powered pipeline for training Doc2Vec models on source code. Features both base model training on popular repositories and fine-tuning capabilities for specific codebases.
- Two-Stage Training: Train a base model on popular repos, then fine-tune on your specific codebase
- Base Model Training: Build robust representations from top GitHub repositories
- Fine-Tuning Pipeline: Adapt pre-trained models to your specific repository
- Popular Repos Discovery: Automatically fetch the most popular repositories by language
- Language Agnostic: Support for any programming language (Java, Python, JavaScript, etc.)
- Smart Embeddings: Generate embeddings using models trained on relevant codebases
- Go to Actions → "Train Base Doc2Vec Model"
- Configure:
- Language: Select the programming language (java, python, etc.)
- Repository count: Number of top repos to use (default: 100, max: 1000)
- File extensions: Extensions to analyze (e.g., `.java`)
- Vector size: Embedding dimensions (default: 200)
- Run workflow and wait for completion
- Note the artifact name (e.g., `base-model-java`)
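Both workflows can also be triggered without the UI through the GitHub REST API's workflow-dispatch endpoint. A minimal sketch, assuming the workflow file name from the project structure below; the input field names are assumptions based on the fields listed above:

```python
import requests

# Hypothetical values: substitute your own fork and token.
OWNER, REPO = "yourusername", "doc2vec-experiments"
TOKEN = "<personal access token with workflow scope>"

resp = requests.post(
    f"https://api.github.com/repos/{OWNER}/{REPO}/actions/workflows/train-base-model.yaml/dispatches",
    headers={
        "Authorization": f"Bearer {TOKEN}",
        "Accept": "application/vnd.github+json",
    },
    # The input names here are assumptions based on the fields listed above.
    json={"ref": "main", "inputs": {"language": "java", "repo_count": "100"}},
)
resp.raise_for_status()  # GitHub returns 204 No Content on success
```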
- Go to Actions → "Fine-tune Doc2Vec Model"
- Configure:
- Repository URL: Your target repository
- Repository version (optional): Specific commit SHA, tag, or branch to analyze
- Leave empty for the latest code
- Examples: `v1.0.0`, `main`, `abc123def`
- Run ID: Copy the run ID from the base model workflow URL
- Example: If the base model URL is `.../actions/runs/1234567890`, use `1234567890`
- Base model artifact: Name from Step 1 (e.g., `base-model-java-100repos`)
- File extensions: Extensions to analyze
- Fine-tune epochs: Training iterations (default: 10)
- Update vocabulary: Whether to add new words from target repo (default: true)
- Run workflow to generate embeddings for your repository
```bash
# Clone the repo
git clone https://github.com/yourusername/doc2vec-experiments.git
cd doc2vec-experiments

# Install dependencies
pip install -r requirements.txt
```
```bash
# Step 1: Get popular repositories (max 1000 due to GitHub API limits)
python src/get_popular_repos.py \
  --language java \
  --count 100 \
  --min-stars 1000 \
  --output popular_repos.txt
```
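Under the hood, `get_popular_repos.py` uses the GitHub Search API. The sketch below shows the core idea; the exact internals (pagination details, authentication, retries) are assumptions, and the real script adds error handling:

```python
import requests

def fetch_popular_repos(language: str, count: int, min_stars: int) -> list[str]:
    """Collect clone URLs of the most-starred repos for a language."""
    urls: list[str] = []
    page = 1
    while len(urls) < count:
        resp = requests.get(
            "https://api.github.com/search/repositories",
            params={
                "q": f"language:{language} stars:>={min_stars}",
                "sort": "stars",
                "order": "desc",
                "per_page": 100,
                "page": page,
            },
        )
        resp.raise_for_status()
        items = resp.json()["items"]
        if not items:  # ran out of results before reaching `count`
            break
        urls.extend(repo["clone_url"] for repo in items)
        page += 1
    return urls[:count]
```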
```bash
# Step 2: Train base model on popular repos
python src/train_base_model.py \
  --repos popular_repos.txt \
  --ext .java \
  --output base_model.d2v \
  --vector-size 200 \
  --epochs 20
```
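Conceptually, base training clones each repo, turns every matching file into a gensim `TaggedDocument`, and trains one Doc2Vec model over the combined corpus. A simplified sketch (the tokenizer here is a naive stand-in; the real script's preprocessing may differ):

```python
import re
from pathlib import Path
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def tokenize(text: str) -> list[str]:
    # Naive code tokenizer: lowercase and split on word characters.
    return re.findall(r"\w+", text.lower())

def build_corpus(repo_dirs: list[Path], ext: str) -> list[TaggedDocument]:
    # One TaggedDocument per source file, tagged with its path.
    return [
        TaggedDocument(tokenize(p.read_text(encoding="utf-8", errors="ignore")), [str(p)])
        for repo_dir in repo_dirs
        for p in repo_dir.rglob(f"*{ext}")
    ]

# Hypothetical paths to locally cloned popular repos.
corpus = build_corpus([Path("repos/guava"), Path("repos/spring-boot")], ".java")

model = Doc2Vec(vector_size=200, window=5, min_count=3, epochs=20, dm=1)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
model.save("base_model.d2v")
```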
```bash
# Step 3: Fine-tune on your repository
python src/finetune_and_embed.py \
  --repo https://github.com/your-org/your-repo.git \
  --base-model base_model.d2v \
  --ext .java \
  --output your_repo \
  --epochs 10 \
  --version v1.0.0  # Optional: specific version to analyze
```

`train_base_model.py`:
- Trains a Doc2Vec model on multiple popular repositories
- Creates a robust base representation for code
- Outputs: base model, metadata, sample embeddings
`finetune_and_embed.py`:
- Takes a pre-trained base model
- Fine-tunes it on your specific repository
- Generates embeddings optimized for your codebase
- Outputs: embeddings CSV, fine-tuned model, metadata
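In gensim terms, fine-tuning presumably amounts to loading the base model, extending its vocabulary from the target repo (the "Update vocabulary" option), and continuing training before writing the embeddings out. A minimal sketch under those assumptions; note that vocabulary expansion via `build_vocab(update=True)` is only loosely supported by gensim for Doc2Vec:

```python
import re
from pathlib import Path
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Build a corpus from the target repository (hypothetical local clone path).
repo = Path("your-repo")
corpus = [
    TaggedDocument(re.findall(r"\w+", p.read_text(encoding="utf-8", errors="ignore").lower()),
                   [str(p.relative_to(repo))])
    for p in repo.rglob("*.java")
]

# Load the base model from the Step 1 artifact and continue training.
model = Doc2Vec.load("base_model.d2v")
model.build_vocab(corpus, update=True)  # add new words from the target repo
model.train(corpus, total_examples=len(corpus), epochs=10)

# One CSV row per file: repo-relative path, then the vector components.
with open("your_repo_embeddings.csv", "w") as f:
    for doc in corpus:
        vec = model.dv[doc.tags[0]]
        f.write(",".join([doc.tags[0], *map(str, vec)]) + "\n")

model.save("your_repo_finetuned.d2v")
```

The concrete files produced by the two workflows are listed below.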
- `base_model_{language}_{count}repos.d2v`: Trained Doc2Vec model
- `base_model_{language}_{count}repos.json`: Training metadata (repos used, parameters)
- `base_model_{language}_{count}repos.sample.csv`: Sample embeddings for validation
- `popular_repos.txt`: List of repository URLs used for training
- `{repo_name}_embeddings.csv`: Document vectors for each source file
  - Column 1: Repository-relative file path
  - Columns 2-201: 200-dimensional embedding vectors (or as configured)
- `{repo_name}_finetuned.d2v`: Fine-tuned Doc2Vec model
- `{repo_name}_metadata.json`: Fine-tuning metadata including:
  - Base model used
  - Repository URL and version (if specified)
  - Total documents processed
  - Vector size and vocabulary size
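With that layout, the embeddings CSV loads directly into pandas. Assuming the file has no header row:

```python
import pandas as pd

df = pd.read_csv("your_repo_embeddings.csv", header=None)
paths = df[0]                        # column 1: repo-relative file paths
vectors = df.iloc[:, 1:].to_numpy()  # remaining columns: embedding components

print(vectors.shape)  # (num_files, 200) with the default vector size
```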
Default training parameters:
- Vector size: 200 dimensions
- Window size: 5 tokens
- Min count: 3 (minimum word frequency)
- Training epochs: 20
- Model: DM (Distributed Memory)
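These defaults map onto gensim's `Doc2Vec` constructor as follows:

```python
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec(
    vector_size=200,  # embedding dimensions
    window=5,         # context window in tokens
    min_count=3,      # ignore words appearing fewer than 3 times
    epochs=20,        # training iterations
    dm=1,             # Distributed Memory architecture (vs. DBOW)
)
```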
- Code similarity: Find similar code files across projects
- Code search: Semantic search through codebases
- Code classification: Classify code by functionality or quality
- Technical debt analysis: Identify problematic code patterns
- Cross-project analysis: Compare coding styles across repositories
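For instance, code similarity reduces to a nearest-neighbor lookup over the document vectors. A sketch assuming gensim 4.x and a hypothetical file path used as the document tag:

```python
from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load("your_repo_finetuned.d2v")

# Five files whose embeddings are closest (cosine similarity) to the given file.
query = "src/main/java/com/example/UserService.java"  # hypothetical tag
for path, score in model.dv.most_similar(query, topn=5):
    print(f"{score:.3f}  {path}")
```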
```
doc2vec-experiments/
├── .github/
│   └── workflows/
│       ├── train-base-model.yaml    # Train base model on popular repos
│       └── finetune-model.yaml      # Fine-tune model on specific repo
├── src/
│   ├── train_base_model.py          # Train single model on multiple repos
│   ├── finetune_and_embed.py        # Fine-tune model and generate embeddings
│   ├── get_popular_repos.py         # Fetch popular repos from GitHub API
│   └── utils.py                     # Shared utilities for all scripts
├── requirements.txt                 # Python dependencies
└── README.md
```
- Python 3.10+
- Dependencies: `gensim`, `pandas`, `tqdm`, `scikit-learn`, `requests`
- Git (for repository cloning)
- GitHub API Rate Limits:
- Maximum 1000 repositories per search query due to GitHub Search API limitations
- To train on more repos, consider lowering the `min_stars` parameter or using different date ranges (see the sketch after this list)
- Processing Time:
- Training on hundreds of repositories can take several hours
- GitHub Actions has a 6-hour timeout for workflows
- Memory Usage:
- Large models with many repositories require significant RAM
- The workflow automatically sets up swap space for large training jobs
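One way around the 1000-result cap mentioned above is to partition the search by creation date, since each date-sliced query gets its own 1000-result budget. A sketch of that idea (the year boundaries are arbitrary examples):

```python
import requests

def search_page(query: str, page: int) -> list[dict]:
    """One page of GitHub repository search (at most ~10 pages / 1000 results per query)."""
    resp = requests.get(
        "https://api.github.com/search/repositories",
        params={"q": query, "sort": "stars", "order": "desc", "per_page": 100, "page": page},
    )
    resp.raise_for_status()
    return resp.json()["items"]

urls: set[str] = set()
for start, end in [("2015-01-01", "2018-12-31"), ("2019-01-01", "2022-12-31")]:
    # Each `created:` slice is a separate query with its own result cap.
    query = f"language:java stars:>=1000 created:{start}..{end}"
    for page in range(1, 11):
        items = search_page(query, page)
        if not items:
            break
        urls.update(repo["clone_url"] for repo in items)
```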
Contributions are welcome! Feel free to:
- Add support for more languages
- Improve tokenization strategies
- Add visualization tools
- Enhance the training pipeline
MIT License - see LICENSE file for details