A machine learning system that predicts soccer match draws and goal patterns using ensemble methods and advanced feature engineering, with a focus on high-precision results for betting applications.
- Key Features
- Project Architecture
- Installation
- Usage
- Development Workflow
- Model Pipeline
- Configuration
- Extending the Model
- Troubleshooting
- Contributing
- License
- Ensemble Model Architecture: Combines XGBoost, TabNet, LightGBM, Random Forest, and KNN for robust predictions.
- Precision-Focused Weighting: Optimized weights based on each model's precision performance.
- KNN Model Support: K-Nearest Neighbors can now be used as a base model in the ensemble.
- Vectorized Threshold Optimization: Efficient threshold tuning for precision-recall balance.
- CPU-Only Optimization: Explicitly configured for deterministic CPU-based training.
- Reproducible Results: Comprehensive seed setting and environment variable control.
- MLflow Integration: Pre-trained model loading and versioned model registration.
- Modern Tooling: Uses uv for package management, hatchling for builds, ruff for linting/formatting, and a Makefile for task automation.
- src Layout: Follows standard Python project structure.
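To make the idea of precision-focused weighting concrete, here is a minimal sketch that normalizes per-model validation precision into ensemble weights. The function names, the `power` parameter, and the precision values are assumptions for illustration only; the project's actual scheme lives in src/models/ensemble/weights.py and may differ.

```python
def precision_weights(precisions: dict[str, float], power: float = 1.0) -> dict[str, float]:
    """Turn per-model precision scores into weights that sum to 1."""
    if not precisions:
        raise ValueError("at least one model precision is required")
    # Raising to a power > 1 would favor the most precise models more strongly.
    raised = {name: max(p, 0.0) ** power for name, p in precisions.items()}
    total = sum(raised.values())
    if total == 0:
        # No model has positive precision: fall back to uniform weights.
        return {name: 1.0 / len(raised) for name in raised}
    return {name: value / total for name, value in raised.items()}


def weighted_probability(probas: dict[str, float], weights: dict[str, float]) -> float:
    """Blend per-model draw probabilities for one match using the weights."""
    return sum(weights[name] * probas[name] for name in weights)


# Hypothetical validation precisions, not measured results.
example_precisions = {"xgb": 0.50, "lgb": 0.30, "knn": 0.20}
weights = precision_weights(example_precisions)
blended = weighted_probability({"xgb": 0.6, "lgb": 0.4, "knn": 0.2}, weights)
```

With `power=1.0` the weights are simply proportional to precision; a larger exponent is one possible way to concentrate weight on the most precise base model.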
The system employs a multi-stage ensemble approach. Core Python code resides in the src/ directory. Key components include:
- Data Ingestion & Preprocessing (src/utils, data/)
- Feature Engineering (src/utils/advanced_goal_features.py)
- Base Model Training (src/models/StackedEnsemble/base/; includes XGBoost, TabNet, LightGBM, Random Forest, KNN)
- Ensemble Logic (src/models/ensemble/)
- Prediction Service (src/predictors/)
- Backend API (src/backend/, if applicable)
- Documentation (docs/, mkdocs.yml)
- Development Tools (devtools/, Makefile, pyproject.toml)
(See docs/architecture.md and docs/technical.md for more details).
- Python >=3.11,<4.0
- Git
- uv package manager (see uv installation)
- make (optional, for using the Makefile; see Makefile Setup Note)
- Windows 11 (tested on)
# Clone the repository (replace with your actual URL)
git clone https://github.com/ronyka77/TheDrawCode.git
cd TheDrawCode
# Create and activate virtual environment using uv
uv venv
# On Windows (cmd/powershell)
.venv\Scripts\activate
# On Linux/macOS
# source .venv/bin/activate
# Install dependencies using the Makefile (recommended)
make install
# OR install directly using uv
# uv sync --all-extras --dev
The Makefile provides convenient shortcuts. If make is not installed on your system (common on Windows), you can either install it (e.g., via Chocolatey: choco install make) or run the corresponding commands from the Makefile directly (e.g., run uv sync --all-extras --dev instead of make install).
Verify your installation by running the tests:
make test
# OR directly with uv
# uv run pytest
# Ensure your virtual environment is activated
# Run scripts from the project root directory
# Example assumes prepare_data function exists and loads data appropriately
# Note the 'src.' prefix due to the src-layout
from src.models.ensemble.ensemble_model import EnsembleModel # Adjust filename if needed
from src.utils.logger import ExperimentLogger
import pandas as pd
# Initialize logger
logger = ExperimentLogger(experiment_name="soccer_prediction")
# Load dataset (replace with your data loading)
# data = pd.read_csv("path/to/matches.csv")
# X_train, y_train, X_test, y_test = prepare_data(data)
# Initialize and train ensemble model (example)
# model = EnsembleModel(
#     logger=logger,
#     meta_learner_type='lgb',
#     dynamic_weighting=True,
#     target_precision=0.50,
#     required_recall=0.25,
#     extra_base_model_type='random_forest'
# )
# Train the model (example)
# results = model.train(X_train, y_train, X_test, y_test)
# Make predictions (example)
# predictions = model.predict(X_test)
# probabilities = model.predict_proba(X_test)
# print(f"Optimized threshold: {model.optimal_threshold}")
print("Usage example needs actual data loading and training steps.")
# Run from project root
python -m src.models.ensemble.run_ensemble
# Ensure MLflow server is configured or use local tracking
mlflow ui --port 5000 --backend-store-uri sqlite:///mlflow.db # Example using local SQLite
Navigate to http://localhost:5000 in your browser.
Use the Makefile for common tasks:
- make install: Install dependencies.
- make lint: Run ruff check and format.
- make test: Run pytest tests.
- make clean: Remove cache and build artifacts.
- make build: Build the package.
The system follows this workflow:
- Data Preparation: Feature engineering and validation.
- Base Model Loading: Loading pre-trained models from MLflow (from src/models/ensemble/ensemble_model.py).
    # Example run IDs used by the system
    xgb_run_id = '30402608b8dc4c899d675e5b56c48c01'
    # ... other model run IDs ...
    # knn_run_id = 'your_knn_model_run_id'
- Dynamic Weighting: Calculating weights (src/models/ensemble/weights.py).
- Meta-Feature Creation.
- Meta-Learner Training.
- Threshold Optimization (src/models/ensemble/thresholds.py).
- Model Registration: Registering the final model.
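The threshold optimization step in the workflow above can be sketched as a vectorized grid search with NumPy. This is an illustrative sketch, not the code in src/models/ensemble/thresholds.py: the function name `find_threshold`, the threshold grid, and the selection policy (highest precision, recall as tie-breaker) are assumptions. The default targets reuse the `target_precision=0.50` and `required_recall=0.25` values from the usage example.

```python
import numpy as np


def find_threshold(y_true, proba, target_precision=0.50, required_recall=0.25, grid=None):
    """Return the best feasible threshold on a grid, or None if none qualifies."""
    y_true = np.asarray(y_true).astype(bool)
    proba = np.asarray(proba, dtype=float)
    if grid is None:
        grid = np.linspace(0.05, 0.95, 91)  # candidate thresholds in 0.01 steps
    grid = np.asarray(grid, dtype=float)

    # Broadcast to a (n_thresholds, n_samples) matrix instead of looping.
    preds = proba[None, :] >= grid[:, None]
    tp = (preds & y_true[None, :]).sum(axis=1)
    fp = (preds & ~y_true[None, :]).sum(axis=1)

    predicted = tp + fp
    # Precision is defined as 0 where nothing is predicted positive.
    precision = np.divide(tp, predicted, out=np.zeros(len(grid)), where=predicted > 0)
    positives = int(y_true.sum())
    recall = tp / positives if positives > 0 else np.zeros(len(grid))

    feasible = np.flatnonzero((precision >= target_precision) & (recall >= required_recall))
    if feasible.size == 0:
        return None

    # Assumed policy: highest precision first, recall as the tie-breaker.
    order = np.lexsort((recall[feasible], precision[feasible]))
    best = feasible[order[-1]]
    return {
        "threshold": float(grid[best]),
        "precision": float(precision[best]),
        "recall": float(recall[best]),
    }


# Toy labels and probabilities, for illustration only.
result = find_threshold([0, 0, 1, 1], [0.101, 0.402, 0.353, 0.804])
```

Evaluating every threshold in one broadcasted comparison avoids a Python loop over the grid, at the cost of a temporary boolean matrix of size thresholds × samples.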
Configuration is managed via:
- pyproject.toml: Project metadata, dependencies, build settings, tool configurations (ruff, pytest).
- Environment Variables: For reproducibility and runtime settings (see Installation section).
- Model Parameters: Passed during EnsembleModel initialization or via run_ensemble script arguments.
(See docs/technical.md for details on specific settings like reproducibility seeds and base model parameters).
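As a rough illustration of what seed control can look like, the sketch below seeds Python's `random` module and, if it is installed, NumPy. The function name `set_reproducible_seeds` and the default seed of 42 are assumptions; the seeds and settings the project actually uses are described in docs/technical.md.

```python
import os
import random


def set_reproducible_seeds(seed: int = 42) -> None:
    """Seed the common random number generators with one value."""
    # PYTHONHASHSEED only affects interpreters started after this point
    # (for example subprocesses); it does not change the running process.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np  # optional: seeded only when NumPy is available
    except ImportError:
        return
    np.random.seed(seed)


set_reproducible_seeds(42)
first_draw = random.random()
set_reproducible_seeds(42)
second_draw = random.random()  # same value as first_draw after reseeding
```

Base model libraries typically take their own seed argument as well, so a global helper like this complements rather than replaces per-model seed parameters.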
- Implement the model in src/models/StackedEnsemble/base/ (see KNN as an example).
- Add the model type to extra_base_model_type options in src/models/ensemble/ensemble_model.py.
- Update the load_models_from_mlflow method in src/models/ensemble/ensemble_model.py.
- Register a new MLflow run ID for your trained model.
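One way to picture the second step is a lookup from the `extra_base_model_type` string to a model factory. The sketch below is hypothetical: the class names, the `EXTRA_BASE_MODELS` mapping, and `build_extra_base_model` are invented for illustration and are not taken from ensemble_model.py, which may handle the option differently.

```python
from typing import Callable, Dict


class RandomForestBase:
    """Placeholder standing in for an existing base model wrapper."""


class KNNBase:
    """Placeholder standing in for the KNN base model wrapper."""


class MyNewBase:
    """Placeholder for a newly added base model."""


# Hypothetical registry: option string -> factory for the base model.
EXTRA_BASE_MODELS: Dict[str, Callable[[], object]] = {
    "random_forest": RandomForestBase,
    "knn": KNNBase,
    "my_new_model": MyNewBase,  # the only line a new model would need here
}


def build_extra_base_model(extra_base_model_type: str) -> object:
    """Instantiate the base model registered under the given option string."""
    try:
        factory = EXTRA_BASE_MODELS[extra_base_model_type]
    except KeyError:
        known = ", ".join(sorted(EXTRA_BASE_MODELS))
        raise ValueError(
            f"unknown extra_base_model_type {extra_base_model_type!r}; expected one of: {known}"
        ) from None
    return factory()


model = build_extra_base_model("my_new_model")
```

Keeping the supported options in a single mapping means an unsupported value fails early with a message listing the valid choices.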
- Import Errors (ModuleNotFoundError): Ensure you run scripts from the project root directory (TheDrawCode) or have installed the package correctly (make install or uv pip install -e .). Verify the src layout is correct.
- make not found: See Makefile Setup Note.
- MLflow Model Loading Errors: Check model registration and signatures in MLflow.
- TensorFlow Numerical Differences: Ensure TF_ENABLE_ONEDNN_OPTS=0 is set.
- Memory Issues: Reduce batch sizes or feature counts.
- TabNet CPU Core Configuration: Ensure threading environment variables are set and PyTorch threads are configured if issues persist.
(See docs/technical.md for more details on specific configurations).
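The TensorFlow and TabNet items above both come down to setting environment variables before the heavy libraries are imported. A minimal sketch follows, assuming `OMP_NUM_THREADS` and `MKL_NUM_THREADS` are the threading variables in question (the source names only `TF_ENABLE_ONEDNN_OPTS`); the function name and the thread count are also assumptions.

```python
import os


def configure_cpu_environment(num_threads: int = 4) -> dict[str, str]:
    """Set CPU-related environment variables; call before importing TensorFlow."""
    settings = {
        "TF_ENABLE_ONEDNN_OPTS": "0",         # named in the troubleshooting list
        "OMP_NUM_THREADS": str(num_threads),  # assumed threading variable
        "MKL_NUM_THREADS": str(num_threads),  # assumed threading variable
    }
    os.environ.update(settings)
    try:
        import torch  # optional: only configured when PyTorch is installed
    except ImportError:
        return settings
    torch.set_num_threads(num_threads)
    return settings


applied = configure_cpu_environment(num_threads=2)
```

Environment variables read at import time have no effect if the library is already loaded, so a helper like this belongs at the very top of the entry script.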
Contributions are welcome!
- Fork the repository.
- Create your feature branch (git checkout -b feature/your-feature).
- Make your changes.
- Ensure code quality: Run make lint and make test.
- Commit your changes (git commit -am 'Add some feature').
- Push to the branch (git push origin feature/your-feature).
- Open a Pull Request.
Please adhere to PEP 8, include docstrings/comments, add tests, and use type hints.
This project is licensed under the MIT License - see the LICENSE file for details.
For more detailed documentation, run mkdocs serve and view the site locally, or refer to the files in the docs/ directory.

