An advanced NFL game prediction system using machine learning models to predict game outcomes, scores, and win probabilities.
```mermaid
graph TD
    A[Data Pipeline] --> B[Machine Learning Models]
    B --> C[REST API]
    C --> D[Frontend Interface]
    D -->|User Interaction| A
    A --> E[Real-time Predictions]
    E --> D
```
This NFL Prediction System offers the following key features:
- Data Pipeline: Semi-automated data collection and preprocessing from NFL APIs
- Machine Learning Models: Neural Network and Gradient Boosting models for predictions
- REST API: FastAPI-based web API for serving predictions
- Frontend Interface: React-based web interface for user interactions
- Real-time Predictions: Get predictions for upcoming NFL games
- Python 3.8+
- Node.js 14+
- pip (Python package manager)
- npm (Node package manager)
1. Clone the repository:

   ```bash
   git clone https://github.com/cjordon/NFL_ML_Predictions.git
   cd NFL_ML_Predictions
   ```

2. Install Python dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Install frontend dependencies:

   ```bash
   cd frontend
   npm install
   cd ..
   ```

With dependencies installed, run the pipeline end to end:

1. Build the dataset:

   ```bash
   python backend/build_csv_datasets.py --start 2018 --end 2025 --out-dir backend/data
   ```

2. Create the predictive dataset (NEW):

   ```bash
   python build_predictive_dataset.py --data-dir data --output-dir data
   ```

3. Train the models:

   ```bash
   python backend/train_models.py
   ```

4. Start the API server:

   ```bash
   uvicorn backend.main:app --reload --port 8000
   ```

5. Start the frontend (in a new terminal):

   ```bash
   cd frontend
   npm start
   ```

The application will be available at http://localhost:3000.
| Run Date (UTC) | Dataset | Features | Home MAE / RMSE | Away MAE / RMSE | Win Brier / LogLoss / Acc | Notes |
|---|---|---|---|---|---|---|
| 2025-12-01 16:33 | 2,611 games × 136 cols | Prior efficiency diffs, player aggregates, betting lines, rest, Elo | 4.45 / 5.85 | 4.36 / 5.57 | 0.123 / 0.388 / 0.825 | GradientBoostingRegressor (scores) + CalibratedClassifierCV (wins), random_state 4211. Full ledger in docs/training_runs.md. |
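The headline metrics in this ledger can be reproduced with scikit-learn. A minimal sketch per target; the arrays below are hypothetical placeholders standing in for real held-out predictions from `train_models.py`:

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             brier_score_loss, log_loss, accuracy_score)

# Hypothetical held-out values; in practice these come from the trained models.
home_true = np.array([24, 17, 31]); home_pred = np.array([21.5, 20.1, 27.8])
win_true = np.array([1, 0, 1]);     win_prob = np.array([0.71, 0.35, 0.62])

mae = mean_absolute_error(home_true, home_pred)
rmse = np.sqrt(mean_squared_error(home_true, home_pred))  # RMSE = sqrt(MSE)
brier = brier_score_loss(win_true, win_prob)   # mean squared error of the probabilities
ll = log_loss(win_true, win_prob)
acc = accuracy_score(win_true, (win_prob >= 0.5).astype(int))
print(f"MAE {mae:.2f} | RMSE {rmse:.2f} | Brier {brier:.3f} | LogLoss {ll:.3f} | Acc {acc:.3f}")
```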
To use the predictive dataset builder, you need two CSV files in your data directory:
- `play_by_play.csv`: contains NFL play-by-play data with the following key columns:
  - `game_id`: unique identifier for each game
  - `play_id`: unique identifier for each play
  - `season`, `week`, `quarter`: game timing information
  - `down`, `yards_to_go`, `yardline_100`: situational data
  - `home_team`, `away_team`, `posteam`: team information
  - `play_type`: type of play (pass, run, punt, etc.)
  - `yards_gained`: outcome of the play
  - `touchdown`, `interception`, `fumble`, `sack`, `penalty`: binary outcome indicators
  - `epa`: Expected Points Added
  - `wp`, `wpa`: Win Probability and Win Probability Added
- `player_tracking.csv`: contains player tracking data with these columns:
  - `game_id`, `play_id`: links to play-by-play data
  - `player_id`: unique player identifier
  - `position`: player position (QB, RB, WR, etc.)
  - `team`: player's team
  - `x_position`, `y_position`: field coordinates
  - `speed`, `acceleration`: movement metrics
  - `distance_traveled`: total distance covered during the play
  - `max_speed`: maximum speed reached
  - `separation_distance`: distance from nearest opponent
  - `pressure_rate`: QB pressure metric (for QBs)
  - `coverage_rating`: defensive coverage metric
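Before running the builder, a quick schema check can catch missing columns early. A minimal pandas sketch, with the column lists taken from the descriptions above:

```python
import pandas as pd

REQUIRED = {
    "play_by_play.csv": ["game_id", "play_id", "season", "week", "quarter",
                         "down", "yards_to_go", "yardline_100", "home_team",
                         "away_team", "posteam", "play_type", "yards_gained",
                         "touchdown", "interception", "fumble", "sack",
                         "penalty", "epa", "wp", "wpa"],
    "player_tracking.csv": ["game_id", "play_id", "player_id", "position",
                            "team", "x_position", "y_position", "speed",
                            "acceleration", "distance_traveled", "max_speed",
                            "separation_distance", "pressure_rate",
                            "coverage_rating"],
}

for name, cols in REQUIRED.items():
    df = pd.read_csv(f"data/{name}", nrows=5)  # read a few rows; we only need the header
    missing = set(cols) - set(df.columns)
    print(f"{name}: {'OK' if not missing else f'missing {sorted(missing)}'}")
```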
You can obtain this data from several sources:
- NFL's Next Gen Stats: Official player tracking data
- nflfastR: Comprehensive play-by-play data (R package, but data available as CSV)
- Pro Football Reference: Historical play-by-play data
- ESPN API: Real-time play-by-play data
- nfl-data-py: Python package for NFL data (already used in this project)
The script creates several new predictive features:
- `offensive_epa`: Expected Points Added from the offensive team's perspective
- `play_result`: comprehensive categorization of play outcomes: `touchdown`, `interception`, `fumble`, `sack`, `penalty`, `first_down`, `positive_gain`, `no_gain`, `negative_gain`
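For intuition, here is a sketch of how `play_result` could be derived from the play-by-play columns. The authoritative rules live in `build_predictive_dataset.py`; the precedence order below (event flags before yardage buckets) is an assumption:

```python
import pandas as pd

def categorize_play(row: pd.Series) -> str:
    """Assumed precedence: scoring/turnover events first, then yardage buckets."""
    for flag in ("touchdown", "interception", "fumble", "sack", "penalty"):
        if row.get(flag, 0) == 1:
            return flag
    # A gain that reaches the line to gain counts as a first down.
    if row.get("yards_gained", 0) >= row.get("yards_to_go", float("inf")):
        return "first_down"
    if row["yards_gained"] > 0:
        return "positive_gain"
    return "no_gain" if row["yards_gained"] == 0 else "negative_gain"

# Usage: plays["play_result"] = plays.apply(categorize_play, axis=1)
```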
The script generates:
- `nfl_games.csv`: the main merged dataset
- `dataset_summary.txt`: summary statistics and feature descriptions
- `build_predictive_dataset.log`: detailed processing log
To evaluate the predictive power of the newly generated dataset compared to original source data:
```python
import pandas as pd
import numpy as np
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Load datasets
original_data = pd.read_csv('data/Nfl_data.csv')  # Existing game-level data
predictive_data = pd.read_csv('data/predictive_nfl_dataset.csv')  # New play-level data

print("Original dataset shape:", original_data.shape)
print("Predictive dataset shape:", predictive_data.shape)
print("\nNew features in predictive dataset:")
new_features = set(predictive_data.columns) - set(original_data.columns)
for feature in sorted(new_features):
    print(f"- {feature}")

# Prepare data for comparison
def prepare_game_level_data(df):
    """Aggregate play-level data to game level for fair comparison."""
    if 'game_id' in df.columns and 'play_id' in df.columns:
        # Play-level data - aggregate to game level
        game_features = df.groupby('game_id').agg({
            'offensive_epa': 'mean',
            'yards_gained': 'mean',
            'avg_speed': 'mean',
            'explosive_plays_count': 'sum',
            'success_rate': 'mean',
            'touchdown': 'sum',
            # Add other relevant features
        }).reset_index()
        # Add game outcome (you'll need to define this based on your data)
        # This is a simplified example
        game_features['home_won'] = np.random.choice([0, 1], size=len(game_features))
    else:
        # Game-level data
        game_features = df.copy()
        game_features['home_won'] = (game_features['point_diff'] > 0).astype(int)
    return game_features

# Prepare datasets
original_games = prepare_game_level_data(original_data)
predictive_games = prepare_game_level_data(predictive_data)

# Define features for modeling
original_features = ['home_prior_pf_avg_3', 'home_prior_pa_avg_3', 'away_prior_pf_avg_3', 'away_prior_pa_avg_3']
predictive_features = ['offensive_epa', 'avg_speed', 'explosive_plays_count', 'success_rate', 'touchdown']

# Train models
def evaluate_model(X, y, feature_names, model_name):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Random Forest
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X_train, y_train)
    rf_pred = rf.predict(X_test)
    rf_accuracy = accuracy_score(y_test, rf_pred)

    # Logistic Regression (higher max_iter avoids convergence warnings)
    lr = LogisticRegression(random_state=42, max_iter=1000)
    lr.fit(X_train, y_train)
    lr_pred = lr.predict(X_test)
    lr_accuracy = accuracy_score(y_test, lr_pred)

    print(f"\n{model_name} Results:")
    print(f"Random Forest Accuracy: {rf_accuracy:.3f}")
    print(f"Logistic Regression Accuracy: {lr_accuracy:.3f}")

    # Feature importance (Random Forest)
    importance = pd.DataFrame({
        'feature': feature_names,
        'importance': rf.feature_importances_
    }).sort_values('importance', ascending=False)
    print("Top 5 Most Important Features:")
    print(importance.head())
    return rf_accuracy, lr_accuracy

# Compare models
print("=" * 50)
print("MODEL COMPARISON")
print("=" * 50)

# Original data model
if len(original_games) > 100 and all(col in original_games.columns for col in original_features):
    X_orig = original_games[original_features].fillna(0)
    y_orig = original_games['home_won']
    orig_rf, orig_lr = evaluate_model(X_orig, y_orig, original_features, "Original Dataset")

# Predictive data model
if len(predictive_games) > 100 and all(col in predictive_games.columns for col in predictive_features):
    X_pred = predictive_games[predictive_features].fillna(0)
    y_pred = predictive_games['home_won']
    pred_rf, pred_lr = evaluate_model(X_pred, y_pred, predictive_features, "Predictive Dataset")

# Correlation analysis
def analyze_correlations(df, target_col='home_won'):
    """Analyze feature correlations with target variable."""
    numeric_cols = df.select_dtypes(include=[np.number]).columns
    correlations = df[numeric_cols].corr()[target_col].abs().sort_values(ascending=False)
    print(f"\nTop 10 features correlated with {target_col}:")
    print(correlations.head(10))
    return correlations

# Run correlation analysis
if 'home_won' in predictive_games.columns:
    pred_correlations = analyze_correlations(predictive_games)

# Feature distribution analysis
def compare_feature_distributions(orig_df, pred_df):
    """Compare feature distributions between datasets."""
    # Restrict to numeric columns shared by both datasets so mean/std are defined.
    common_features = (set(orig_df.select_dtypes(include=[np.number]).columns)
                       & set(pred_df.select_dtypes(include=[np.number]).columns))
    for feature in list(common_features)[:5]:  # Analyze first 5 common features
        print(f"\n{feature} Statistics:")
        print(f"Original - Mean: {orig_df[feature].mean():.3f}, Std: {orig_df[feature].std():.3f}")
        print(f"Predictive - Mean: {pred_df[feature].mean():.3f}, Std: {pred_df[feature].std():.3f}")

compare_feature_distributions(original_games, predictive_games)
```

This comparison framework allows you to:
- Evaluate which dataset produces more accurate predictions
- Identify the most important features for prediction
- Understand how the engineered features contribute to model performance
- Compare feature distributions and correlations
The predictive dataset should show improved performance due to the additional player tracking features and engineered variables that capture more granular aspects of game play.
```
NFL_ML_Predictions/
├── backend/
│   ├── data/                      # Data files and datasets
│   ├── models/                    # Trained ML models
│   ├── scripts/                   # Utility scripts
│   ├── main.py                    # FastAPI application
│   ├── train_models.py            # Model training script
│   └── build_csv_datasets.py      # Data pipeline
├── frontend/                      # React frontend application
├── build_predictive_dataset.py    # NEW: Predictive dataset builder
├── requirements.txt               # Python dependencies
└── README.md                      # This file
```
The backend exposes the following stable HTTP endpoints. These are the contracts the frontend uses (via `frontend/src/api/client.js`). If you deploy your own backend, ensure these paths are reachable and that CORS is configured to allow requests from your frontend origin.
- `GET /health` — Health check. Returns a detailed JSON object describing component readiness (models, dataset, metadata) and a timestamp. Useful for CI, readiness probes, and UI status badges.
- `POST /predict` — Produce a prediction for a single scheduled game. Request body (JSON): `{"home_team": "SF", "away_team": "SEA", "season": 2025, "week": 10}`. Response (JSON): a `PredictionResponse` object including `home_score`, `away_score`, `home_win_probability`, `point_diff`, `mode`, and quality metadata such as `prediction_source` and `confidence_score`.
- `GET /schedule/next-week` — Returns the upcoming week's schedule as an array of compact game objects: `{season, week, home_team, away_team, kickoff, venue, network, game_id}`. The handler picks the next slate using kickoff timestamps when available, otherwise falls back to a calendar-aware heuristic (see the sketch after this list).
- `GET /history?limit=N` — Recent prediction history entries (most recent first). The `limit` query parameter bounds results; the API enforces a maximum to avoid accidental overload.
- `GET /debug` — Lightweight debug information (CORS/environment hints).
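The kickoff-based slate selection mentioned above could look roughly like the following sketch. This is an illustration of the idea, not the exact logic in `backend/main.py`; it assumes a schedule DataFrame with `season`, `week`, and ISO-timestamp `kickoff` columns:

```python
from datetime import datetime, timezone
import pandas as pd

def next_slate(schedule: pd.DataFrame) -> pd.DataFrame:
    """Pick the next (season, week) whose earliest kickoff is still in the future.

    Falls back to the latest week on record when no future kickoffs are found
    (the real handler uses a calendar-aware heuristic instead).
    """
    now = datetime.now(timezone.utc)
    sched = schedule.copy()
    sched["kickoff"] = pd.to_datetime(sched["kickoff"], utc=True, errors="coerce")
    upcoming = sched[sched["kickoff"] > now]
    if upcoming.empty:
        return sched[sched["week"] == sched["week"].max()]
    first = upcoming.sort_values("kickoff").iloc[0]
    return sched[(sched["season"] == first["season"]) & (sched["week"] == first["week"])]
```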
Notes:

- Some older documentation mentions `POST /retrain` or `POST /update_data`. At the time of this check those administrative endpoints are not implemented in `backend/main.py` (they appear only in docs and hooks). The frontend client (`frontend/src/api/client.js`) includes a safe `startTraining` helper that will POST to `/retrain` if present and return a graceful `{status: 'unsupported'}` object when the backend does not expose it.
- If you need retraining automation, use `backend/train_models.py` or the `scripts/` helpers to run offline retraining and then deploy the new artifacts into `backend/models/`.
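A minimal Python client for the endpoints above, assuming the `requests` package is installed and the backend is running locally on port 8000:

```python
import requests

BASE = "http://127.0.0.1:8000"

# Health check: component readiness plus a timestamp.
print(requests.get(f"{BASE}/health", timeout=10).json())

# Single-game prediction.
payload = {"home_team": "SF", "away_team": "SEA", "season": 2025, "week": 10}
pred = requests.post(f"{BASE}/predict", json=payload, timeout=30).json()
print(pred["home_score"], pred["away_score"], pred["home_win_probability"])

# Upcoming slate and recent history.
print(requests.get(f"{BASE}/schedule/next-week", timeout=10).json()[:2])
print(requests.get(f"{BASE}/history", params={"limit": 5}, timeout=10).json())
```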
A short, practical guide for maintainers who want to tweak the frontend UI without hunting through the code. The paths below point to the files you will most commonly edit when making changes to branding, the stats/status page, team logos, or theme tokens.
- Site logo & favicon
  - Favicon: `frontend/index.html` — change the `<link rel="icon">` tag. Example: replace the inline data URL with `/favicon.ico` and drop the file into `frontend/public/favicon.ico`.
  - Header / site logo: `frontend/src/components/NavBar/NavBar.jsx` + `frontend/src/components/NavBar/NavBar.css` — the NavBar currently uses text (`<h1>NFL Predict</h1>`). Replace that element with an image tag (`<img src="/logos/brand-logo.svg" alt="Site name" />`) and add responsive CSS in `NavBar.css` (or your global CSS). Quick example (`NavBar.jsx`): add your asset at `frontend/public/logos/brand-logo.svg`, then update the JSX to render `<img className="site-logo" src="/logos/brand-logo.svg" />`.
- Team logos (matchups/team badges)
  - Frontend source of truth: `frontend/public/myteamdescriptions.csv` — a simple CSV (`team_name,abbr,logo_url`). `PredictionContext.jsx` fetches `/data/myteamdescriptions.csv` on mount and populates the `teams` used by the `TeamGrid`/`Card` components. Edit this CSV to change or point to different logo URLs; a quick sanity check follows below.
  - Backend fallback: `backend/team_logo.csv` — the backend schedule endpoint (`/schedule/next-week`) reads this file when enriching schedule rows. If you want the backend to serve embedded logo URLs, update this file instead and redeploy the backend.
  - Hosting logos locally: place static assets under `frontend/public/logos/` and set `logo_url` to `/logos/<ABBR>.svg` in the CSV so the app serves them with no external dependencies.
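  To verify the CSV after editing, a minimal pandas sketch (assuming the `team_name,abbr,logo_url` header described above and locally hosted logos under `frontend/public/logos/`):

  ```python
  from pathlib import Path
  import pandas as pd

  df = pd.read_csv("frontend/public/myteamdescriptions.csv")
  assert {"team_name", "abbr", "logo_url"} <= set(df.columns), "unexpected CSV header"

  # For locally hosted logos (/logos/<ABBR>.svg), confirm each referenced file exists.
  for _, row in df.iterrows():
      url = str(row["logo_url"])
      if url.startswith("/logos/"):
          path = Path("frontend/public") / url.lstrip("/")
          if not path.exists():
              print(f"missing asset for {row['abbr']}: {path}")
  ```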
- Stats / Status ("sts") page display
  - Primary files:
    - `frontend/src/pages/StatsPage.jsx` — page logic (data fetch + layout)
    - `frontend/src/pages/StatsPage.module.css` — page-specific styles
    - `frontend/src/components/HistoryChart.jsx` — history list/chart logic
    - `frontend/src/components/HistoryPage.jsx` — history full-page view
  - To change KPIs, card layout, or which metrics are shown: edit `StatsPage.jsx` (the `hydrate()` function collects schedule/history/overview) and adapt the `SummaryCard` renderers and the CSS in `StatsPage.module.css`.
- Team grid & per-game cards
  - Files to edit for card layout, logo placement, and prediction info:
    - `frontend/src/components/Card/Card.jsx`
    - `frontend/src/components/Card/Card.module.css`
    - `frontend/src/components/Card/TeamGrid.jsx`
    - `frontend/src/components/Card/TeamGrid.css`
  - These control the matchup card markup, logo image elements, kickoff formatting, and the section that renders prediction probabilities.
- Theme tokens, colors, and fonts
  - Global tokens and design-system variables are in:
    - `frontend/src/styles/base.css` — primary design tokens (`:root`) such as `--c-brand-1`, `--font-sans`, `--r-md`, etc. Change these to alter colors, radii, fonts, shadows, and more across the app.
    - `frontend/src/styles/theme-grid.css` — component/theme helpers used by some components.
  - After changing variables in `base.css`, rebuild the app to see the updated theme applied everywhere.
- API base URL / dev proxy
  - Dev proxy: `frontend/vite.config.js` — the `server.proxy` section forwards `/schedule`, `/predict`, `/history`, `/health`, and `/debug` to `http://127.0.0.1:8000` during local development. Ensure your backend is running on port 8000 for the dev proxy to work.
  - Production base URL: `frontend/.env` (key: `VITE_API_BASE`) — set this to your deployed backend (e.g., `https://nfl-predict-ecf5a5bd34fe.herokuapp.com/`). The client reads `import.meta.env.VITE_API_BASE` in `frontend/src/api/client.js`.
- Charts, data formatting, and date/time
  - Charts and history display are rendered by `HistoryChart.jsx`. To change how timestamps or percentages are formatted, update the helpers in that file (e.g., `toDateOrNull`, `toWholePercent`) or the components that consume the normalized data.
- Background / brand imagery
  - The app background is referenced in `frontend/src/styles/base.css` (`background-image: url('/nfl_pic.png')`) — replace `frontend/public/nfl_pic.png` to change the background.
- Rebuild & deploy (quick commands)
  - Local development (Vite dev server + proxy):

    ```bash
    cd frontend
    npm install
    npm run dev
    ```

  - Production build (static assets):

    ```bash
    cd frontend
    npm run build
    # then deploy the `frontend/dist` folder (Vercel will auto-detect)
    ```

  - The repo includes `scripts/deploy.ps1` to push the backend to Heroku and the frontend to Vercel (it automates CORS updates and builds). See the `scripts/` folder for deployment helpers.
- Troubleshooting & tips
  - If team logos do not update after changing the CSV or local files, clear the browser cache or change the filename to avoid CDN cache effects.
  - When changing API contracts, always update `frontend/src/api/client.js` and adjust `vite.config.js` (proxy) and `frontend/.env` accordingly.
  - For accessibility changes (font sizes, color contrast), prefer token edits in `base.css` rather than edits across many component files.
Please read our contributing guidelines before submitting pull requests.
This project uses a split deployment architecture:
- Backend (FastAPI): deployed on Heroku at https://nfl-predict-ecf5a5bd34fe.herokuapp.com
- Frontend (React): deployed on Vercel at https://nfl-ml-predictions.vercel.app
The backend and frontend are properly configured for cross-origin requests:
- Backend CORS: The API now ships with an explicit default CORS policy that allows the production frontend and a localhost dev origin. This makes most deployments simpler and protects users from an accidentally empty `ALLOWED_ORIGINS` configuration. Default allowed origins:

  - `https://nfl-ml-predictions.vercel.app`
  - `http://localhost:3000`

  These defaults may be overridden using the `ALLOWED_ORIGINS` environment variable on Heroku if you need to add extra origins or enable broader access. For example, to explicitly set allowed origins on Heroku:

  ```bash
  heroku config:set ALLOWED_ORIGINS="https://nfl-ml-predictions.vercel.app,http://localhost:3000" -a nfl-predict
  ```
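  For reference, this pattern maps onto FastAPI's `CORSMiddleware` roughly as follows. A sketch of the approach, not the exact code in `backend/main.py`:

  ```python
  import os
  from fastapi import FastAPI
  from fastapi.middleware.cors import CORSMiddleware

  DEFAULT_ORIGINS = [
      "https://nfl-ml-predictions.vercel.app",
      "http://localhost:3000",
  ]

  app = FastAPI()

  # A comma-separated ALLOWED_ORIGINS env var overrides the defaults when set.
  origins = [o.strip() for o in os.getenv("ALLOWED_ORIGINS", "").split(",") if o.strip()] or DEFAULT_ORIGINS

  app.add_middleware(
      CORSMiddleware,
      allow_origins=origins,
      allow_methods=["*"],
      allow_headers=["*"],
  )
  ```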
- Frontend API base: Set `VITE_API_BASE` in the Vercel project settings or in `frontend/.env.production`. Note: the frontend client prefers `VITE_API_BASE`; `VITE_API_URL` is still recognized in some docs for backward compatibility, but `VITE_API_BASE` is the canonical env key used by `frontend/src/api/client.js`.
For a detailed CORS and API configuration guide, see `docs/CORS_API_CONFIGURATION.md`.
```bash
# Login to Heroku
heroku login

# Deploy backend
git push heroku main

# Verify deployment
heroku logs --tail -a nfl-predict
curl https://nfl-predict-ecf5a5bd34fe.herokuapp.com/health
```

```bash
# Login to Vercel
vercel login

# Deploy frontend
cd frontend
npm run build
vercel --prod
```

For automated deployment, use the PowerShell deployment script:

```bash
pwsh -File scripts/deploy.ps1
```

This script handles:
- CORS configuration on Heroku
- Frontend dependency installation and build
- Git commits and pushes
- Backend deployment to Heroku
- Frontend deployment to Vercel
- Health check verification
See `DEPLOYMENT_FIXED.md` for detailed deployment troubleshooting.
This project is licensed under the MIT License - see the LICENSE file for details.
