A full-stack machine learning application that predicts startup success using more than 50,000 company data points spanning 1990-2015. Built with a peer-reviewed academic validation methodology and powered by XGBoost. Prior to the full-stack implementation, comprehensive analysis was conducted through five documented notebooks: exploratory data analysis, preprocessing and feature engineering, modeling development, performance evaluation, and production pipeline setup.
- Frontend, Backend, Data, & Notebook READMEs (More Detail & Visual Examples)
- Overview
- Why Did I Build This?
- Key Features
- System Architecture
- Demo GIFs
- Technology Stack
- Project Structure
- Quick Start
- Notebooks
- Methodology & Academic Foundation
- Overall Model Performances
- Use Cases
- API Documentation
- Academic Context
- Contributing
- License
- Author
- Acknowledgments & References
For more comprehensive, specific, and thorough documentation and examples:
This project implements and extends the bias-free startup success prediction methodology from Żbikowski & Antosiuk (2021). This repository provides:
- Machine Learning Models: XGBoost, Logistic Regression, and SVM with documentation, analysis, and evaluation
- Interactive Web Application: React/Next.js frontend with FastAPI backend
- Model Interpretability: SHAP explanations for individual predictions
- Academic Validation: Reproduces and extends published research methodology
- F1-Score: 29.1%
- AUC-ROC: 79.0%
- Recall: 38.8%
- Precision: 23.4%
As a Statistics and Data Science student at UCSB, I wanted to create a project that goes beyond coursework. My background and interests lie in machine learning, artificial intelligence, data science, and software engineering. I set out to build something that's academically rigorous, professionally relevant, and personally meaningful.
Startups fascinate me. They combine innovation, data, and uncertainty, which makes them the perfect space to apply machine learning. I came across an academic paper that used a bias-free ML approach to predict startup success, and I saw an opportunity: What if I could not only replicate that research but extend it with different techniques, real-world applications, and a full-stack, production-ready interface?
This project became my way of learning how to build an end-to-end machine learning pipeline, from raw data and literature review to model deployment and interactive demo. I performed exploratory data analysis, built reusable preprocessing pipelines, engineered high-value features, trained and evaluated multiple models, and explored the business implications of different success definitions. I also integrated explainable AI using SHAP, conducted temporal validation across decades, and compared academic versus venture capital perspectives on startup success.
While I had previously built full-stack web applications and retrieval-augmented generation (RAG) systems, this project was an opportunity to go deeper. I challenged myself to learn new tools like FastAPI for backend development, Next.js for a polished frontend, and Tailwind CSS for rapid UI design. It pushed me to improve as a student aiming to work in data, machine learning, software development, and artificial intelligence!
- 22 Engineered Features across geographic, industry, and temporal dimensions
- Bias Prevention using only founding-time information
- Cross-Validation with 5-fold stratified approach
- SHAP Integration for model interpretability
- Real-time Predictions with confidence intervals
- Interactive UI with searchable dropdowns for 750+ regions/cities
- Multi-category Selection from 15 industry categories
- Visual Explanations showing key success factors
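The 5-fold stratified cross-validation listed above can be illustrated with a small standard-library sketch. This is not the project's splitting code (which is not shown in this README and most likely relies on scikit-learn's `StratifiedKFold`); the function name `stratified_folds`, the seed, and the toy labels are assumptions made for illustration. The idea is that each fold keeps roughly the same share of successful companies as the full dataset, which matters when successes are rare.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits=5, seed=42):
    """Assign each sample index to one of n_splits folds, class by class.

    Hypothetical stand-in for sklearn.model_selection.StratifiedKFold,
    written with the standard library only.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so every fold receives a near-equal
        # share of each class.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


# Toy, made-up labels: 10 successes among 100 companies.
labels = [1] * 10 + [0] * 90
folds = stratified_folds(labels)
for k, test_idx in enumerate(folds):
    positives = sum(labels[i] for i in test_idx)
    print(f"fold {k}: {len(test_idx)} samples, {positives} positives")
```

With these toy labels every fold holds 20 samples, 2 of them positive, so each held-out fold preserves the 10% success rate of the whole set.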
┌─────────────────────────────────────────────────────────────────┐
│                        Next.js Frontend                         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐              │
│  │ Main        │  │ About       │  │ Prediction  │              │
│  │ Page        │  │ Page        │  │ Results     │              │
│  └─────────────┘  └─────────────┘  └─────────────┘              │
│                                                                 │
│  ┌─────────────┐  ┌─────────────┐                               │
│  │User         │  │SHAP         │                               │
│  │Inputs       │  │Results      │                               │
│  └─────────────┘  └─────────────┘                               │
└──────────────────────────┼──────────────────────────────────────┘
                           │
┌──────────────────────────┼──────────────────────────────────────┐
│                         FastAPI Backend                         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐              │
│  │ Prediction  │  │ Explanation │  │ Health      │              │
│  │ Endpoint    │  │ Endpoint    │  │ Check       │              │
│  └─────────────┘  └─────────────┘  └─────────────┘              │
│                                                                 │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐              │
│  │Data         │  │SHAP         │  │Model        │              │
│  │Preprocessor │  │Explainers   │  │Loader       │              │
│  └─────────────┘  └─────────────┘  └─────────────┘              │
└──────────────────────────┼──────────────────────────────────────┘
                           │
┌──────────────────────────┼──────────────────────────────────────┐
│                     Machine Learning Layer                      │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐              │
│  │ XGBoost     │  │ Logistic    │  │ SVM RBF     │              │
│  │ Model       │  │ Regression  │  │ Model       │              │
│  └─────────────┘  └─────────────┘  └─────────────┘              │
└──────────────────────────┼──────────────────────────────────────┘
                           │
┌──────────────────────────┼──────────────────────────────────────┐
│                           Data Layer                            │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐              │
│  │Crunchbase   │  │Preprocessed │  │Model        │              │
│  │Dataset      │  │Features     │  │Artifacts    │              │
│  │(50k+)       │  │(22 dims)    │  │(.pkl files) │              │
│  └─────────────┘  └─────────────┘  └─────────────┘              │
│                                                                 │
│  ┌─────────────┐  ┌─────────────┐                               │
│  │Categories   │  │SHAP         │                               │
│  │Reference    │  │Explainer    │                               │
│  │Data         │  │Objects      │                               │
│  └─────────────┘  └─────────────┘                               │
└─────────────────────────────────────────────────────────────────┘
- Python: High-level programming language for data science and machine learning
- XGBoost: Optimized gradient boosting framework for high-performance ML models
- Logistic Regression with Regularization: Linear classification algorithm with penalty terms to prevent overfitting
- SVM with RBF Kernel: Support Vector Machine using radial basis function for non-linear classification
- SHAP: Model interpretability library providing unified approach to explain predictions
- Jupyter: Interactive computing environment for data analysis and model development
- Pandas: Data manipulation and analysis library for structured data processing
- NumPy: Fundamental package for numerical computing and array operations
- Matplotlib: Comprehensive plotting library for creating static visualizations
- Seaborn: Statistical data visualization library built on matplotlib
- scikit-learn: Machine learning library with algorithms for classification, regression, and preprocessing
- React: JavaScript library for building interactive user interfaces with component-based architecture
- Next.js: Full stack React framework with server-side rendering and routing capabilities
- TypeScript: Typed superset of JavaScript providing static type checking and enhanced development experience
- Tailwind CSS: Utility first CSS framework for rapid UI development with pre-built styling classes
- FastAPI: Modern, fast web framework for building APIs with automatic documentation and type hints
- Pydantic: Data validation library using Python type annotations for request/response schemas
- Uvicorn: Lightning fast ASGI server for serving Python web applications in production
- Data Processing: Pipeline to transform user input into feature vectors for trained model inference
ML_STARTUP_SUCCESS_PREDICTOR
├── app/
│   └── app.py
├── data/
│   ├── processed/
│   ├── raw/
│   └── README.md
├── notebooks/
│   ├── 01_data_exploration.ipynb
│   ├── 02_data_preprocessing.ipynb
│   ├── 03_modeling.ipynb
│   ├── 04_evaluation.ipynb
│   ├── 05_pipeline.ipynb
│   └── README.md
├── results/
│   ├── figures/
│   ├── models/
│   └── reports/
├── src/
│   ├── data_preprocessing.py
│   ├── data_util.py
│   └── README.md
├── startup-predictor/
│   ├── app/
│   │   ├── about/
│   │   │   └── page.tsx
│   │   ├── page.tsx
│   │   ├── favicon.ico
│   │   ├── globals.css
│   │   └── layout.tsx
│   ├── node_modules/
│   ├── public/
│   │   ├── file.svg
│   │   ├── globe.svg
│   │   ├── next.svg
│   │   ├── vercel.svg
│   │   └── window.svg
│   ├── styles/
│   │   └── app.css
│   ├── .gitignore
│   ├── next.config.mjs
│   ├── package.json
│   ├── tailwind.config.json
│   ├── tsconfig.json
│   └── README.md
├── README.md
├── requirements.txt
├── LICENSE
├── .env
├── .gitattributes
└── .gitignore
- Python 3.8+
- Node.js 16+
- pip and npm
git clone https://github.com/RyanFabrick/Startup-Success-Prediction.git
cd Startup-Success-Prediction

# Install Python dependencies
pip install -r requirements.txt
# Start FastAPI server
cd app
python app.py
# Server runs on http://localhost:8000

# Install Node dependencies
cd startup-predictor
npm install
# Start development server
npm run dev
# Application runs on http://localhost:3000

curl http://localhost:8000/health

The application requires environment variables to be configured for proper operation.

# Environment (.env)
cp .env.example .env
# Configure settings as needed

The complete data process and analysis is documented across five notebooks:
Each notebook is self-contained, with thoroughly detailed documentation for each step, and can be run independently. Go to the Notebooks README for more information.
Based on "A machine learning, bias-free approach for predicting business success using Crunchbase data" (Żbikowski & Antosiuk, 2021). In my implementation I attempt to:
- Reproduce the original bias-free methodology
- Extend it with enhanced feature engineering (22 vs 8 features)
- Validate it across multiple economic cycles (1995-2015)
- Compare academic vs practical success definitions
- Geographic Factors (3): Region/city startup density rankings, US indicator
- Industry Categories (15): Binary encoding for major startup sectors
- Temporal Features (4): Standardized founding year, economic era classification
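A minimal sketch of how part of such a feature vector could be assembled is given here. It covers only the US indicator, the binary industry encoding, and a standardized founding year, and it is not the project's preprocessing code. The category names, the mean and standard deviation, and the function name `encode_input` are all hypothetical; the real pipeline uses 15 categories and 22 features in total.

```python
# Hypothetical subset of sector names; the project uses 15 categories
# whose exact names are not listed in this README.
CATEGORIES = ["software", "mobile", "biotechnology", "e-commerce"]

# Hypothetical training-set statistics for the founding year.
FOUNDED_YEAR_MEAN = 2007.0
FOUNDED_YEAR_STD = 5.0


def encode_input(payload):
    """Turn a request-style dict into a flat list of numeric features."""
    tokens = set(payload["category_list"].lower().split())
    # One binary flag per known sector (multi-hot encoding).
    category_flags = [1.0 if name in tokens else 0.0 for name in CATEGORIES]
    # US indicator derived from the country code.
    is_usa = 1.0 if payload["country_code"] == "USA" else 0.0
    # Standardized (z-scored) founding year.
    year_z = (payload["founded_year"] - FOUNDED_YEAR_MEAN) / FOUNDED_YEAR_STD
    return [is_usa, *category_flags, year_z]


example = {
    "country_code": "USA",
    "category_list": "software mobile",
    "founded_year": 2010,
}
features = encode_input(example)
print(features)
```

Every value used here is available at founding time, which is the constraint the bias-free methodology imposes on the feature set.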
Academic Success: Company acquired OR (still operating AND reached Series B funding)
- Eliminates look-ahead bias by using only founding-time features
- Focuses on observable outcomes rather than subjective metrics
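The Academic Success definition above reduces to a single boolean rule. The sketch below expresses it in Python under stated assumptions: the status strings (`"acquired"`, `"operating"`, `"closed"`), the `reached_series_b` flag, and the sample records are hypothetical and may not match the column names or values used in the notebooks.

```python
def is_academic_success(status, reached_series_b):
    """Label a company: acquired, OR still operating AND reached Series B."""
    return status == "acquired" or (status == "operating" and bool(reached_series_b))


# Made-up records for illustration only.
companies = [
    {"name": "A", "status": "acquired", "reached_series_b": False},
    {"name": "B", "status": "operating", "reached_series_b": True},
    {"name": "C", "status": "operating", "reached_series_b": False},
    {"name": "D", "status": "closed", "reached_series_b": True},
]

success_labels = [
    is_academic_success(c["status"], c["reached_series_b"]) for c in companies
]
for company, label in zip(companies, success_labels):
    print(company["name"], "success" if label else "not success")
```

Note that the label uses outcome information, while the model inputs stay restricted to founding-time features; keeping the two separate is what avoids look-ahead bias.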
| Model | Precision | Recall | F1-Score | AUC-ROC |
|---|---|---|---|---|
| Logistic Regression | 0.169 | 0.709 | 0.273 | 0.781 |
| SVM (RBF) | 0.155 | 0.689 | 0.252 | 0.740 |
| XGBoost | 0.234 | 0.388 | 0.291 | 0.790 |
| Academic Target | 0.570 | 0.340 | 0.430 | N/A |
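As a quick consistency check on the table above, the F1-score can be recomputed from precision and recall as their harmonic mean, F1 = 2PR / (P + R). The sketch below uses only the numbers reported in the table; the helper name `f1_score` is my own.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
reported = {
    "Logistic Regression": (0.169, 0.709, 0.273),
    "SVM (RBF)": (0.155, 0.689, 0.252),
    "XGBoost": (0.234, 0.388, 0.291),
    "Academic Target": (0.570, 0.340, 0.430),
}

for model, (precision, recall, f1_reported) in reported.items():
    f1 = f1_score(precision, recall)
    print(f"{model}: recomputed F1 = {f1:.3f}, reported = {f1_reported:.3f}")
```

The recomputed values agree with the table to within about 0.005. Small gaps are expected, presumably because the reported precision and recall are themselves rounded.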
- Validate business ideas against historical success patterns
- Identify key risk factors before launching
- Benchmark against similar companies
- Screen opportunities with data driven insights
- Supplement due diligence with quantitative analysis
- Understand geographic and industry trends
- Academic validation of published methodologies
- Study startup ecosystem patterns
- Explore bias-free prediction techniques
- POST /predict - Basic success prediction
- POST /predict/explain - Prediction with SHAP explanations
- GET /categories - Available industry categories
- GET /regions - Searchable region list
- GET /cities - Searchable city list
- GET /health - System status
import requests

# Company details known at founding time
data = {
    "country_code": "USA",
    "region": "SF Bay Area",
    "city": "San Francisco",
    "category_list": "software mobile",
    "founded_year": 2010
}

# Request a prediction together with its SHAP explanation
response = requests.post("http://localhost:8000/predict/explain", json=data)
prediction = response.json()

This project validates and extends the methodology from:
This study presents an academically and technically comprehensive machine learning approach to predict startup success while explicitly addressing the look-ahead bias problem that plagues most existing research in this domain. The authors analyzed 213,171 companies from the Crunchbase database to develop practically applicable prediction models. While numerous studies have attempted to predict business success using machine learning, they typically suffer from methodological flaws that make their results impractical for actual investment decisions.
This research establishes a new standard for startup success prediction by prioritizing practical applicability over theoretical performance, providing a valuable tool for data-driven investment decisions while advancing our understanding of entrepreneurial success factors. I used it as both context and inspiration for this project!
- Independent validation using separate dataset
- Enhanced feature engineering with funding progression metrics
- Temporal robustness across multiple economic cycles
- Production deployment with interactive explanations
This project was developed as a personal learning project. For future questions and/or suggestions:
- Open an issue describing the enhancement or bug
- Fork the repository and create a feature branch
- Follow coding standards
- Write tests for new functionality
- Update documentation as needed
- Submit a pull request with detailed description of changes
This project is open source and available under the MIT License.
Ryan Fabrick
- Statistics and Data Science (B.S) Student, University of California Santa Barbara
- GitHub: https://github.com/RyanFabrick
- LinkedIn: www.linkedin.com/in/ryan-fabrick
- Email: ryanfabrick@gmail.com
- Żbikowski, K., & Antosiuk, P. (2021) - "A machine learning, bias-free approach for predicting business success using Crunchbase data." Information Processing and Management, 58(4), 102555
- Crunchbase - Startup and company database providing the 50,000+ company dataset for model training and validation
- XGBoost - Optimized distributed gradient boosting library that implements machine learning algorithms under the gradient boosting framework
- scikit-learn - Machine learning library providing preprocessing, modeling, and evaluation tools including logistic regression and SVM implementations
- Logistic Regression - Linear classification algorithm using logistic function for binary and multiclass prediction with probabilistic outputs
- Support Vector Machine (SVM) with RBF Kernel - Non-linear classification algorithm using radial basis function kernel for complex decision boundaries
- SHAP - (SHapley Additive exPlanations) Model interpretability library enabling prediction explanations
- Pandas Community - Data manipulation and analysis library
- NumPy Community - Fundamental package for scientific computing
- Jupyter Project - Interactive computing environment for data analysis, processing, modeling, evaluation, and documentation
- FastAPI - Modern, fast web framework for building APIs with Python
- Uvicorn - Lightning fast ASGI server for Python web applications
- Pydantic - Data validation library using Python type annotations
- React Community - JavaScript library for building interactive user interfaces
- Next.js Community - React framework enabling full stack web applications
- Tailwind CSS - Utility first CSS framework for rapid UI development
Built with ❤️ for the machine learning community
This personal project demonstrates my machine learning engineering skills, full-stack development capabilities, and academic research validation. As a UCSB student, I designed this as an end-to-end showcase of my technical abilities across the complete ML pipeline - from literature review and data processing & analysis to model deployment and production-ready web applications. It highlights my skills in ML algorithms, bias-aware methodological design, model interpretability with SHAP, academic research validation, modern web development, and my passion for building data-driven solutions and tools for entrepreneurs, investors, researchers, and students.



