Recipe Extractor

This project builds a supervised learning system that classifies text segments in raw HTML recipe pages into structured components: title, ingredients, directions, and image links.

📁 Project Structure

recipe_extractor/
├── data/
│   ├── html_pages/           # Raw HTML files
│   ├── labels/               # JSON files with labeled data (title, ingredients, etc.)
│   ├── potential_labels/     # Raw candidates needing validation
│   └── processing_state.json # Tracks state of scraping and validation
├── models/                   # Trained model (.joblib)
├── src/
│   ├── __init__.py
│   ├── html_parser.py
│   ├── feature_extraction.py
│   ├── train.py
│   ├── predict.py
│   └── evaluate.py
├── notebooks/
│   └── exploration.ipynb
├── requirements.txt          # Python dependencies
├── run_train.sh              # CLI runner
└── README.md

🚀 Getting Started

pip install -r requirements.txt
chmod +x run_train.sh
./run_train.sh

🧠 How It Works

validate_and_filter_recipes.py filters out unusable recipes by confirming that the text in each labeled JSON file actually appears in the corresponding HTML page.
html_parser.py extracts text blocks from the HTML, along with each block's tag and depth.
feature_extraction.py vectorizes the block text with TF-IDF.
train.py builds and trains a Logistic Regression model.
predict.py applies the trained model to classify the blocks of an HTML page.
evaluate.py prints metrics on test data.
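
To make the parsing and validation steps concrete, here is a minimal sketch using only the Python standard library. It is not the code in html_parser.py or validate_and_filter_recipes.py: the names BlockExtractor, extract_blocks, and label_appears are hypothetical, and the real project may use a different parser and different matching rules. The sketch records each text node with its enclosing tag and nesting depth, then checks whether a labeled string occurs in any extracted block.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class BlockExtractor(HTMLParser):
    """Collect tag, depth, and text for every non-empty text node (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.blocks = []  # one dict per text block: tag, depth, text

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed tags in between.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and self.stack and self.stack[-1] not in ("script", "style"):
            self.blocks.append(
                {"tag": self.stack[-1], "depth": len(self.stack), "text": text}
            )


def extract_blocks(html):
    """Return the list of text blocks found in an HTML string."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


def label_appears(label_text, blocks):
    """Validation check: does the labeled text occur in any extracted block?"""
    needle = " ".join(label_text.split()).lower()
    return any(needle in block["text"].lower() for block in blocks)


if __name__ == "__main__":
    sample = ("<html><body><h1>Pancakes</h1>"
              "<ul><li>2 cups flour</li><li>1 egg</li></ul></body></html>")
    for block in extract_blocks(sample):
        print(block)
```

Matching here is case-insensitive and whitespace-normalized, which is one plausible way to tolerate formatting differences between the labels and the page; a stricter or fuzzier comparison may suit real data better.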

🧪 Example

python src/predict.py ../data/html_pages/recipe_00001.html
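
For a sense of what classification behind a command like this might involve, the following sketch trains a TF-IDF plus Logistic Regression pipeline on a handful of made-up text blocks and labels new ones. It assumes scikit-learn, which this README does not name explicitly, and the functions build_model and classify_blocks, the training strings, and the n-gram setting are all illustrative rather than taken from train.py or predict.py.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data, invented for illustration: block text and its component label.
train_texts = [
    "Fluffy Buttermilk Pancakes",
    "Classic Tomato Soup",
    "2 cups all-purpose flour",
    "1 tablespoon olive oil",
    "Whisk the flour and milk until smooth.",
    "Simmer the soup for 20 minutes, then serve.",
]
train_labels = [
    "title", "title",
    "ingredients", "ingredients",
    "directions", "directions",
]


def build_model():
    """TF-IDF features feeding a Logistic Regression classifier (assumed setup)."""
    return make_pipeline(
        TfidfVectorizer(lowercase=True, ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )


def classify_blocks(model, texts):
    """Pair each text block with the label the fitted model predicts for it."""
    return list(zip(texts, model.predict(texts)))


if __name__ == "__main__":
    model = build_model().fit(train_texts, train_labels)
    for text, label in classify_blocks(model, ["3 cups sugar", "Stir until thick."]):
        print(f"{label:12s} {text}")
```

Six examples are far too few for reliable predictions; the point is only the shape of the pipeline. A fitted pipeline like this can be saved with joblib.dump and reloaded with joblib.load, which matches the .joblib file kept in models/.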

📝 License

Apache-2.0 License

kriserickson/recipe-parser
