This project builds a supervised learning system that classifies text segments in raw HTML recipe pages into structured components: title, ingredients, directions, and image links.
recipe_extractor/
├── data/
│ ├── html_pages/ # Raw HTML files
│ ├── labels/ # JSON files with labeled data (title, ingredients, etc.)
│ ├── potential_labels/ # Raw candidates needing validation
│ └── processing_state.json # Tracks state of scraping and validation
├── models/ # Trained model (.joblib)
├── src/
│ ├── __init__.py
│ ├── html_parser.py
│ ├── feature_extraction.py
│ ├── train.py
│ ├── predict.py
│ └── evaluate.py
├── notebooks/
│ └── exploration.ipynb
├── run_train.sh # CLI runner
└── README.md
pip install -r requirements.txt
chmod +x run_train.sh
./run_train.sh

validate_and_filter_recipes.py: filters usable recipes by confirming that the text in the labeled JSON appears in the HTML.
html_parser.py: extracts tag, depth, and text blocks from HTML.
feature_extraction.py: vectorizes the text with TF-IDF.
train.py: builds and trains a Logistic Regression model.
predict.py: applies the model to classify HTML blocks.
evaluate.py: prints metrics on test data.
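
The block extraction and label validation steps described above can be illustrated with a minimal sketch that uses only Python's standard-library `html.parser`. This is not the project's actual code: the names `BlockExtractor`, `extract_blocks`, `label_is_usable`, and `VOID_TAGS`, the dictionary layout of a block, and the case-insensitive substring match are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class BlockExtractor(HTMLParser):
    """Collect tag, depth, and text for every non-empty text node (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.blocks = []  # extracted blocks as dicts

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        # Skip empty text, text outside any tag, and script/style contents.
        if text and self.stack and self.stack[-1] not in ("script", "style"):
            self.blocks.append(
                {"tag": self.stack[-1], "depth": len(self.stack), "text": text}
            )


def extract_blocks(html):
    """Return a list of {"tag", "depth", "text"} dicts for one HTML document."""
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


def label_is_usable(label_text, blocks):
    """Assumed validation rule: the labeled text appears inside some block."""
    needle = " ".join(label_text.split()).lower()
    return bool(needle) and any(needle in b["text"].lower() for b in blocks)


if __name__ == "__main__":
    sample = "<html><body><h1>Pancakes</h1><ul><li>2 cups flour</li></ul></body></html>"
    for block in extract_blocks(sample):
        print(block)
```

Each block keeps the enclosing tag and its nesting depth alongside the text, so those structural signals can be used together with TF-IDF features of the text when training the classifier. A labeled recipe whose title or ingredients fail `label_is_usable` would be filtered out before training under this assumed rule.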

python src/predict.py ../data/html_pages/recipe_00001.html

Apache-2.0 License