Given a work of literary fiction, can we build a model to identify whether statements about specific characters are consistent with the narrative?
This is a text classification task: for each statement and character pair, predict whether the statement contradicts established facts in the story, or aligns with them.
Domain: Natural Language Understanding, Literary Reasoning
Task: Binary Classification (Consistent / Contradicts)
- Train/Test Split: CSV-based labeled statements with source documents (novels)
- Preprocessing: Character-level tokenization; sequential chunk processing with overlap
- Book State Encoding: Build long-form narrative representations by processing entire texts in chunks
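The overlapping-chunk processing described above can be sketched as follows. Chunk size and overlap values here are illustrative defaults, not the project's actual settings:

```python
def chunk_text(text: str, chunk_size: int = 2048, overlap: int = 256):
    """Split a book into overlapping character-level chunks.

    The overlap preserves context that would otherwise be severed at
    chunk boundaries (sizes are illustrative, not the project's own).
    """
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks
```

Each chunk shares its first `overlap` characters with the tail of the previous chunk, so narrative state can be carried across boundaries.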
A custom transformer-like architecture with:
- Multi-head attention-like mechanisms over token sequences
- State-mediated gating to track evolving narrative beliefs
- Sparse latent representations to capture textual patterns
- Consistency scoring based on state divergence
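One way such a state-gated attention block could look is sketched below. The class name, shapes, and use of a GRU cell for the state update are assumptions for illustration, not the project's actual implementation:

```python
import torch
import torch.nn as nn

class StateGatedBlock(nn.Module):
    """Illustrative sketch: attention over tokens, gated by a running
    narrative-state vector (names and shapes are assumptions)."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)  # mixes token features with state
        self.state_update = nn.GRUCell(dim, dim)

    def forward(self, x, state):
        # x: (batch, seq, dim); state: (batch, dim)
        attn_out, _ = self.attn(x, x, x)
        s = state.unsqueeze(1).expand_as(attn_out)
        g = torch.sigmoid(self.gate(torch.cat([attn_out, s], dim=-1)))
        x = x + g * attn_out                             # state-mediated gating
        state = self.state_update(x.mean(dim=1), state)  # evolve narrative state
        return x, state
```

Stacking several such blocks while threading `state` through successive chunks yields the long-form narrative representation used for consistency scoring.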
- Initialize model with standard PyTorch configuration
- Load training data from CSV
- Build book-level narrative states by processing source texts
- Train on labeled examples using cross-entropy loss
- Evaluate on held-out test set with threshold-based predictions
- Save results to `results.csv`
Hyperparameters: 4 layers, 128 embedding dim, 4 attention heads, AdamW optimizer (lr=3e-4)
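Under the stated hyperparameters, the optimizer and loss setup might look like the sketch below. A plain linear classifier stands in for the actual model so the snippet runs on its own; the data tensors are placeholders:

```python
import torch
import torch.nn as nn

# Stand-in for the real model (4 layers, 128-dim, 4 heads); a simple
# classifier head is used here so the sketch is self-contained.
model = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 2))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
criterion = nn.CrossEntropyLoss()  # binary task as 2-class logits

features = torch.randn(32, 128)      # placeholder chunk features
labels = torch.randint(0, 2, (32,))  # 0 = contradicts, 1 = consistent

for epoch in range(3):
    optimizer.zero_grad()
    loss = criterion(model(features), labels)
    loss.backward()
    optimizer.step()
```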
- Model is trained and persisted to `bdh_trained.pt`
- Decision threshold optimized on validation data
- Test predictions saved to `results.csv` with:
  - Prediction ID
  - Binary prediction (0 = contradicts, 1 = consistent)
  - Prediction rationale (minimum consistency score across chunks)
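A threshold swept on validation scores and then applied to test-time minimum consistency scores could be sketched as follows. The column names follow the output format above; the scores, IDs, and accuracy-maximizing sweep are placeholders, not the project's actual criterion:

```python
import numpy as np
import pandas as pd

def best_threshold(scores, labels):
    """Pick the cutoff maximizing validation accuracy
    (a simple sweep; the project may use a different criterion)."""
    candidates = np.unique(scores)
    accs = [((scores >= t).astype(int) == labels).mean() for t in candidates]
    return candidates[int(np.argmax(accs))]

# Placeholder validation scores and labels
val_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7])
val_labels = np.array([0, 0, 0, 1, 1])
t = best_threshold(val_scores, val_labels)

# Minimum consistency score across chunks drives the final prediction
test_scores = np.array([0.2, 0.9])
df = pd.DataFrame({
    "id": [101, 102],
    "prediction": (test_scores >= t).astype(int),
    "score": test_scores,
})
df.to_csv("results.csv", index=False)
```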
Note: This is an applied case study. Evaluation metrics and results are logged during training but not published here; the focus is on demonstrating a complete ML pipeline from raw data to predictions.
```
.
├── README.md                # This file
├── requirements.txt         # Python dependencies
├── rooted_rise_model.py     # Model architecture definition
├── rooted_rise_train.py     # Training script
├── rooted_rise_test.py      # Inference & evaluation script
│
├── data/
│   ├── train.csv            # Training examples (id, content, label, book_name, char)
│   └── test.csv             # Test examples (id, content, book_name, char)
│
├── Books/
│   ├── The Count of Monte Cristo.txt
│   └── In search of the castaways.txt
│
├── bdh_trained.pt           # Trained model weights
├── threshold.pt             # Learned decision threshold
└── results.csv              # Test predictions (output)
```
- Framework: PyTorch (2.0+)
- Data Processing: Pandas, NumPy
- Language: Python 3.8+
- Compute: CPU or CUDA-enabled GPU
```
pip install -r requirements.txt
```

```
python rooted_rise_train.py
```

- Expects: `data/train.csv` and book texts in `Books/`
- Outputs: `bdh_trained.pt`, `threshold.pt`

```
python rooted_rise_test.py
```

- Expects: `data/test.csv`, trained model weights, book texts
- Outputs: `results.csv` with predictions
- This project demonstrates a complete applied ML pipeline: data loading → model training → threshold optimization → evaluation on held-out data.
- The architecture is experimental and designed for this specific task (literary consistency classification).
- All predictions are saved to `results.csv` for post-hoc analysis and error investigation.
- Book state encoding is applied on a per-text basis to provide narrative context to the classifier.
This is a portfolio case study demonstrating applied machine learning fundamentals: problem definition, data engineering, model training, and structured evaluation.