Interactive symptom-to-diagnosis chat built with a FastAPI backend (TF-IDF + SVD encoder + Torch MLP classifier) and a Vite/React frontend. It is wired to the larger Hugging Face dataset fhai50032/SymptomsDisease246k (246k symptom→disease pairs) and keeps asking the most relevant follow-up questions until it reaches an 85% confidence target. If the downloaded dataset file is missing, it falls back to the smaller Gretel dataset.
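
The follow-up behaviour described above can be pictured with a minimal sketch. It is not the backend's actual code: the function name `next_follow_up`, the question wording, and the example diseases and keywords are all hypothetical, and it assumes each class's keywords arrive already sorted by TF-IDF weight.

```python
from typing import Dict, List, Optional

CONFIDENCE_TARGET = 0.85  # the 85% target stated in this README


def next_follow_up(
    probabilities: Dict[str, float],
    class_keywords: Dict[str, List[str]],
    mentioned: List[str],
) -> Optional[str]:
    """Return a follow-up question, or None once the target is reached."""
    top_class = max(probabilities, key=probabilities.get)
    if probabilities[top_class] >= CONFIDENCE_TARGET:
        return None  # confident enough: stop probing
    text = " ".join(mentioned).lower()
    # Assumption: keywords are ordered by TF-IDF weight, highest first.
    for keyword in class_keywords.get(top_class, []):
        if keyword.lower() not in text:
            return f"Are you also experiencing {keyword}?"
    return None  # every keyword for this class was already mentioned


# Illustrative values only; real probabilities come from the classifier.
probs = {"dengue": 0.62, "influenza": 0.30, "measles": 0.08}
keywords = {"dengue": ["fever", "rash", "joint pain"]}
question = next_follow_up(probs, keywords, ["high fever, chills, rash"])
print(question)
```

Here the top class sits below the target and "joint pain" is the first keyword the user has not mentioned, so that is what the sketch asks about.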

- Preferred dataset: download to `data/symptomsDisease246k.json`:

      mkdir -p data
      curl -L 'https://huggingface.co/datasets/fhai50032/SymptomsDisease246k/resolve/main/symptomsDisease246k.json' -o data/symptomsDisease246k.json

- Fallback (already small): `data/train.jsonl` and `data/test.jsonl` from GretelAI.
- Pipeline: TF-IDF (1–2 grams) → TruncatedSVD (256 dims) → Normalizer → Torch MLP classifier. Per-class TF-IDF keywords drive follow-up questions; replies keep probing until ≥85% confidence.
- API: `POST /api/chat` with `{ "sessionId": null | "<uuid>", "message": "<symptoms>" }` returns predictions, a follow-up question (if under 85% confidence), and accuracy metrics.
- To keep local training fast, set `MAX_TRAIN_SAMPLES=5000` (or any limit) before starting the API; by default it trains on the full dataset.
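
The dataset fallback and the `MAX_TRAIN_SAMPLES` limit can be sketched as follows. The helper names (`choose_dataset`, `train_limit`, `cap`) are hypothetical, and truncating to the first rows is only one possible way to apply the limit; the backend may sample differently.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional

PREFERRED = Path("data/symptomsDisease246k.json")
FALLBACK = [Path("data/train.jsonl"), Path("data/test.jsonl")]


def choose_dataset(root: Path = Path(".")) -> List[Path]:
    """Prefer the 246k file; fall back to the Gretel files if it is absent."""
    preferred = root / PREFERRED
    if preferred.is_file():
        return [preferred]
    return [root / path for path in FALLBACK]


def train_limit(env: Mapping[str, str] = os.environ) -> Optional[int]:
    """Read MAX_TRAIN_SAMPLES; None means train on the full dataset."""
    raw = env.get("MAX_TRAIN_SAMPLES", "").strip()
    return int(raw) if raw else None


def cap(rows: list, limit: Optional[int]) -> list:
    """Keep the first `limit` rows (assumed strategy), or everything if unset."""
    return rows if limit is None else rows[:limit]


limit = train_limit({"MAX_TRAIN_SAMPLES": "5000"})
rows = cap(list(range(246_000)), limit)
print(limit, len(rows))
```

In a shell, the equivalent is to run `export MAX_TRAIN_SAMPLES=5000` before launching the API.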

Run the backend:

    python3 -m venv .venv
    . .venv/bin/activate
    pip install -r backend/requirements.txt
    uvicorn backend.app:app --reload --host 0.0.0.0 --port 8000

Smoke test:

    curl -s -X POST http://localhost:8000/api/chat \
      -H "Content-Type: application/json" \
      -d '{"message":"high fever, chills, rash, pain behind my eyes"}'

Run the frontend:

    cd frontend
    npm install # already done once, safe to re-run
    npm run dev -- --host --port 5173
    # optionally export VITE_API_URL=http://localhost:8000 if you change ports/hosts

Features: chat UI with a running session, top-3 diagnosis confidences with bars, the next follow-up question (until ≥85% confidence), and live model metrics.
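
For scripted checks, the curl smoke test can also be reproduced from Python using only the standard library. This is a sketch: `build_chat_request` and `send` are hypothetical helper names, and because the response field names are not documented here, the reply is returned as a plain dict without assuming its shape.

```python
import json
import urllib.request
from typing import Optional

API_URL = "http://localhost:8000/api/chat"  # host/port from the uvicorn command


def build_chat_request(
    message: str, session_id: Optional[str] = None
) -> urllib.request.Request:
    """Build the POST request; sessionId is null for a new conversation."""
    payload = {"sessionId": session_id, "message": message}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON reply (needs a running backend)."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_chat_request("high fever, chills, rash, pain behind my eyes")
# reply = send(request)  # uncomment once the backend is listening on port 8000
```

Building the request separately from sending it lets you inspect the payload without a live server.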
- This is a prototype; outputs are not medical advice. Always involve a clinician, especially for urgent/ambiguous cases.
- Dataset is synthetic and small; expect biases and gaps. Consider retraining with curated clinical data and adding guardrails before any real use.
- Conversations are in-memory; restart clears sessions. Scale-out will need persistence plus authentication and audit logging.
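
To illustrate why a restart clears sessions, here is a minimal sketch of an in-memory store. `SESSIONS` and `record_message` are hypothetical names, and starting a fresh session for an unknown id is an assumed design choice, not confirmed backend behaviour.

```python
import uuid
from typing import Dict, List, Optional, Tuple

# Hypothetical module-level store: it lives only as long as the process does,
# so restarting the server discards every conversation.
SESSIONS: Dict[str, List[str]] = {}


def record_message(
    session_id: Optional[str], message: str
) -> Tuple[str, List[str]]:
    """Append a message, creating a session when the id is None or unknown."""
    if session_id is None or session_id not in SESSIONS:
        session_id = str(uuid.uuid4())
        SESSIONS[session_id] = []
    SESSIONS[session_id].append(message)
    return session_id, SESSIONS[session_id]


sid, history = record_message(None, "high fever, chills")
sid_again, history = record_message(sid, "rash")
print(sid == sid_again, history)
```

Moving to persistent storage would mean replacing the dict with a database or cache keyed by the same session id.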