Full-Stack Medical Data Extraction & Knowledge Graph (Streamlit + FastAPI + Neo4j)
MedForm AI is a comprehensive platform for structured medical data extraction. It ingests clinical notes from text, scanned forms, or voice recordings, uses an LLM-powered extraction pipeline, optionally enhanced with OCR or speech-to-text models, and persists structured data in Neo4j as a semantic knowledge graph. A staff-facing Streamlit console allows review, approval, visualization, and management of patient records.
- Accepts multiple input modalities: text, images, or audio recordings.
- Extracts structured fields via LLMs: patient name, age, symptoms, vitals, recommendations, and other medical data.
- Normalizes and stores extracted data as graph nodes and relationships in Neo4j.
- Provides API endpoints for querying data and rendering interactive graph visualizations.
- Provides a staff console for reviewing, editing, approving, and managing historical records.
High-level workflow:
- Staff, clinicians, and data entry staff interact with the Streamlit frontend:
  - Upload text → `/extract/text`
  - Upload scanned images → `/extract/image` (OCR)
  - Upload voice notes → `/extract/audio` (transcription)
  - Review & approve extracted data → `/graph/insert`
  - Visualize patient graph → `/graph/patient/{pid}` & `/graph/patient/{pid}/image`
  - Access dashboard and historical records → `/records/*` or `/patients/list`
- The FastAPI backend processes extraction, normalization, and graph insertion.
- The Neo4j database stores patient records as nodes and relationships for semantic querying.
Tech stack: Python 3.10+, FastAPI, Streamlit, Neo4j 5.x, Tesseract OCR, Whisper (optional), matplotlib & NetworkX for graph rendering, requests, python-dotenv.
Multi-modal extraction
- Free-text clinical notes
- Scanned forms processed via OCR
- Voice notes transcribed via Whisper or alternative STT models
Structured JSON extraction via LLM
- Schema enforced: `patient_name`, `age`, `duration`, `symptoms`, `vital_signs`, `recommendations`
- Backend normalization: `_patient_id`, `_ingested_at` (timestamp), `_symptom_list`
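The normalization step can be illustrated with a minimal sketch. The field names come from the schema above; everything else is an assumption made for illustration, in particular the function name `normalize_record`, the rule that derives `_patient_id` from a hash of name and age, and the handling of comma-separated symptom strings. The actual backend may do this differently.

```python
import hashlib
from datetime import datetime, timezone

# Field names taken from the extraction schema above.
SCHEMA_FIELDS = (
    "patient_name", "age", "duration", "symptoms", "vital_signs", "recommendations",
)


def normalize_record(raw: dict) -> dict:
    """Return a copy of `raw` restricted to the schema, plus backend-added fields."""
    # Unknown keys are dropped; missing schema keys become None.
    record = {field: raw.get(field) for field in SCHEMA_FIELDS}

    # Assumption: symptoms may arrive as a list or as a comma-separated string.
    symptoms = record["symptoms"] or []
    if isinstance(symptoms, str):
        symptoms = symptoms.split(",")
    cleaned = {str(s).strip().lower() for s in symptoms if str(s).strip()}
    record["_symptom_list"] = sorted(cleaned)

    # Assumption: a stable ID derived from name and age (hypothetical rule).
    name = str(record["patient_name"] or "").strip().lower()
    key = f"{name}|{record['age']}"
    record["_patient_id"] = hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]

    # UTC timestamp in ISO 8601 format.
    record["_ingested_at"] = datetime.now(timezone.utc).isoformat()
    return record
```

A deterministic ID of this kind is one way to make repeated submissions for the same patient land on the same graph node, at the cost of collisions between patients who share a name and age.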
Graph-based knowledge storage
- Nodes: Patient, Symptom, Vital, Condition, Duration, Recommendation
- Relationships: HAS_SYMPTOM, HAS_VITAL, HAS_RECOMMENDATION, HAS_DURATION, INDICATES
- Idempotent inserts: repeated submissions update existing nodes rather than creating duplicates
Staff console
- Dashboard with KPIs, symptom distribution, and age statistics
- Extraction preview, editing, and approval workflow
- Graph viewer with JSON and PNG visualization
- Export data in CSV or JSON format
Extensible & maintainable architecture
- Modular backend for extraction, normalization, graph storage, and frontend
- Clear separation of concerns enables easy extension for additional medical mappings or new input modalities
System requirements:
- Python 3.10+
- Tesseract OCR with required language packs (`eng` recommended, `deu` optional)
- `ffmpeg` for audio transcription
- Running Neo4j instance accessible from backend
Environment variables (.env)
- `OPENROUTER_API_KEY`: your OpenRouter API key
- `NEO4J_URI`: bolt://localhost:7687 (or your Neo4j URI)
- `NEO4J_USER`: neo4j (or your username)
- `NEO4J_PASS`: your password
- `WHISPER_MODEL`: base (optional, for audio transcription)

Keep `.env` out of version control to secure API keys and credentials.
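As a rough sketch of how these variables might be read at startup: the `Settings` class and `load_settings` function below are hypothetical names, and treating `OPENROUTER_API_KEY` and `NEO4J_PASS` as required while defaulting the others is an assumption based on the list above. `load_dotenv` comes from python-dotenv, which is part of the tech stack; the sketch falls back to the process environment if that package is not installed.

```python
import os
from dataclasses import dataclass
from typing import Mapping

try:
    # load_dotenv() reads a local .env file into os.environ if one exists.
    from dotenv import load_dotenv
    load_dotenv()
except ImportError:
    pass  # Fall back to variables already set in the process environment.


@dataclass(frozen=True)
class Settings:
    openrouter_api_key: str
    neo4j_uri: str
    neo4j_user: str
    neo4j_pass: str
    whisper_model: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the variables listed above, failing early if a required one is missing."""
    required = ("OPENROUTER_API_KEY", "NEO4J_PASS")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return Settings(
        openrouter_api_key=env["OPENROUTER_API_KEY"],
        neo4j_uri=env.get("NEO4J_URI", "bolt://localhost:7687"),
        neo4j_user=env.get("NEO4J_USER", "neo4j"),
        neo4j_pass=env["NEO4J_PASS"],
        whisper_model=env.get("WHISPER_MODEL", "base"),
    )
```

Failing at startup with the names of the missing variables tends to be easier to diagnose than a later authentication error from OpenRouter or Neo4j.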
- Clone the repository.
- Install the Python dependencies that the backend and frontend code import (see the tech stack above).
- Ensure Neo4j, Tesseract and ffmpeg are installed and properly configured.
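- Start the backend: `uvicorn main:app --reload --host 0.0.0.0 --port 8000`.
- Start the frontend: navigate to the `frontend` folder, then run `streamlit run app.py`.
- Set the API URL in the Streamlit sidebar to the backend address (`http://localhost:8000` with the command above). The listed default, `http://localhost:8501`, is the port Streamlit itself normally serves on, so verify the value before use.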
You should now have access to the full extraction, review, graph visualization, and record management functionality.
| Path | Method | Description |
|---|---|---|
| `/extract/text` | POST | Parse free-text clinical notes |
| `/extract/image` | POST | Upload image → OCR + parsing |
| `/extract/audio` | POST | Upload audio → transcription + parsing |
| `/graph/insert` | POST | Insert or update approved structured data in Neo4j |
| `/patients/list` | GET | Retrieve list of patient IDs (recent first) |
| `/graph/patient/{pid}` | GET | Get JSON representation of patient graph |
| `/graph/patient/{pid}/image` | GET | Get base64 PNG rendering of patient graph |
| `/records/all` | GET | Retrieve flattened historical records for all patients |
| `/records/search?name=…` | GET | Search patients by name (partial or case-insensitive) |
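The extract-then-insert flow through these endpoints could look roughly like the sketch below. The request and response shapes are assumptions, since the table does not specify them: here `/extract/text` is assumed to accept `{"text": ...}` and return the structured record, which `/graph/insert` is assumed to accept as-is. The helper names are hypothetical, and the standard library is used instead of `requests` only to keep the sketch free of dependencies.

```python
import json
import urllib.request

API_URL = "http://localhost:8000"  # Backend address from the uvicorn command; adjust as needed.


def build_json_post(path: str, payload: dict, base_url: str = API_URL) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for one of the endpoints above."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(request: urllib.request.Request) -> dict:
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_and_insert(note: str) -> dict:
    """Extract structured data from a note, then store it in the graph."""
    # Assumed payload shape; in the real workflow a staff member reviews
    # and approves the extracted record between these two calls.
    extracted = call(build_json_post("/extract/text", {"text": note}))
    return call(build_json_post("/graph/insert", extracted))
```

With the backend running, `extract_and_insert("45-year-old with fever for 3 days")` would exercise both endpoints in sequence, provided the assumed payload shapes match the actual API.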
- Extraction schema from LLM: `patient_name`, `age`, `duration`, `symptoms`, `vital_signs`, `recommendations`
- Normalized fields added by backend: `_patient_id`, `_ingested_at`, `_symptom_list`
- Graph nodes: Patient, Symptom, Vital, Condition, Duration, Recommendation
- Relationships: HAS_SYMPTOM, HAS_VITAL, HAS_RECOMMENDATION, HAS_DURATION, INDICATES (symptom → condition)
- Idempotent merge ensures repeated submissions update existing data instead of creating duplicates
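The idempotent behaviour described above is what Cypher's `MERGE` clause provides: it matches an existing node or relationship and only creates one when none is found. The sketch below shows the idea for the Patient and Symptom nodes only. The property names (`id`, `name`, `age`, `ingested_at`) and the function `upsert_patient` are assumptions for illustration, not the project's actual queries.

```python
# Assumed property names; only the labels and HAS_SYMPTOM come from the data model above.
MERGE_PATIENT = """
MERGE (p:Patient {id: $patient_id})
SET p.name = $name, p.age = $age, p.ingested_at = $ingested_at
"""

MERGE_SYMPTOMS = """
MATCH (p:Patient {id: $patient_id})
UNWIND $symptoms AS symptom_name
MERGE (s:Symptom {name: symptom_name})
MERGE (p)-[:HAS_SYMPTOM]->(s)
"""


def upsert_patient(tx, record: dict) -> None:
    """Write one normalized record; MERGE reuses existing nodes instead of duplicating them."""
    # `tx` is any object with a run(query, **params) method, such as a
    # managed transaction from the Neo4j Python driver.
    tx.run(
        MERGE_PATIENT,
        patient_id=record["_patient_id"],
        name=record["patient_name"],
        age=record["age"],
        ingested_at=record["_ingested_at"],
    )
    tx.run(
        MERGE_SYMPTOMS,
        patient_id=record["_patient_id"],
        symptoms=record["_symptom_list"],
    )
```

With the official Neo4j Python driver 5.x, this function can be passed to `session.execute_write(upsert_patient, record)`. Running it twice with the same record leaves a single Patient node and a single HAS_SYMPTOM relationship per symptom.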
- Keep `.env` and API keys secret.
- Use HTTPS in production and implement authentication and role-based access control.
- For real patient data, comply with GDPR/HIPAA: consider encryption, pseudonymization, audit logging, and data retention policies.
| Issue | Solution |
|---|---|
| Missing OPENROUTER_API_KEY | Ensure .env exists and contains a valid API key |
| OCR errors | Verify Tesseract is installed and language packs match (eng, deu) |
| Audio extraction fails | Ensure ffmpeg is installed, WHISPER_MODEL configured, and model loaded correctly |
| Neo4j connection fails | Check NEO4J_URI, credentials, and network connectivity |
| Invalid JSON from LLM | Inspect the raw response; adjust the prompt or fall back to an alternative model |
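For the invalid-JSON case, one common mitigation is to parse the reply tolerantly before giving up, since models often wrap JSON in code fences or explanatory text. The sketch below is one such approach under that assumption; `parse_llm_json` is a hypothetical helper, not a function from this project.

```python
import json


def parse_llm_json(raw: str) -> dict:
    """Parse a JSON object from an LLM reply, tolerating fences and surrounding prose."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span, which skips fences and preambles.
        start, end = raw.find("{"), raw.rfind("}")
        if start == -1 or end <= start:
            raise ValueError(f"No JSON object found in response: {raw[:200]!r}")
        try:
            parsed = json.loads(raw[start : end + 1])
        except json.JSONDecodeError as exc:
            raise ValueError(f"Malformed JSON in response: {exc}") from exc
    if not isinstance(parsed, dict):
        raise ValueError("Expected a JSON object at the top level")
    return parsed
```

A `ValueError` from this helper is a reasonable trigger for logging the raw response and retrying with an adjusted prompt or another model.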
Tip: decode base64 output from /graph/patient/{pid}/image and save as PNG for visualization.
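The decoding step in the tip can be sketched as follows. The name of the response field that carries the base64 string is not documented here, so the hypothetical `save_graph_png` helper takes the string directly. The optional `data:` prefix handling is an assumption in case the endpoint returns a data URL.

```python
import base64
from pathlib import Path

# Every valid PNG file starts with this 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def save_graph_png(image_b64: str, destination: str) -> Path:
    """Decode a base64 string and write it to `destination`, checking the PNG signature."""
    # Tolerate an optional data-URL prefix such as "data:image/png;base64,".
    payload = image_b64.split(",", 1)[1] if image_b64.startswith("data:") else image_b64
    data = base64.b64decode(payload)
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("Decoded data is not a PNG image")
    path = Path(destination)
    path.write_bytes(data)
    return path
```

Checking the signature before writing catches the case where the endpoint returned an error message instead of image data.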
- Add unit tests for parsing, normalization, and graph insertion.
- Improve LLM prompts or expand medical mapping rules to cover more symptoms and conditions.
- Harden production deployment with authentication, HTTPS, and audit logging.
- Add a CI pipeline: linting, type checking, tests.
- Extend schema for additional medical metadata, structured vitals, demographics, or event history.
This project is licensed under the MIT License. See the LICENSE file for details.

