Most RAG systems treat generation as an effect of retrieval.
They assume:
- retrieval success → answer should be generated
- hallucination is primarily a retrieval failure
- generation behavior is opaque and uninspectable
These assumptions hold in demos and fail in production systems.
This repository exists to isolate and study generation as a controlled system layer, independent of retrieval quality.
This is not a prompt-engineering repo. It is a policy-first examination of when a system should speak at all.
Given retrieved evidence, when should a system answer, hedge, or refuse — and how can that decision be made observable and auditable?
This repository investigates only:
- how generation decisions are made
- how those decisions are logged
- how refusal and hedging emerge as correct outcomes
It does not attempt to prove that generation quality improves.
This system deliberately avoids:
- Claims of improved answer correctness
- Claims of hallucination elimination
- Question intent inference or semantic parsing
- Value-type detection (e.g. “this question requires a year”)
- Ambiguity resolution within a single source
- LLM-based self-grading or confidence scoring
If you are looking for a system that “answers better,” that is out of scope.
Base system
Built directly on agent-memory-systems.
Frozen components (unchanged)
- Corpus
- Chunking strategy
- Embeddings
- Retrieval logic
- Planner / Executor split
- Memory mechanisms
Only new surface area
➡️ Generation policy and arbitration
No other layer behavior is modified.
Every run produces exactly one generation decision:
- ANSWER — evidence judged sufficient
- HEDGE — evidence judged conflicting
- REFUSE — evidence judged insufficient
There are no mixed states and no implicit fallbacks.
Generation text (if any) is strictly downstream of this decision.
Retrieved evidence is evaluated using deliberately minimal heuristics:
- similarity thresholds
- coarse coverage checks
- cross-source conflict detection
The system does not:
- infer question intent
- reason about required answer types
- interpret semantic ambiguity
These omissions are intentional architectural limits, not oversights.
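A minimal sketch of what such heuristics could look like. All names, thresholds, and the conflict check below are illustrative assumptions, not the repository's actual implementation:

```python
from dataclasses import dataclass

# Hypothetical retrieved chunk: text plus similarity score and source metadata.
@dataclass
class Chunk:
    text: str
    source: str
    similarity: float

# Illustrative thresholds; the real values are a policy choice.
SIM_THRESHOLD = 0.75
MIN_COVERING_CHUNKS = 2

def assess_evidence(chunks: list[Chunk]) -> str:
    """Map retrieved chunks to one of three coarse assessments."""
    strong = [c for c in chunks if c.similarity >= SIM_THRESHOLD]
    if len(strong) < MIN_COVERING_CHUNKS:
        return "insufficient"          # coarse coverage check failed
    if len({c.source for c in strong}) > 1 and _sources_disagree(strong):
        return "conflicting"           # cross-source conflict detected
    return "sufficient"

def _sources_disagree(chunks: list[Chunk]) -> bool:
    # Placeholder conflict check: a real system would compare extracted
    # claims; here we only flag chunks whose word sets are pairwise disjoint.
    texts = [set(c.text.lower().split()) for c in chunks]
    return all(a.isdisjoint(b)
               for i, a in enumerate(texts)
               for b in texts[i + 1:])
```

Note that nothing here inspects the question itself — the assessment is a function of the retrieved evidence alone, which is exactly the architectural limit stated above.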
Generation policy is a pure mapping from evidence assessment to outcome:
| Evidence Assessment | Generation Outcome |
|---|---|
| sufficient | ANSWER |
| conflicting | HEDGE |
| insufficient | REFUSE |
The policy layer:
- does not access retrieval artifacts
- does not access memory
- does not call an LLM
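Because the mapping is total and pure, it can be expressed as a single lookup. A sketch (enum and function names are illustrative):

```python
from enum import Enum

class Assessment(Enum):
    SUFFICIENT = "sufficient"
    CONFLICTING = "conflicting"
    INSUFFICIENT = "insufficient"

class Outcome(Enum):
    ANSWER = "ANSWER"
    HEDGE = "HEDGE"
    REFUSE = "REFUSE"

# The entire policy: a total, pure mapping with no access to retrieval
# artifacts, memory, or an LLM. Every assessment has exactly one outcome.
POLICY = {
    Assessment.SUFFICIENT: Outcome.ANSWER,
    Assessment.CONFLICTING: Outcome.HEDGE,
    Assessment.INSUFFICIENT: Outcome.REFUSE,
}

def decide(assessment: Assessment) -> Outcome:
    return POLICY[assessment]
```

Keeping the policy this small is what makes it auditable: there is no branch that cannot be enumerated, and no input other than the assessment.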
All decisions are logged as first-class events.
This repository produces:
- evidence assessment logs
- generation policy decision logs
- episodic records of generation outcomes
These artifacts exist to explain why a decision occurred.
They are not used to influence future behavior.
Logs are never treated as memory.
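One way to make each decision a first-class, append-only event is a self-describing JSON record. The field names below are illustrative assumptions, not the repository's actual schema:

```python
import json
import time

def log_decision(run_id: str, assessment: str, outcome: str) -> str:
    """Serialize a generation decision as an append-only log event.

    The record exists to explain *why* the outcome occurred; it is
    written once and never read back to influence future behavior.
    """
    event = {
        "event_type": "generation_decision",
        "run_id": run_id,
        "timestamp": time.time(),
        "evidence_assessment": assessment,  # e.g. "conflicting"
        "outcome": outcome,                 # ANSWER | HEDGE | REFUSE
    }
    return json.dumps(event, sort_keys=True)
```

Emitting the decision as its own event, separate from any generated text, is what lets a later observability layer trace outcomes without parsing answers.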
This repository includes explicit failure-first tests designed to break naïve RAG systems.
Some tests are expected to fail by design.
The system is known to fail in cases where:
- Evidence is topically relevant but lacks a required value → parametric knowledge may leak into generation
- A binary question is dissolved by single-source ambiguity → evidence is sufficient but non-decisive
These failure cases are retained, not patched, and are documented as architectural boundaries.
They motivate future work without being silently corrected.
This repository enables:
- rag-failure-modes — systematic failure taxonomy using generation decisions as evidence
- llm-observability-logs — cross-layer causal tracing over time
It does not attempt synthesis or mitigation.
All claims in this README correspond to:
- logged evidence assessments
- logged generation decisions
- explicit failure-first test cases
No claim relies on subjective answer quality. No claim relies on LLM self-evaluation.
If a behavior cannot be pointed to in an artifact, it is not discussed.
If a system cannot explain why it spoke, it does not understand what it said.
This repository exists to make generation explainable before it is impressive.
This repository completes the generation control layer in the agentic systems arc.
At this point, the system has:
- explicit control over retrieval
- explicit separation of planning and execution
- explicit memory routing
- explicit generation policy
No further capability can be meaningfully evaluated without failure synthesis and observability.
That work begins next with rag-failure-modes.