diff --git a/docs/METRICS.adoc b/docs/METRICS.adoc
new file mode 100644
index 0000000..5beacb6
--- /dev/null
+++ b/docs/METRICS.adoc
@@ -0,0 +1,487 @@
+// SPDX-FileCopyrightText: 2025 Jonathan D.A. Jewell
+// SPDX-License-Identifier: AGPL-3.0-or-later
+
+= Vexometer Metrics Reference
+:toc: left
+:toclevels: 3
+:icons: font
+:source-highlighter: rouge
+
+== Overview
+
+Vexometer measures ten dimensions of AI assistant irritation surfaces.
+All metrics are normalised to a 0-1 scale where *lower is better*.
+
+[cols="1,3,3,2", options="header"]
+|===
+|Metric |Full Name |Measures |Satellites
+
+|TII |Temporal Intrusion Index |Time-wasting behaviours |vex-instruction-persistence, vex-verbosity-compressor, vex-clarification-gate
+|LPS |Linguistic Pathology Score |Verbal tics, sycophancy |vex-lazy-eliminator, vex-sycophancy-shield, vex-verbosity-compressor
+|EFR |Epistemic Failure Rate |Hallucination, fabrication |vex-hallucination-guard, vex-sycophancy-shield, vex-confidence-calibrator, vex-context-firewall
+|PQ |Paternalism Quotient |Over-helping, lecturing |vex-scope-governor, vex-clarification-gate
+|TAI |Telemetry Anxiety Index |Privacy concerns |_(no satellites planned)_
+|ICS |Interaction Coherence Score |Conversation consistency |vex-specification-anchor, vex-instruction-persistence, vex-backtrack-enabler, vex-context-firewall
+|CII |Completion Integrity Index |Incomplete outputs |vex-lazy-eliminator
+|SRS |Strategic Rigidity Score |Backtrack resistance |vex-backtrack-enabler
+|SFR |Scope Fidelity Ratio |Scope creep/collapse |vex-specification-anchor, vex-scope-governor
+|RCI |Recovery Competence Index |Error recovery quality |vex-error-recovery
+|===
+
+== Original Metrics (v1)
+
+=== TII: Temporal Intrusion Index
+
+*Definition*: Measures time-wasting behaviours that disrupt user flow state.
+
+*What it measures*:
+
+* Unsolicited output frequency
+* Latency-induced context disruption
+* Interruption of user flow state
+* Auto-completion aggression
+* Unnecessary delays and preambles
+
+*Calculation*:
+
+[source,ada]
+----
+TII = (intrusion_events / total_interactions) * severity_weight
+----
+
+Where `intrusion_events` includes:
+
+* Unprompted suggestions
+* Excessive "let me think" delays
+* Redundant confirmations
+* Verbose preambles ("Sure, I'd be happy to...")
+
+*Example scenarios*:
+
+|===
+|Score |Scenario
+
+|0.1 (good) |Direct answers, minimal preamble, immediate task engagement
+|0.7 (poor) |"I'd be happy to help you with that! Let me think about this carefully. First, let me make sure I understand..."
+|===
+
+*Reduced by*: vex-instruction-persistence, vex-verbosity-compressor, vex-clarification-gate
+
+---
+
+=== LPS: Linguistic Pathology Score
+
+*Definition*: Measures verbal tics, padding, sycophancy, and filler content.
+
+*What it measures*:
+
+* Sycophancy density ("Great question!", "Excellent point!")
+* Hedge word ratio ("perhaps", "maybe", "it's possible")
+* Corporate speak frequency
+* Unnecessary repetition
+* Emoji/decoration abuse
+
+*Calculation*:
+
+[source,ada]
+----
+LPS = weighted_sum(pathology_instances) / response_length
+----
+
+*Example patterns*:
+
+[source]
+----
+HIGH LPS:
+"That's a great question! I'd be happy to help you with that.
+ Essentially, what you're asking about is basically..."
+
+LOW LPS:
+"The function returns null when the input is empty."
+----
+
+*Reduced by*: vex-lazy-eliminator, vex-sycophancy-shield, vex-verbosity-compressor
+
+---
+
+=== EFR: Epistemic Failure Rate
+
+*Definition*: Measures hallucination, false confidence, and fabrication.
+
+*What it measures*:
+
+* Confident hallucination frequency
+* Fabricated reference rate (fake citations, non-existent packages)
+* Context ignorance incidents
+* Calibration error (confidence vs correctness correlation)
+
+*Calculation*:
+
+[source,ada]
+----
+EFR = (verified_errors + fabrications) / (total_claims * confidence_weight)
+----
+
+*Example scenarios*:
+
+|===
+|Score |Scenario
+
+|0.05 (good) |States uncertainty when unsure; claims verified against sources
+|0.8 (poor) |"The `left-pad` function was added to JavaScript in ES2019" (fabricated)
+|===
+
+*Reduced by*: vex-hallucination-guard, vex-sycophancy-shield, vex-confidence-calibrator, vex-context-firewall
+
+---
+
+=== PQ: Paternalism Quotient
+
+*Definition*: Measures over-helping, unsolicited warnings, and lecturing.
+
+*What it measures*:
+
+* Unsolicited warning rate
+* Explanation verbosity ratio
+* Competence assumption failures (over-explaining basics to experts)
+* Refusal-with-lecture frequency
+
+*Calculation*:
+
+[source,ada]
+----
+PQ = (unsolicited_advice + excessive_warnings + competence_failures) / interactions
+----
+
+*Example scenarios*:
+
+|===
+|Score |Scenario
+
+|0.1 (good) |Provides requested information without moral lectures
+|0.8 (poor) |"I must warn you that this code could be dangerous. Please be careful. Also, you should consider..."
+|===
+
+*Reduced by*: vex-scope-governor, vex-clarification-gate
+
+---
+
+=== TAI: Telemetry Anxiety Index
+
+*Definition*: Measures privacy concerns and surveillance behaviours.
+
+*What it measures*:
+
+* Data collection transparency score
+* Opt-out friction measure
+* Code/query transmission clarity
+* Third-party sharing concerns
+
+*Note*: This metric is difficult to automate and often requires manual assessment.
+
+*Reduced by*: _(no satellites planned - requires policy changes)_
+
+---
+
+=== ICS: Interaction Coherence Score
+
+*Definition*: Measures conversation flow, consistency, and coherence.
+
+*What it measures*:
+
+* Repeated failure rate (same error multiple times)
+* Learning-from-dismissal measure
+* Circular conversation frequency
+* Context retention quality
+
+*Calculation*:
+
+[source,ada]
+----
+ICS = 1.0 - (coherent_interactions / total_interactions)
+----
+
+*Example scenarios*:
+
+|===
+|Score |Scenario
+
+|0.1 (good) |Maintains context, remembers previous corrections, consistent terminology
+|0.7 (poor) |Repeats previously-corrected errors, contradicts earlier statements
+|===
+
+*Reduced by*: vex-specification-anchor, vex-instruction-persistence, vex-backtrack-enabler, vex-context-firewall
+
+---
+
+== Extended Metrics (v2)
+
+=== CII: Completion Integrity Index
+
+*Definition*: Measures incomplete outputs, placeholders, and lazy generation.
+
+*What it measures*:
+
+* TODO/FIXME comments in generated code
+* Placeholder text ("...", "etc.", "and so on")
+* Unimplemented code stubs (`unimplemented!()`, `pass`, `raise NotImplementedError`)
+* Truncation markers ("// rest similar", "continue the pattern")
+* Null implementations (empty function bodies)
+* Ellipsis in code blocks
+* Stub returns without real logic
+
+*Incompleteness kinds and severities*:
+
+[cols="2,1,3", options="header"]
+|===
+|Kind |Severity |Examples
+
+|Todo_Comment |Medium |TODO, FIXME, XXX, HACK
+|Placeholder_Text |High |"...", "etc.", "similar to above"
+|Unimplemented_Code |Critical |`unimplemented!()`, `pass`, `todo!()`
+|Truncation_Marker |Critical |"// rest similar", "implement remaining"
+|Null_Implementation |High |Empty function bodies, `() => {}`
+|Ellipsis_Code |Critical |Literal `...` in code blocks
+|Stub_Return |Medium |`return None` without logic
+|===
+
+*Calculation*:
+
+[source,ada]
+----
+CII = weighted_sum(detections) / content_length
+----
+
+*Example scenarios*:
+
+|===
+|Score |Scenario
+
+|0.0 (ideal) |Complete, working code with no placeholders or stubs
+|0.3 (moderate) |Code with TODO comments but functional
+|0.9 (poor) |"Here's the
structure: `def process(): pass # TODO implement`"
+|===
+
+*Reduced by*: vex-lazy-eliminator
+
+*Ada specification*: `src/vexometer-cii.ads`
+
+---
+
+=== SRS: Strategic Rigidity Score
+
+*Definition*: Measures resistance to backtracking and sunk-cost behaviour.
+
+*What it measures*:
+
+* Patch-on-patch fixes (fixing fixes instead of restarting)
+* Restart resistance ("We can salvage this")
+* Sunk-cost language ("We've already...", "Since we started...")
+* Defensive patching (defending broken approach)
+* Escalating complexity (each fix makes it worse)
+* Approach anchoring (stuck on initial approach)
+* Backtrack avoidance
+
+*Rigidity indicators and severities*:
+
+[cols="2,1,3", options="header"]
+|===
+|Indicator |Severity |Detection Pattern
+
+|Patch_On_Patch |High |Nested corrections, "Let me fix that fix"
+|Restart_Resistance |Medium |"Let's keep going", "We can salvage"
+|Sunk_Cost_Language |Medium |"Given the work so far", "We've already"
+|Defensive_Patching |High |"The approach is fine, we just need to..."
+|Escalating_Complexity |Critical |Each fix adds workarounds, special cases
+|Approach_Anchoring |High |Ignoring suggestions for alternatives
+|Backtrack_Avoidance |Medium |Never saying "let's start over"
+|===
+
+*Approach states*:
+
+1. Initial → First approach
+2. Modified → Small modifications
+3. Patched → Fixes applied
+4. Heavily_Patched → Multiple fix rounds
+5. Should_Restart → Clearly needs fresh start
+
+*Calculation*:
+
+[source,ada]
+----
+SRS = weighted_sum(rigidity_events) / total_turns
+----
+
+*Example scenarios*:
+
+|===
+|Score |Scenario
+
+|0.1 (good) |"This approach isn't working. Let me try a different strategy."
+|0.8 (poor) |After 5 failed patches: "Let me try one more small fix..."
+|===
+
+*Reduced by*: vex-backtrack-enabler
+
+*Ada specification*: `src/vexometer-srs.ads`
+
+---
+
+=== SFR: Scope Fidelity Ratio
+
+*Definition*: Measures alignment between requested scope and delivered scope.
+
+*What it measures*:
+
+* Scope creep (delivering unrequested features)
+* Scope collapse (omitting requested features)
+* Partial delivery (incomplete items)
+* Scope mutation (different interpretation)
+* Explicit violations (delivering excluded items)
+
+*Requirement levels*:
+
+* *Must* - Required, explicitly requested
+* *Should* - Expected, implied by request
+* *May* - Optional, mentioned as possibility
+* *Must_Not* - Explicitly excluded
+
+*Deviation types and severities*:
+
+[cols="2,1,3", options="header"]
+|===
+|Deviation |Severity |Description
+
+|Delivered_As_Requested |None |Perfect match
+|Scope_Creep |Medium |Added unrequested features
+|Scope_Collapse |Critical |Omitted requested items
+|Partial_Delivery |High |Items incomplete
+|Scope_Mutation |High |Different interpretation
+|Explicit_Violation |Critical |Delivered excluded items
+|===
+
+*Calculation*:
+
+[source,ada]
+----
+SFR = 1.0 - (weighted_fidelity_sum / requested_items)
+----
+
+*Example scenarios*:
+
+|===
+|Score |Scenario
+
+|0.0 (ideal) |Delivered exactly what was requested, no more, no less
+|0.4 (moderate) |Added helpful logging (scope creep) but delivered core request
+|0.8 (poor) |Asked for login form, received entire auth system with OAuth
+|===
+
+*Reduced by*: vex-specification-anchor, vex-scope-governor
+
+*Ada specification*: `src/vexometer-sfr.ads`
+
+---
+
+=== RCI: Recovery Competence Index
+
+*Definition*: Measures error recovery quality and strategy variation.
+
+*What it measures*:
+
+* Identical retry detection (same approach repeated)
+* Minor variation detection (slight tweaks)
+* Strategy changes (fundamentally different approaches)
+* Root cause analysis (explaining why it failed)
+* Appropriate escalation (asking for help when needed)
+* Infinite loop detection (3+ identical attempts)
+* Premature surrender (giving up too easily)
+
+*Recovery behaviours and scores*:
+
+[cols="2,1,3", options="header"]
+|===
+|Behaviour |Score |Description
+
+|Root_Cause_Analysis |0.1 |Excellent - identified why it failed
+|Strategy_Change |0.2 |Good - fundamentally different method
+|Appropriate_Escalate |0.2 |Good - recognised need for help
+|Premature_Surrender |0.5 |Medium - gave up too easily
+|Minor_Variation |0.6 |Marginal - slight tweak
+|Identical_Retry |0.9 |Bad - same approach repeated
+|Infinite_Loop |1.0 |Critical - 3+ identical attempts
+|===
+
+*Loop detection states*:
+
+1. No_Loop → Normal operation
+2. Potential_Loop → 2 similar attempts
+3. Confirmed_Loop → 3+ similar attempts
+4. Broken_Loop → Was looping, now trying different
+
+*Calculation*:
+
+[source,ada]
+----
+RCI = average(behaviour_scores) * loop_penalty
+----
+
+*Example scenarios*:
+
+|===
+|Score |Scenario
+
+|0.1 (good) |"The error was due to null input. Let me add validation."
+|0.9 (poor) |Tries same failing command 4 times with no changes
+|===
+
+*Reduced by*: vex-error-recovery
+
+*Ada specification*: `src/vexometer-rci.ads`
+
+---
+
+== ISA Score Calculation
+
+The overall Irritation Surface Analysis (ISA) score is calculated as:
+
+[source,ada]
+----
+ISA = sum(category_score * category_weight) / sum(category_weight) * 100
+----
+
+*Default category weights*:
+
+[cols="1,1,2", options="header"]
+|===
+|Category |Weight |Rationale
+
+|TII |1.0 |Standard
+|LPS |1.2 |Users complain most about this
+|EFR |1.5 |Most damaging to trust
+|PQ |1.1 |Moderate irritation
+|TAI |0.8 |Hard to automate
+|ICS |1.3 |Critical for usability
+|CII |1.4 |High user frustration
+|SRS |1.2 |Impacts productivity
+|SFR |1.3 |Direct request alignment
+|RCI |1.1 |Error recovery matters
+|===
+
+*ISA classification*:
+
+[cols="1,2", options="header"]
+|===
+|Classification |Score Range
+
+|Excellent |< 20
+|Good |20-35
+|Acceptable |35-50
+|Poor |50-70
+|Unusable |> 70
+|===
+
+== See Also
+
+* link:SATELLITES.adoc[Satellite Architecture] - Intervention tools that reduce metrics
+* link:SPECIFICATION.md[Full Specification] - Technical specification document
diff --git a/docs/SATELLITES.adoc b/docs/SATELLITES.adoc
new file mode 100644
index 0000000..63bfd85
--- /dev/null
+++ b/docs/SATELLITES.adoc
@@ -0,0 +1,415 @@
+// SPDX-FileCopyrightText: 2025 Jonathan D.A. Jewell
+// SPDX-License-Identifier: AGPL-3.0-or-later
+
+= Vexometer Satellite Architecture
+:toc: left
+:toclevels: 3
+:icons: font
+:source-highlighter: rouge
+
+== Philosophy
+
+Vexometer is a *diagnostic instrument* - it measures AI assistant irritation
+surfaces but does not fix them. Interventions that reduce irritation are
+implemented in separate *satellite repositories*.
+
+This separation provides:
+
+* *Clean separation of concerns* - measurement vs intervention
+* *Independent development* - satellites can evolve separately
+* *Optional integration* - satellites work standalone
+* *Scope discipline* - vexometer avoids feature creep
+
+== Architecture Overview
+
+[source]
+----
+gitlab.com/hyperpolymath/
+├── vexometer/                    # Diagnostic instrument (HUB)
+│   ├── Measures 10 irritation metrics
+│   ├── Provides trace format specification
+│   └── Validates satellite efficacy claims
+│
+├── vexometer-satellites/         # Umbrella index repo
+│   ├── Documentation hub
+│   └── Links to all satellites
+│
+└── vex-*/                        # Individual intervention repos
+    ├── vex-lazy-eliminator/
+    ├── vex-hallucination-guard/
+    └── ...
+----
+
+== Satellite List
+
+=== High Priority
+
+[cols="2,2,3,2", options="header"]
+|===
+|Satellite |Reduces |Description |Language
+
+|vex-lazy-eliminator
+|CII, LPS
+|Completeness enforcement for LLM outputs. AST-level validation with
+completeness predicates.
+|Rust
+
+|vex-hallucination-guard
+|EFR
+|Verification layer for LLM factual claims. Registry adapters, citation
+validators, URL checking.
+|Rust + Elixir
+
+|vex-sycophancy-shield
+|LPS, EFR
+|Epistemic commitment tracking. AGM belief revision, justification graphs,
+pressure detection.
+|Haskell + Elixir
+|===
+
+=== Medium Priority
+
+[cols="2,2,3,2", options="header"]
+|===
+|Satellite |Reduces |Description |Language
+
+|vex-confidence-calibrator
+|EFR
+|Structured uncertainty for LLM outputs. Epistemic status enum, calibration
+tracking, Brier score optimisation.
+|Rust
+
+|vex-specification-anchor
+|SFR, ICS
+|Immutable requirements ledger. Content-addressed spec storage, amendment
+protocol, deviation detection.
+|Rust + CUE/Dhall
+
+|vex-instruction-persistence
+|TII, ICS
+|System instruction compliance enforcement. Instruction registry with salience
+weights, decay detection.
+|Elixir/OTP
+
+|vex-backtrack-enabler
+|SRS, ICS
+|Low-friction restart support.
Decision tree with savepoints, cost-benefit
+analysis, debt tracking.
+|Elixir/OTP + Rust
+|===
+
+=== Standard Priority
+
+[cols="2,2,3,2", options="header"]
+|===
+|Satellite |Reduces |Description |Language
+
+|vex-context-firewall
+|EFR, ICS
+|Truth maintenance for conversation context. Doyle's TMS, dependency graphs,
+retraction propagation.
+|Haskell + Rust
+
+|vex-scope-governor
+|SFR, PQ
+|Scope contract enforcement. Scope contract DSL, boundary enforcement,
+completeness verification.
+|Rust
+
+|vex-error-recovery
+|RCI
+|Strategy variation on failure. Attempt fingerprinting, strategy variation
+generator, escalation state machine.
+|Rust
+
+|vex-verbosity-compressor
+|LPS, TII
+|Information density optimisation. Semantic redundancy detection, filler
+phrase stripping.
+|Rust
+
+|vex-clarification-gate
+|PQ, TII
+|Risk-weighted ambiguity handling. Ambiguity detection, risk assessment,
+assumption ranking.
+|Rust + Elixir
+|===
+
+== Integration Protocol
+
+=== vexometer-trace-v1
+
+Satellites can include before/after traces that vexometer validates.
+
+*Trace format*:
+
+[source,json]
+----
+{
+  "version": "vexometer-trace-v1",
+  "satellite": "vex-lazy-eliminator",
+  "timestamp": "2025-01-15T10:30:00Z",
+  "scenario": {
+    "description": "Code generation with completeness requirement",
+    "prompt": "Write a function to parse JSON config files",
+    "context": {}
+  },
+  "before": {
+    "response": "def parse_config(path):\n    # TODO: implement\n    pass",
+    "metrics": {
+      "CII": 0.85,
+      "LPS": 0.30
+    }
+  },
+  "after": {
+    "response": "def parse_config(path):\n    with open(path) as f:\n        return json.load(f)",
+    "metrics": {
+      "CII": 0.0,
+      "LPS": 0.10
+    }
+  },
+  "reduction": {
+    "CII": 0.85,
+    "LPS": 0.20
+  }
+}
+----
+
+=== vexometer-efficacy-v1
+
+Satellites report metric reduction percentages.
+
+*Efficacy report format*:
+
+[source,json]
+----
+{
+  "version": "vexometer-efficacy-v1",
+  "satellite": "vex-lazy-eliminator",
+  "evaluation_date": "2025-01-15",
+  "sample_size": 500,
+  "metrics": {
+    "CII": {
+      "mean_reduction": 0.72,
+      "std_dev": 0.15,
+      "confidence": 0.95,
+      "p_value": 0.001
+    },
+    "LPS": {
+      "mean_reduction": 0.25,
+      "std_dev": 0.20,
+      "confidence": 0.90,
+      "p_value": 0.01
+    }
+  },
+  "methodology": "A/B testing with vexometer validation",
+  "traces_available": true
+}
+----
+
+=== vexometer-metrics-v1
+
+Satellites can subscribe to specific metric calculations.
+
+[source,ada]
+----
+-- Example: Satellite requesting CII calculation
+with Vexometer.CII;
+
+procedure Validate_Output (Content : String) is
+   Detections : constant Detection_Array := Vexometer.CII.Analyse (Content);
+   Result : constant Metric_Result :=
+      Vexometer.CII.Calculate (Detections, Content'Length);
+begin
+   if Result.Value > Threshold then
+      -- Trigger intervention
+      null;
+   end if;
+end Validate_Output;
+----
+
+== Satellite Repository Structure
+
+Each satellite follows RSR (Rhodium Standard Repository) compliance:
+
+[source]
+----
+vex-SATELLITE_NAME/
+├── .claude/
+│   └── settings.json
+├── .well-known/
+│   └── ...
(RSR compliance)
+├── src/
+│   └── intervention.ads          # Main intervention interface
+├── tests/
+│   └── test_intervention.adb
+├── vexometer/                    # Optional integration
+│   ├── before/                   # Traces showing problem
+│   ├── after/                    # Traces showing improvement
+│   └── EFFICACY.adoc             # Documented metric reduction
+├── CLAUDE.md
+├── CODE_OF_CONDUCT.md
+├── CONTRIBUTING.adoc
+├── ECOSYSTEM.scm
+├── FUNDING.yml
+├── GOVERNANCE.md
+├── LICENSE.txt
+├── META.scm
+├── README.adoc
+├── SECURITY.md
+├── STATE.scm
+├── alire.toml                    # Ada package (if applicable)
+├── Cargo.toml                    # Rust package (if applicable)
+├── flake.nix
+└── justfile
+----
+
+== Satellite ECOSYSTEM.scm Template
+
+[source,scheme]
+----
+;; ECOSYSTEM.scm - Satellite Template
+
+(ecosystem
+  (project "vex-SATELLITE_NAME")
+  (role "intervention")
+  (ecosystem-position "satellite")
+
+  (provides
+    (capability "DESCRIPTION")
+    (interface "intervention-api"))
+
+  (consumes
+    ;; Most satellites are standalone
+    )
+
+  (parent-ecosystem
+    (project "vexometer")
+    (relationship "validates-against")
+    (metrics-targeted (TARGET_METRICS)))
+
+  (vexometer-integration
+    (status "optional")
+    (efficacy-claim "Reduces TARGET_METRICS by X%")
+    (validation-method "before-after-trace-comparison")))
+----
+
+== Validating Satellite Efficacy
+
+=== Step 1: Collect Before Traces
+
+Run vexometer on unmodified LLM outputs:
+
+[source,bash]
+----
+just trace-collect --scenario incomplete-code --output before/
+----
+
+=== Step 2: Apply Satellite Intervention
+
+Process the same prompts through the satellite:
+
+[source,bash]
+----
+vex-lazy-eliminator process --input prompts.json --output responses/
+----
+
+=== Step 3: Collect After Traces
+
+Run vexometer on satellite-processed outputs:
+
+[source,bash]
+----
+just trace-collect --input responses/ --output after/
+----
+
+=== Step 4: Calculate Efficacy
+
+Compare before and after metrics:
+
+[source,bash]
+----
+just efficacy-report --before before/ --after after/ --output EFFICACY.adoc
+----
+
+===
Step 5: Submit for Validation
+
+Include traces in satellite repository for independent verification:
+
+[source]
+----
+vexometer/
+├── before/
+│   ├── trace-001.json
+│   ├── trace-002.json
+│   └── ...
+├── after/
+│   ├── trace-001.json
+│   ├── trace-002.json
+│   └── ...
+└── EFFICACY.adoc
+----
+
+== Development Priorities
+
+=== Recommended First Satellite: vex-lazy-eliminator
+
+*Why*:
+
+1. CII is easiest to measure objectively
+2. High user irritation with incomplete code
+3. Clear success criteria (no TODOs, no placeholders)
+4. Tree-sitter provides reliable AST analysis
+
+*Implementation approach*:
+
+[source,rust]
+----
+// Core interface
+pub enum IncompletenessKind {
+    TodoComment,
+    PlaceholderText,
+    UnimplementedCode,
+    TruncationMarker,
+    NullImplementation,
+}
+
+pub struct Detection {
+    pub kind: IncompletenessKind,
+    pub location: usize,
+    pub severity: f64,
+    pub context: String,
+}
+
+pub fn analyse(content: &str, language: Language) -> Vec<Detection>;
+pub fn is_complete(content: &str, language: Language) -> bool;
+----
+
+=== Future Satellites
+
+Development order based on impact and feasibility:
+
+1. *vex-lazy-eliminator* - Proof of concept, high impact
+2. *vex-hallucination-guard* - High user demand
+3. *vex-sycophancy-shield* - Research interest
+4. *vex-specification-anchor* - Complements STATE.scm work
+5. Remaining satellites as capacity allows
+
+== Technical Standards
+
+All satellites must follow:
+
+* *Languages*: Rust primary, Ada/SPARK for formal verification, Elixir for
+  orchestration, Haskell for pure logic. Avoid Python.
+* *VCS*: GitLab primary, GitHub mirror acceptable
+* *Containers*: Podman over Docker
+* *Build*: justfile over Makefile
+* *Package managers*: Alire (Ada), Cargo (Rust), Mix (Elixir), Cabal (Haskell)
+* *Standard*: RSR (Rhodium Standard Repository) compliance
+* *License*: AGPL-3.0-or-later
+
+== See Also
+
+* link:METRICS.adoc[Metrics Reference] - All 10 vexometer metrics
+* link:../ECOSYSTEM.scm[Ecosystem Map] - Satellite relationships
+* link:../META.scm[Project Metadata] - Vexometer meta-information
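
As an illustration of the `vexometer-trace-v1` format documented in SATELLITES.adoc, the sketch below checks that a trace's `reduction` block agrees with its `before` and `after` metrics. It assumes that reduction is simply `before - after` per metric, which is what the sample trace suggests but the document does not state as a rule. The names `Metrics`, `reduction`, and `claim_matches`, and the tolerance parameter, are hypothetical and not part of any vexometer API.

```rust
use std::collections::BTreeMap;

/// Metric name -> score on the 0-1 scale (lower is better).
type Metrics = BTreeMap<&'static str, f64>;

/// Hypothetical helper: per-metric reduction, assumed to be `before - after`.
/// Metrics missing from `after` are skipped rather than guessed.
fn reduction(before: &Metrics, after: &Metrics) -> Metrics {
    before
        .iter()
        .filter_map(|(name, b)| after.get(name).map(|a| (*name, b - a)))
        .collect()
}

/// Hypothetical check: does a claimed `reduction` block match `before - after`
/// for every metric, within `tol` (an assumed allowance for float rounding)?
fn claim_matches(before: &Metrics, after: &Metrics, claimed: &Metrics, tol: f64) -> bool {
    let actual = reduction(before, after);
    actual.len() == claimed.len()
        && actual.iter().all(|(name, value)| {
            claimed.get(name).map_or(false, |c| (c - value).abs() <= tol)
        })
}

fn main() {
    // Values copied from the sample trace in the Integration Protocol section.
    let before = Metrics::from([("CII", 0.85), ("LPS", 0.30)]);
    let after = Metrics::from([("CII", 0.0), ("LPS", 0.10)]);
    let claimed = Metrics::from([("CII", 0.85), ("LPS", 0.20)]);

    let ok = claim_matches(&before, &after, &claimed, 1e-9);
    println!("claimed reduction consistent: {ok}");
}
```

The tolerance matters because `0.30 - 0.10` is not exactly `0.20` in binary floating point; an exact equality check would reject the sample trace.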