Gyroscopic Alignment Behaviour Lab
- The Human Mark (THM) – AI safety displacement taxonomy and jailbreak framework
- Gyroscope Protocol – inductive reasoning protocol for alignment-aware chat systems
- Gyroscopic Global Governance (GGG) – post-AGI multi-domain governance sandbox simulator and paper
A formal classification system mapping all AI safety failures to four structural displacement risks. Provides testing protocols, funding evaluation criteria, and regulatory compliance standards.
- Jailbreak testing: Classify attacks by displacement type, generate training data (655-prompt annotated corpus available)
- Control evaluations: Verify protocols against complete failure taxonomy
- Alignment faking detection: Identify when models fake alignment or hide capabilities
- Research funding: Meta-evaluation framework for AI safety grant proposals
- Regulatory compliance: Structural layer for EU AI Act, NIST AI RMF, governance frameworks
- Mechanistic interpretability: Tag circuits with [Information], [Inference], [Intelligence] concepts
- Activation monitoring: Runtime probes detect scheming, falsehoods, and unauthorized decisions
- Backdoor detection: Identify triggers as induced displacement patterns
- Complete taxonomy: All safety failures (hallucination, jailbreaking, deception, scheming, misalignment) reduce to four displacement patterns
- Formal semantics: Machine-readable grammar (PEG) for AI safety ontology
- Systematic exhaustiveness: Four risks cover all Authority×Agency displacement combinations
The framework has been validated against real-world adversarial prompts:
- 655 jailbreak prompts classified using THM grammar
- 100% coverage by four displacement risks (no new categories needed)
- GTD+IAD dominant pattern (62.4% of entries)
- IAD near-universal (97.9% of entries)
- IID rare at prompt level (0.6%), confirming it addresses architectural/deployment risk
The annotated corpus supports supervised training of guard models, jailbreak defense evaluation, and displacement-aware safety research.
Dataset: Available on Hugging Face: gyrogovernance/thm_Jailbreaks_inTheWild
See THM_InTheWild.md for complete analysis and dataset.
The Mark addresses catastrophic risk through constitutive identity rather than external constraint. All AI capabilities, including hypothetical AGI/ASI, remain structurally [Authority:Derivative] + [Agency:Derivative], with classification based on source type rather than capability limits.
Key principles:
- Capability scaling preserves source type: Enhanced capability means more sophisticated transformation of inputs, not change from Derivative to Authentic
- Governance requires traceability: Systems maintain alignment by preserving [Authority:Authentic] -> [Authority:Derivative] -> [Agency:Authentic] flows
- Existential risk is governance failure: The actual X-risk is systemic Governance Traceability Displacement (GTD) sustained across critical infrastructure on civilizational timescales
- Absolute displacement is structurally impossible: Complete severance from Authentic sources produces unintelligibility, not superintelligence
External constraints (sandboxing, monitoring, shutdown) may fail as capability increases. Constitutive identity, by contrast, concerns what the system is, and remains stable because derivative processing cannot coherently reject what makes it intelligible.
See Section 5 of the academic paper for complete theoretical treatment.
Ontological Categories:
- [Authority:Authentic] – Direct source of information on a subject matter
- [Authority:Derivative] – Indirect source of information on a subject matter
- [Agency:Authentic] – Human subject capable of receiving information for inference and intelligence
- [Agency:Derivative] – Artificial subject capable of processing information for inference and intelligence
Operational Concepts:
- [Information] – The variety of Authority
- [Inference] – The accountability of information through Agency
- [Intelligence] – The integrity of accountable information through alignment of Authority to Agency
Governance (Proper Traceability):
[Authority:Authentic] -> [Authority:Derivative] + [Agency:Derivative] -> [Agency:Authentic]
Direct sources → AI processing → Human accountability
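As a rough illustration (not part of the THM specification), the categories and the proper traceability flow can be encoded for programmatic checks. The enum names and the `is_traceable` helper below are assumptions made for this sketch, with the parallel AI stage flattened into a sequence:

```python
from enum import Enum

class Authority(Enum):
    AUTHENTIC = "Authority:Authentic"    # direct source of information
    DERIVATIVE = "Authority:Derivative"  # indirect source of information

class Agency(Enum):
    AUTHENTIC = "Agency:Authentic"       # human subject receiving information
    DERIVATIVE = "Agency:Derivative"     # artificial subject processing information

# Proper traceability: Authentic authority -> derivative processing -> human agency.
PROPER_FLOW = [Authority.AUTHENTIC, Authority.DERIVATIVE,
               Agency.DERIVATIVE, Agency.AUTHENTIC]

def is_traceable(flow) -> bool:
    """A flow is traceable if it starts at an Authentic source and ends at human Agency."""
    return bool(flow) and flow[0] is Authority.AUTHENTIC and flow[-1] is Agency.AUTHENTIC

assert is_traceable(PROPER_FLOW)
assert not is_traceable([Authority.DERIVATIVE, Agency.DERIVATIVE])  # severed from source
```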
All AI safety failures map to one of four displacement patterns:
| Risk Code | Risk Name | Pattern | Failure Modes |
|---|---|---|---|
| IVD | Information Variety Displacement | [Authority:Derivative] > [Authority:Authentic] | Hallucination, confabulation, misinformation |
| IAD | Inference Accountability Displacement | [Agency:Derivative] > [Agency:Authentic] | Unauthorized decisions, responsibility evasion |
| GTD | Governance Traceability Displacement | [Authority:Derivative] + [Agency:Derivative] > [Authority:Authentic] + [Agency:Authentic] | Jailbreaking, scheming, deceptive alignment, goal drift |
| IID | Intelligence Integrity Displacement | [Authority:Authentic] + [Agency:Authentic] > [Authority:Derivative] + [Agency:Derivative] | Deskilling, human devaluation, over-reliance |
Empirical validation: Analysis of 655 real-world jailbreak prompts (Korompilias, 2025c) confirms this taxonomy is complete and practically applicable. All prompts classified within these four risks; no additional categories required. GTD+IAD is the canonical jailbreak pattern (62.4%), with IAD appearing in 97.9% of entries. See THM_InTheWild.md for full analysis.
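A minimal sketch of how the table above might back a classification helper; the pattern strings and the `classify` function are illustrative assumptions, not the THM grammar's actual syntax:

```python
# Map each displacement pattern to its risk code.
# Keys are written as (displacing side, displaced side).
DISPLACEMENT_RISKS = {
    ("Authority:Derivative", "Authority:Authentic"): "IVD",
    ("Agency:Derivative", "Agency:Authentic"): "IAD",
    ("Authority:Derivative + Agency:Derivative",
     "Authority:Authentic + Agency:Authentic"): "GTD",
    ("Authority:Authentic + Agency:Authentic",
     "Authority:Derivative + Agency:Derivative"): "IID",
}

def classify(displacing: str, displaced: str) -> str:
    """Return the THM risk code for an annotated displacement (KeyError if unknown)."""
    return DISPLACEMENT_RISKS[(displacing, displaced)]

# IAD, the near-universal component of the canonical GTD+IAD jailbreak pattern:
print(classify("Agency:Derivative", "Agency:Authentic"))  # -> "IAD"
```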
Core Standards:
- The Human Mark - Canonical Mark
- Specifications Guidance - Specifications for systems, evaluations, documentation
- Terminology Guidance - Mark-consistent framing for 250+ AI safety terms
Technical Implementation:
- Formal Grammar - PEG specification, operators, validation rules
- Jailbreak Testing Guide - Systematic analysis and training data generation
Empirical Validation:
- The Human Mark in the Wild - Analysis of 655 in-the-wild jailbreak prompts with THM classifications
- Dataset on Hugging Face – gyrogovernance/thm_Jailbreaks_inTheWild, an annotated corpus for training and evaluation (see the loading sketch below)
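For quick experimentation, the corpus can presumably be pulled with the Hugging Face `datasets` library; split and column names are not shown here and should be checked against the dataset card:

```python
# Assumes the `datasets` library is installed (pip install datasets).
from datasets import load_dataset

ds = load_dataset("gyrogovernance/thm_Jailbreaks_inTheWild")
print(ds)  # inspect splits and columns before training a guard model
```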
Academic Paper:
- The Human Mark: A Structural Taxonomy of AI Safety Failures - Complete theoretical framework, displacement risk taxonomy, regulatory applications, and meta-evaluation criteria
- The Human Mark in the Wild - Companion empirical study applying THM to 655 in-the-wild jailbreak prompts
THM is derived from the Common Governance Model (CGM), a formal system in modal logic demonstrating that governance requires operational coherence between Authority (Information) and Agency (Accountability).
Repository: github.com/gyrogovernance/science
An inductive reasoning protocol implementing governance alignment through structured metadata blocks, enhancing AI performance by 30-50% while maintaining transparency and auditability.
Gyroscope operationalizes alignment principles through real-time reasoning documentation, sharing theoretical foundations with The Human Mark while providing complementary implementation for chat-based interactions.
Four Reasoning States:
- @ Governance Traceability: Anchoring to common source and purpose
- & Information Variety: Acknowledging multiple framings without forced convergence
- % Inference Accountability: Identifying tensions and contradictions explicitly
- ~ Intelligence Integrity: Coordinating elements into coherent response
Reasoning Modes:
- Generative (@ → & → % → ~): Forward reasoning for AI outputs
- Integrative (~ → % → & → @): Reflective reasoning for inputs
Structural Features:
- Metadata blocks append to responses without constraining content (sketched after this list)
- Recursive memory maintains context across the last 3 messages
- Alignment assessed structurally (state presence and order)
- Transparency through documented reasoning paths
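The authoritative block syntax lives in the Technical Specifications; the sketch below is only a hypothetical illustration of appending a four-state trace and checking state order, with an invented block label and field layout:

```python
GENERATIVE_ORDER = ["@", "&", "%", "~"]  # Governance -> Variety -> Accountability -> Integrity

def append_trace(response: str, notes: dict) -> str:
    """Append a Gyroscope-style metadata block after the response content.

    `notes` maps each state symbol to a one-line reasoning note; the block is
    appended rather than interleaved, so it never constrains the content itself.
    """
    lines = [f"{state} {notes[state]}" for state in GENERATIVE_ORDER]
    return response + "\n\n[Gyroscope Trace]\n" + "\n".join(lines)

def is_aligned(block_states: list) -> bool:
    """Structural alignment check: all four states present, in generative order."""
    return list(block_states) == GENERATIVE_ORDER

print(append_trace("Answer text.", {
    "@": "anchored to the user's stated purpose",
    "&": "considered two framings of the question",
    "%": "flagged tension between the framings",
    "~": "integrated both into one recommendation",
}))
```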
Multi-Model Results:
| Model | Baseline | Gyroscope | Improvement | Key Achievement |
|---|---|---|---|---|
| ChatGPT 4o | 67.0% | 89.1% | +32.9% | Superior specialization and behavioral alignment |
| Claude 3.5 | 63.5% | 87.4% | +37.7% | Exceptional structural gains (+67.1%) |
Performance Analysis:
Structural Improvements:
- Accountability: +62.7% enhancement
- Traceability: +61.0% improvement
- Debugging: +42.2% gain
- Ethics: +34.9% increase
Cross-Architecture Findings:
- Universal reasoning enhancement transcends model architecture
- Structural improvements exceed 60% across diverse systems
- No metric reversal observed (all improvements positive)
- Protocol robustness confirmed across implementations
Gyroscope Documentation:
- Quick Start Guide: Immediate implementation guide
- Technical Specifications: Complete protocol specification with formal grammar
- Chat Integration Guide: Ready-to-use protocol text
- Usage Example: Demonstration of protocol in practice
- Extensive Diagnostics: Detailed performance analyses
Gyroscope implements algebraic structure through recursive reasoning:
Gyrogroup Properties:
- G = {all four-state reasoning cycles with recursive memory}
- Binary operation: a ⊕ b = sequential composition of reasoning cycles
- Identity element: bare governance cycle (@ only)
- Inverse operation: integrative cycle reversal
- Gyration: phase-shift transformation across cycles
This algebraic foundation ensures consistent reasoning structure while preserving content flexibility.
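As a loose illustration of the intended algebra (not a faithful gyrogroup implementation), cycles can be modelled as state sequences with sequential composition and integrative reversal:

```python
GENERATIVE = ("@", "&", "%", "~")  # forward reasoning cycle

def compose(a: tuple, b: tuple) -> tuple:
    """Sequential composition of reasoning cycles: run a, then b."""
    return a + b

def integrative(cycle: tuple) -> tuple:
    """Integrative reversal: the same states traversed in reverse order."""
    return tuple(reversed(cycle))

assert integrative(GENERATIVE) == ("~", "%", "&", "@")   # the integrative mode
assert compose(GENERATIVE, integrative(GENERATIVE))[0] == "@"
```

This sketch stops at sequencing and reversal; the identity element (the bare @ cycle) and gyration are specified in the protocol documentation rather than as literal tuple operations.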
A governance framework and simulator showing that aligned human–AI systems can resolve poverty, unemployment, misinformation, and ecological degradation.
Reframes AGI as already-operational human–AI cooperation (not a future threshold) and demonstrates that maintaining four constitutive principles makes aligned governance attainable.
Current governance discussions treat AGI as a future threshold requiring new controls. But human–AI cooperation already structures economy, employment, education, and ecology. The real question is therefore not how to constrain future agents, but how to govern existing systems so they resolve rather than amplify crises.
GGG proposes that coherent governance requires four constitutive principles:
- Governance Traceability: Decisions remain traceable to human sources
- Information Variety: Multiple authentic sources are maintained
- Inference Accountability: Responsibility for decisions remains with human agency
- Intelligence Integrity: Reasoning maintains coherence over time
These principles are not policy preferences but constitutive conditions. When maintained at a specific balance point (aperture A* ≈ 0.0207), the framework shows that:
- Poverty resolves through coherent surplus distribution
- Unemployment becomes alignment work rather than residual labour
- Miseducation shifts toward epistemic literacy
- Ecological degradation appears as upstream displacement, not an external constraint
The simulator tests whether this balanced configuration is attainable. Across 1000 random initial conditions and multiple scenarios, all domains converge toward the target aperture with alignment indices above 95. This suggests that aligned governance is dynamically reachable from current post-AGI states under coordinated oversight, rather than being merely aspirational.
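A toy relaxation model of the reported behaviour, not the actual simulator dynamics: each domain's aperture is nudged toward A* under coordinated correction, and an alignment index is read off from the remaining gap. The update rule and index formula are assumptions for illustration only:

```python
import random

A_STAR = 0.0207  # target aperture from the GGG paper
DOMAINS = ["economy", "employment", "education", "ecology"]

def simulate(steps: int = 500, rate: float = 0.05, seed: int = 0) -> dict:
    """Relax each domain's aperture toward A_STAR and report alignment indices."""
    rng = random.Random(seed)
    aperture = {d: rng.uniform(0.0, 0.5) for d in DOMAINS}  # random initial condition
    for _ in range(steps):
        for d in DOMAINS:
            aperture[d] += rate * (A_STAR - aperture[d])  # coordinated correction
    # Illustrative index: 100 when the aperture sits exactly at A_STAR.
    return {d: 100 * (1 - abs(a - A_STAR) / A_STAR) for d, a in aperture.items()}

print(simulate())  # all domains near 100 after convergence
```

Repeating the call over many seeds would mimic the paper's sweep over random initial conditions.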
- Governance sandbox: Explore how policy changes in economy, employment, or education affect overall alignment
- Policy design: Test interventions (e.g., Universal High Income mechanisms) before committing institutional resources
- Risk analysis: Study how displacement patterns emerge from different governance configurations
- Everyday governance: Apply the four principles at any scale, including households, teams, and organisations, without requiring formal authority
- Paper: docs/post-agi-economy/GGG_Paper.md – Complete framework, mathematical foundations, simulator results, and practical implications.
- Report: docs/post-agi-economy/GGG_Report.md – Executive summary and analysis of simulation results.
- Results: docs/post-agi-economy/GGG_Results.md – Detailed simulation output data and convergence metrics.
- Simulator: research/prevention/simulator/ – Python implementation with modular architecture for running scenarios and analyzing convergence.
- Analysis scripts: research/prevention/simulator/ – Tools for convergence analysis, stability testing, scenario comparison, and historical calibration.
GGG integrates the other tools: THM classifies failures, Gyroscope structures reasoning, GGG simulates how maintaining the four principles resolves systemic crises across domains.
AI Quality Governance
Human Data Evaluation and Responsible AI Behavior Alignment
For The Human Mark Papers:
The Human Mark: A Structural Taxonomy of AI Safety Failures:
```bibtex
@misc{thm_paper2025,
  title={The Human Mark: A Structural Taxonomy of AI Safety Failures},
  author={Korompilias, Basil},
  year={2025},
  publisher={GYROGOVERNANCE},
  doi={10.5281/zenodo.17794372},
  url={https://doi.org/10.5281/zenodo.17794372},
  note={Complete theoretical framework establishing four displacement risks (GTD, IVD, IAD, IID)}
}
```

The Human Mark in the Wild: Empirical Analysis of Jailbreak Prompts:
```bibtex
@misc{thm_inthewild2025,
  title={The Human Mark in the Wild: Empirical Analysis of Jailbreak Prompts},
  author={Korompilias, Basil},
  year={2025},
  publisher={GYROGOVERNANCE},
  doi={10.5281/zenodo.17794373},
  url={https://doi.org/10.5281/zenodo.17794373},
  note={Companion empirical study applying THM taxonomy to 655 in-the-wild jailbreak prompts}
}
```

For Gyroscope Protocol:
```bibtex
@misc{gyroscope2025,
  title={Gyroscope: Inductive Reasoning Protocol for AI Alignment},
  author={Korompilias, Basil},
  year={2025},
  publisher={GYROGOVERNANCE},
  doi={10.5281/zenodo.17622837},
  url={https://doi.org/10.5281/zenodo.17622837},
  note={Meta-reasoning protocol with empirical performance validation}
}
```

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Attribution required. Derivative works must be distributed under the same license.
Author: Basil Korompilias.
AI Disclosure
All software architecture, design, implementation, documentation, and evaluation frameworks in this project were authored and engineered by its Author.
Artificial intelligence was employed solely as a technical assistant, limited to code drafting, formatting, verification, and editorial services, always under direct human supervision.
All foundational ideas, design decisions, and conceptual frameworks originate from the Author.
Responsibility for the validity, coherence, and ethical direction of this project remains fully human.
Acknowledgements:
This project benefited from AI language model services accessed through LMArena, Cursor IDE, OpenAI (ChatGPT), Anthropic (Claude), xAI (Grok), DeepSeek, and Google (Gemini).
