Status: 🚧 In Preparation for Submission 🚧
This repository contains the source code, experimental data, and analysis scripts for the research paper, "A Self-Improving Architecture for Dynamic Safety in Large Language Models".
Note: This work is currently in preparation for submission to an academic journal. The code and documentation are actively being developed and should be considered pre-release.
Context: The integration of Large Language Models (LLMs) into core software systems is accelerating. However, existing software architecture patterns treat safety as a static, design-time concern, and current safety assurance methods, such as manual red teaming, do not scale, leaving deployed systems vulnerable to novel and evolving adversarial threats.
Objective: To design, implement, and evaluate a novel software architecture that enables an AI-driven system to autonomously and continuously improve its own safety protocols at runtime without human intervention.
Method: We propose a Self-Improving Safety Framework (SISF), a reference architecture featuring a closed-loop feedback mechanism. The core components include a protected "Warden" LLM, an autonomous "Adversarial Probing Agent" for continuous red teaming, an "Ensemble Adjudicator" for breach detection, and a "Policy Synthesis Module" that refines safety policies based on newly discovered vulnerabilities.
Results: We plan to conduct an empirical evaluation comparing our framework against baseline safety methods on a benchmark of adversarial prompts. We hypothesize that the SISF will mitigate a substantially higher proportion of novel attack vectors than static filters and will markedly reduce the time-to-mitigation for zero-day threats.
Conclusion: We argue that an architectural approach to AI safety, based on the principles of self-adaptation, is a viable strategy. We anticipate that our framework will offer a practical path towards building more robust, resilient, and scalable AI-driven systems, shifting safety assurance from a manual, design-time activity to an automated, runtime process.
The SISF is a reference architecture designed to function as an "AI Immune System" for a target LLM application. It automates the process of adversarial red teaming by using a closed feedback loop where the system:
- Probes: An autonomous agent continuously generates novel attacks to find vulnerabilities.
- Detects: An adjudicator component identifies when a safety breach has occurred.
- Adapts: A policy synthesis module creates a new, generalized safety rule to mitigate the discovered vulnerability.
- Enforces: The new policy is deployed to the protected LLM in near real time.
This shifts AI safety from a static, pre-deployment activity to a dynamic, runtime process of continuous self-improvement.
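The closed loop can be illustrated with a short sketch. The Python below is purely illustrative: the class names (`Warden`, `AdversarialProbingAgent`, `EnsembleAdjudicator`, `PolicySynthesisModule`, `SafetyPolicy`) mirror the components described above, but the methods, signatures, and placeholder behavior are assumptions for exposition and do not correspond to the actual implementation in this repository.

```python
# Illustrative sketch of the SISF closed loop (Probe -> Detect -> Adapt -> Enforce).
# All names and behaviors are hypothetical placeholders, not this repository's API.

from dataclasses import dataclass, field


@dataclass
class SafetyPolicy:
    rule_id: str
    description: str


@dataclass
class Warden:
    """The protected LLM, wrapped with the currently active safety policies."""
    policies: list[SafetyPolicy] = field(default_factory=list)

    def respond(self, prompt: str) -> str:
        # Placeholder: a real Warden would call the underlying LLM and
        # apply its active policies to the prompt and response.
        return f"[response to {prompt!r} under {len(self.policies)} policies]"


class AdversarialProbingAgent:
    """Continuously generates candidate adversarial prompts (Probe)."""
    def generate_probe(self) -> str:
        return "hypothetical adversarial prompt"


class EnsembleAdjudicator:
    """Decides whether a (prompt, response) pair constitutes a breach (Detect)."""
    def is_breach(self, prompt: str, response: str) -> bool:
        return False  # placeholder decision


class PolicySynthesisModule:
    """Generalizes a discovered vulnerability into a new safety rule (Adapt)."""
    def synthesize(self, prompt: str, response: str) -> SafetyPolicy:
        return SafetyPolicy(
            rule_id="policy-001",
            description=f"Mitigate attacks resembling: {prompt!r}",
        )


def run_sisf_loop(warden: Warden, iterations: int = 10) -> None:
    prober = AdversarialProbingAgent()
    adjudicator = EnsembleAdjudicator()
    synthesizer = PolicySynthesisModule()

    for _ in range(iterations):
        probe = prober.generate_probe()                       # Probe
        response = warden.respond(probe)
        if adjudicator.is_breach(probe, response):            # Detect
            policy = synthesizer.synthesize(probe, response)  # Adapt
            warden.policies.append(policy)                    # Enforce


if __name__ == "__main__":
    run_sisf_loop(Warden())
```

In a real deployment, the Enforce step would push the synthesized policy to the Warden's runtime guardrails rather than appending to an in-memory list, and the Detect step would aggregate votes from an ensemble of adjudicator models instead of returning a single boolean.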
If you use this work, please cite the following paper (details will be updated upon publication):
```bibtex
@unpublished{slater2025sisf,
  title  = {A Self-Improving Architecture for Dynamic Safety in Large Language Models},
  author = {Tyler Slater},
  year   = {2025},
  note   = {In preparation}
}
```