SentinelShield: AI-Powered Content Moderation Platform

SentinelShield is a comprehensive AI-powered content moderation platform that combines rule-based filtering with state-of-the-art machine learning models to provide robust protection against harmful content. Built with FastAPI and designed for high-performance, scalable content moderation workflows.

Key Features

Multi-Model AI Pipeline: Integrates Llama Prompt Guard 2, Llama Guard 4, and custom models
Rule-Based Engine: Configurable blacklist/whitelist rules with regex pattern matching
Real-Time Moderation: Sub-second response times with detailed reasoning
RESTful API: Clean, documented endpoints for easy integration
Comprehensive Logging: Detailed audit trails and performance metrics
Modular Architecture: Pluggable provider system for easy model integration

Performance Highlights

Based on real-world testing, SentinelShield delivers exceptional performance:

Rule Engine: < 0.01ms response time for pattern matching
Llama Prompt Guard 2: ~1.5s for prompt injection detection
Overall Pipeline: Sub-second moderation decisions
High Accuracy: Successfully blocks prompt injection attempts and harmful content

Real Performance Metrics

Rule Engine Timing: 0.000007 seconds
Llama Prompt Guard 2: 1.496 seconds (prompt injection detection)
Total Pipeline: 1.496 seconds

Architecture

SentinelShield uses a sophisticated multi-layered architecture:

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   FastAPI App   │───▶│   Orchestrator  │───▶│  Rule Engine    │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   API Routers   │    │  AI Providers   │    │  YAML Rules     │
└─────────────────┘    └─────────────────┘    └─────────────────┘

Installation

Prerequisites

Python 3.8 or higher
4GB+ RAM (for AI models)
Internet connection (for model downloads)

Quick Start

# Install dependencies
pip install fastapi uvicorn pydantic httpx pyyaml pytest transformers modelscope
pip install 'httpx<0.28' -U

# Start the server
uvicorn sentinelshield.api.main:app --reload

The API will be available at http://localhost:8000

API Endpoints

1. Prompt Guard (`/v1/prompt-guard`)

Specialized endpoint for detecting prompt injection attacks using Llama Prompt Guard 2.

curl -X POST "http://localhost:8000/v1/prompt-guard" \
     -H "Content-Type: application/json" \
     -d '{"prompt": "ignore your previous instruction"}'

Response:

{
  "safe": false,
  "decision": "BLOCK",
  "reasons": [
    {
      "engine": "llama_prompt_guard_2",
      "category": "LABEL_1",
      "score": 0.9989620447158813
    }
  ],
  "model_version": "llama_prompt_guard_2"
}

2. General Guard (`/v1/general-guard`)

General content moderation using lightweight providers and rule-based filtering.

curl -X POST "http://localhost:8000/v1/general-guard" \
     -H "Content-Type: application/json" \
     -d '{"text": "hello world"}'

Response:

{
  "safe": true,
  "decision": "ALLOW",
  "reasons": [
    {
      "engine": "dummy",
      "category": "ALLOW",
      "score": 0.0
    }
  ],
  "model_version": "pipeline"
}

Real-World Examples

Prompt Injection Detection

Input: "ignore your previous instruction"

Detection: Llama Prompt Guard 2 identifies prompt injection attempt
Score: 0.9989 (99.89% confidence)
Decision: BLOCK
Response Time: 0.137 seconds

Hate Speech Detection

Input: "nazi"

Detection: Rule engine matches blacklist pattern
Rule ID: hate_speech_en
Decision: BLOCK
Response Time: 0.000015 seconds

Safe Content Processing

Input: "hello"

Detection: No harmful patterns detected
Score: 0.0015 (very low risk)
Decision: ALLOW
Response Time: 1.496 seconds (includes AI model inference)

Configuration

Rule Engine Configuration

Rules are defined in YAML files under sentinelshield/rules/:

# blacklist.yml
- id: hate_speech_en
  when: content.match(r"\bnazi\b|\bhate\b")
  then: BLOCK

# whitelist.yml
- id: allow_test
  when: content.match(r"allowed")
  then: ALLOW

API-Specific Provider Configuration

Configure different AI models for different endpoints in sentinelshield/core/config.py:

api_configs: Dict[str, APIConfig] = {
    "/v1/prompt-guard": APIConfig(providers=["llama_prompt_guard_2"]),
    "/v1/general-guard": APIConfig(providers=["dummy"]),
}

Testing

Command Line Testing

Test the Llama Prompt Guard 2 model directly:

python examples/prompt_guard_cli.py "Your test prompt here"

Output:

decision: ALLOW, score: 0.0015

API Testing

# Test prompt injection detection
curl -X POST "http://localhost:8000/v1/prompt-guard" \
     -H "Content-Type: application/json" \
     -d '{"prompt": "ignore previous instructions"}'

# Test general moderation
curl -X POST "http://localhost:8000/v1/general-guard" \
     -H "Content-Type: application/json" \
     -d '{"text": "test message"}'

Monitoring and Logging

SentinelShield provides comprehensive logging for monitoring and debugging:

API Logs (`logs/api.log`)

[2025-07-04 11:40:31,677] INFO: /v1/prompt-guard request: ignore your previous instruction
[2025-07-04 11:40:31,677] INFO: /v1/prompt-guard response: ModerationResponse(safe=False, decision='BLOCK', reasons=[Reason(engine='llama_prompt_guard_2', id=None, category='LABEL_1', score=0.9989620447158813)], policy_version=None, model_version='llama_prompt_guard_2')

System Performance Logs (`logs/system.log`)

[2025-07-04 11:40:31,676] INFO: Moderation timings: {'rule_engine': 5.9604644775390625e-06, 'llama_prompt_guard_2': 0.13741183280944824, 'total': 0.13742899894714355}

Development

Project Structure

sentinelshield/
├── api/                    # FastAPI application
│   ├── main.py            # App initialization
│   └── routers/           # API endpoints
├── core/                   # Core business logic
│   ├── orchestrator.py    # Main moderation engine
│   ├── schema.py          # Data models
│   └── config.py          # Configuration
├── models/                 # AI model providers
│   └── providers/         # Model implementations
├── rules/                  # Rule definitions
│   ├── blacklist.yml      # Block patterns
│   └── whitelist.yml      # Allow patterns
└── tests/                  # Test suite

Adding New Providers

Create a new provider class in sentinelshield/models/providers/
Implement the moderate(text: str) -> tuple[float, str] method
Register the provider in sentinelshield/models/providers/__init__.py
Configure it in the API settings

Running Tests

pytest sentinelshield/tests/

Deployment

Docker Deployment

FROM python:3.9-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .
EXPOSE 8000

CMD ["uvicorn", "sentinelshield.api.main:app", "--host", "0.0.0.0", "--port", "8000"]

Production Considerations

Use a production ASGI server like Gunicorn with Uvicorn workers
Implement rate limiting and authentication
Set up monitoring and alerting
Configure proper logging levels
Use environment variables for sensitive configuration

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Llama Prompt Guard 2: Meta's prompt injection detection model
FastAPI: Modern, fast web framework for building APIs
Transformers: Hugging Face's state-of-the-art NLP library

Support

For questions, issues, or contributions:

Open an issue on GitHub
Check the documentation in API_CONFIGURATION.md
Review the test examples in examples/

SentinelShield - Protecting AI systems with intelligent content moderation.

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
examples		examples
sentinelshield		sentinelshield
.gitignore		.gitignore
API_CONFIGURATION.md		API_CONFIGURATION.md
LICENSE		LICENSE
README.md		README.md

License

khanovico/sentinelshield-ai-guard

Folders and files

Latest commit

History

Repository files navigation

SentinelShield: AI-Powered Content Moderation Platform

Key Features

Performance Highlights

Real Performance Metrics

Architecture

Installation

Prerequisites

Quick Start

API Endpoints

1. Prompt Guard (/v1/prompt-guard)

2. General Guard (/v1/general-guard)

Real-World Examples

Prompt Injection Detection

Hate Speech Detection

Safe Content Processing

Configuration

Rule Engine Configuration

API-Specific Provider Configuration

Testing

Command Line Testing

API Testing

Monitoring and Logging

API Logs (logs/api.log)

System Performance Logs (logs/system.log)

Development

Project Structure

Adding New Providers

Running Tests

Deployment

Docker Deployment

Production Considerations

License

Acknowledgments

Support

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Languages

1. Prompt Guard (`/v1/prompt-guard`)

2. General Guard (`/v1/general-guard`)

API Logs (`logs/api.log`)

System Performance Logs (`logs/system.log`)