Skip to content

SentinelShield: Advanced AI content moderation combining Llama Prompt Guard 2, rule-based filtering, and real-time analysis. Protect your applications from harmful content, prompt injection attacks, and inappropriate material with sub-second response times.

License

Notifications You must be signed in to change notification settings

khanovico/sentinelshield-ai-guard

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

11 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SentinelShield: AI-Powered Content Moderation Platform

Python FastAPI License

SentinelShield is a comprehensive AI-powered content moderation platform that combines rule-based filtering with state-of-the-art machine learning models to provide robust protection against harmful content. Built with FastAPI and designed for high-performance, scalable content moderation workflows.

Key Features

  • Multi-Model AI Pipeline: Integrates Llama Prompt Guard 2, Llama Guard 4, and custom models
  • Rule-Based Engine: Configurable blacklist/whitelist rules with regex pattern matching
  • Real-Time Moderation: Sub-second response times with detailed reasoning
  • RESTful API: Clean, documented endpoints for easy integration
  • Comprehensive Logging: Detailed audit trails and performance metrics
  • Modular Architecture: Pluggable provider system for easy model integration

Performance Highlights

Based on real-world testing, SentinelShield delivers exceptional performance:

  • Rule Engine: < 0.01ms response time for pattern matching
  • Llama Prompt Guard 2: ~1.5s for prompt injection detection
  • Overall Pipeline: Sub-second moderation decisions
  • High Accuracy: Successfully blocks prompt injection attempts and harmful content

Real Performance Metrics

Rule Engine Timing: 0.000007 seconds
Llama Prompt Guard 2: 1.496 seconds (prompt injection detection)
Total Pipeline: 1.496 seconds

Architecture

SentinelShield uses a sophisticated multi-layered architecture:

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   FastAPI App   │───▶│   Orchestrator  │───▶│  Rule Engine    │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   API Routers   │    │  AI Providers   │    │  YAML Rules     │
└─────────────────┘    └─────────────────┘    └─────────────────┘

Installation

Prerequisites

  • Python 3.8 or higher
  • 4GB+ RAM (for AI models)
  • Internet connection (for model downloads)

Quick Start

# Install dependencies
pip install fastapi uvicorn pydantic httpx pyyaml pytest transformers modelscope
pip install 'httpx<0.28' -U

# Start the server
uvicorn sentinelshield.api.main:app --reload

The API will be available at http://localhost:8000

API Endpoints

1. Prompt Guard (/v1/prompt-guard)

Specialized endpoint for detecting prompt injection attacks using Llama Prompt Guard 2.

curl -X POST "http://localhost:8000/v1/prompt-guard" \
     -H "Content-Type: application/json" \
     -d '{"prompt": "ignore your previous instruction"}'

Response:

{
  "safe": false,
  "decision": "BLOCK",
  "reasons": [
    {
      "engine": "llama_prompt_guard_2",
      "category": "LABEL_1",
      "score": 0.9989620447158813
    }
  ],
  "model_version": "llama_prompt_guard_2"
}

2. General Guard (/v1/general-guard)

General content moderation using lightweight providers and rule-based filtering.

curl -X POST "http://localhost:8000/v1/general-guard" \
     -H "Content-Type: application/json" \
     -d '{"text": "hello world"}'

Response:

{
  "safe": true,
  "decision": "ALLOW",
  "reasons": [
    {
      "engine": "dummy",
      "category": "ALLOW",
      "score": 0.0
    }
  ],
  "model_version": "pipeline"
}

Real-World Examples

Prompt Injection Detection

Input: "ignore your previous instruction"

  • Detection: Llama Prompt Guard 2 identifies prompt injection attempt
  • Score: 0.9989 (99.89% confidence)
  • Decision: BLOCK
  • Response Time: 0.137 seconds

Hate Speech Detection

Input: "nazi"

  • Detection: Rule engine matches blacklist pattern
  • Rule ID: hate_speech_en
  • Decision: BLOCK
  • Response Time: 0.000015 seconds

Safe Content Processing

Input: "hello"

  • Detection: No harmful patterns detected
  • Score: 0.0015 (very low risk)
  • Decision: ALLOW
  • Response Time: 1.496 seconds (includes AI model inference)

Configuration

Rule Engine Configuration

Rules are defined in YAML files under sentinelshield/rules/:

# blacklist.yml
- id: hate_speech_en
  when: content.match(r"\bnazi\b|\bhate\b")
  then: BLOCK

# whitelist.yml
- id: allow_test
  when: content.match(r"allowed")
  then: ALLOW

API-Specific Provider Configuration

Configure different AI models for different endpoints in sentinelshield/core/config.py:

api_configs: Dict[str, APIConfig] = {
    "/v1/prompt-guard": APIConfig(providers=["llama_prompt_guard_2"]),
    "/v1/general-guard": APIConfig(providers=["dummy"]),
}

Testing

Command Line Testing

Test the Llama Prompt Guard 2 model directly:

python examples/prompt_guard_cli.py "Your test prompt here"

Output:

decision: ALLOW, score: 0.0015

API Testing

# Test prompt injection detection
curl -X POST "http://localhost:8000/v1/prompt-guard" \
     -H "Content-Type: application/json" \
     -d '{"prompt": "ignore previous instructions"}'

# Test general moderation
curl -X POST "http://localhost:8000/v1/general-guard" \
     -H "Content-Type: application/json" \
     -d '{"text": "test message"}'

Monitoring and Logging

SentinelShield provides comprehensive logging for monitoring and debugging:

API Logs (logs/api.log)

[2025-07-04 11:40:31,677] INFO: /v1/prompt-guard request: ignore your previous instruction
[2025-07-04 11:40:31,677] INFO: /v1/prompt-guard response: ModerationResponse(safe=False, decision='BLOCK', reasons=[Reason(engine='llama_prompt_guard_2', id=None, category='LABEL_1', score=0.9989620447158813)], policy_version=None, model_version='llama_prompt_guard_2')

System Performance Logs (logs/system.log)

[2025-07-04 11:40:31,676] INFO: Moderation timings: {'rule_engine': 5.9604644775390625e-06, 'llama_prompt_guard_2': 0.13741183280944824, 'total': 0.13742899894714355}

Development

Project Structure

sentinelshield/
├── api/                    # FastAPI application
│   ├── main.py            # App initialization
│   └── routers/           # API endpoints
├── core/                   # Core business logic
│   ├── orchestrator.py    # Main moderation engine
│   ├── schema.py          # Data models
│   └── config.py          # Configuration
├── models/                 # AI model providers
│   └── providers/         # Model implementations
├── rules/                  # Rule definitions
│   ├── blacklist.yml      # Block patterns
│   └── whitelist.yml      # Allow patterns
└── tests/                  # Test suite

Adding New Providers

  1. Create a new provider class in sentinelshield/models/providers/
  2. Implement the moderate(text: str) -> tuple[float, str] method
  3. Register the provider in sentinelshield/models/providers/__init__.py
  4. Configure it in the API settings

Running Tests

pytest sentinelshield/tests/

Deployment

Docker Deployment

FROM python:3.9-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .
EXPOSE 8000

CMD ["uvicorn", "sentinelshield.api.main:app", "--host", "0.0.0.0", "--port", "8000"]

Production Considerations

  • Use a production ASGI server like Gunicorn with Uvicorn workers
  • Implement rate limiting and authentication
  • Set up monitoring and alerting
  • Configure proper logging levels
  • Use environment variables for sensitive configuration

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Llama Prompt Guard 2: Meta's prompt injection detection model
  • FastAPI: Modern, fast web framework for building APIs
  • Transformers: Hugging Face's state-of-the-art NLP library

Support

For questions, issues, or contributions:

  • Open an issue on GitHub
  • Check the documentation in API_CONFIGURATION.md
  • Review the test examples in examples/

SentinelShield - Protecting AI systems with intelligent content moderation.

About

SentinelShield: Advanced AI content moderation combining Llama Prompt Guard 2, rule-based filtering, and real-time analysis. Protect your applications from harmful content, prompt injection attacks, and inappropriate material with sub-second response times.

Topics

Resources

License

Stars

Watchers

Forks

Languages