Sentinel is a high-performance risk assessment platform that allows real-time switching between Cloud-scale inference (Cerebras) and Private Local execution (NVIDIA GPU).
To run the Sentinel UI locally on your machine:
- Prerequisites: Ensure you have Node.js installed (which includes npm).
- Initialize Project:
```bash
# Create a new directory and enter it
mkdir sentinel-app && cd sentinel-app

# Initialize a Vite project (React + TypeScript)
npm create vite@latest . -- --template react-ts

# Install dependencies used in this app
npm install lucide-react recharts @google/genai
```
- Copy Files: Place `App.tsx`, `index.html`, `types.ts`, and the `components/` & `services/` folders into your local project structure.
- Run Development Server:
```bash
npm run dev
```
- Access the App: Open your browser to `http://localhost:5173`.
To use the NVIDIA Local Mode, you must have a local LLM server running on your machine via Ollama.
- Install Ollama: Download and install from ollama.com.
- Pull a Risk Model: Open your terminal and pull a high-reasoning model:
```bash
ollama pull llama3
```
- Ensure API Accessibility: Ollama serves an OpenAI-compatible API by default at `http://localhost:11434/v1`; a request sketch is shown after this list.
- Hardware Optimization: Ensure your NVIDIA GPU (RTX 3090/4090 recommended) is correctly recognized by Ollama. You can check this by running `ollama run llama3` and monitoring your GPU usage in Task Manager or `nvidia-smi`.
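For reference, here is a minimal sketch of a request against Ollama's OpenAI-compatible endpoint. The `queryLocalModel` name and the prompt are illustrative; `llama3` matches the pull command above.

```ts
// Minimal sketch: query the local Ollama server through its
// OpenAI-compatible /v1/chat/completions endpoint.
// Assumes `ollama pull llama3` has already been run.
async function queryLocalModel(prompt: string): Promise<string> {
  const response = await fetch("http://localhost:11434/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llama3",
      messages: [{ role: "user", content: prompt }],
    }),
  });
  if (!response.ok) {
    throw new Error(`Ollama request failed: ${response.status}`);
  }
  const data = await response.json();
  return data.choices[0].message.content;
}
```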
- Cloud Engine (Cerebras): Routes requests via the Cerebras SDK for sub-second inference on Wafer-Scale Engines. Best for high-throughput, non-private data.
- Local Engine (NVIDIA/Ollama): Routes requests via the local loopback to your GPU. Best for sensitive data privacy, offline testing, and cost-free execution. A routing sketch follows this list.
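To illustrate the dual-engine design, here is a minimal routing sketch. It is not the app's actual `services/` code: the `EngineMode` type, model names, and environment-based key handling are assumptions for this example. Both Cerebras and Ollama expose OpenAI-compatible chat-completions endpoints, so only the URL, key, and model differ.

```ts
// Minimal sketch of Sentinel-style engine routing (illustrative only).
type EngineMode = "cloud" | "local";

// Cerebras Inference API and the local Ollama server, both OpenAI-compatible.
const ENDPOINTS: Record<EngineMode, string> = {
  cloud: "https://api.cerebras.ai/v1/chat/completions",
  local: "http://localhost:11434/v1/chat/completions",
};

// Assumed model names; adjust to what each engine actually serves.
const MODELS: Record<EngineMode, string> = {
  cloud: "llama3.1-8b",
  local: "llama3",
};

async function assessRisk(mode: EngineMode, prompt: string): Promise<string> {
  const headers: Record<string, string> = { "Content-Type": "application/json" };
  if (mode === "cloud") {
    // Cerebras requires an API key; local Ollama does not.
    // Key sourcing is an assumption; wire it to your app's secret handling.
    headers["Authorization"] = `Bearer ${process.env.CEREBRAS_API_KEY}`;
  }
  const response = await fetch(ENDPOINTS[mode], {
    method: "POST",
    headers,
    body: JSON.stringify({
      model: MODELS[mode],
      messages: [{ role: "user", content: prompt }],
    }),
  });
  const data = await response.json();
  return data.choices[0].message.content;
}
```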
- NVIDIA RTX GPU: Pascal architecture or newer (Ampere/Ada Lovelace preferred for FP16/BF16 performance).
- VRAM:
  - 8GB for Llama3 8B (Quantized)
  - 24GB for Llama3 70B (Quantized) or larger models
- Drivers: Latest NVIDIA Game Ready or Studio drivers.
The system tracks and compares:
- Network Latency: Round-trip time for a request to reach the Cerebras Cloud and return.
- VRAM Latency: Time for the local GPU to load weights and generate tokens via Ollama. A measurement sketch follows this list.
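As an illustration, either latency can be approximated by timing an identical request against each engine. This sketch reuses the hypothetical `assessRisk` from the routing sketch above.

```ts
// Minimal sketch: compare cloud vs. local latency by timing identical
// requests through the hypothetical assessRisk() routing function.
async function measureLatency(
  mode: "cloud" | "local",
  prompt: string
): Promise<number> {
  const start = performance.now();
  await assessRisk(mode, prompt);
  return performance.now() - start; // elapsed milliseconds
}

// Example comparison run
async function compareEngines(prompt: string): Promise<void> {
  const cloudMs = await measureLatency("cloud", prompt);
  const localMs = await measureLatency("local", prompt);
  console.log(`Cerebras Cloud: ${cloudMs.toFixed(0)} ms`);
  console.log(`NVIDIA Local:  ${localMs.toFixed(0)} ms`);
}
```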