OntoCast is a framework for extracting semantic triples (creating a Knowledge Graph) from documents using an agentic, ontology-driven approach. It combines ontology management, natural language processing, and knowledge graph serialization to turn unstructured text into structured, queryable data.
- Ontology-Guided Extraction: Ensures semantic consistency and co-evolves ontologies
- Entity Disambiguation: Resolves references across document chunks
- Multi-Format Support: Handles text, JSON, PDF, and Markdown
- Semantic Chunking: Splits text based on semantic similarity
- MCP Compatibility: Implements Model Context Protocol (MCP) endpoints
- RDF Output: Produces standardized RDF/Turtle
- Triple Store Integration: Supports Neo4j (n10s) and Apache Fuseki
- Hierarchical Configuration: Type-safe configuration system with environment variable support
- CLI Parameters: Flexible command-line interface with a `--skip-ontology-critique` option
- Automatic LLM Caching: Built-in response caching for improved performance and cost reduction
- GraphUpdate Operations: Token-efficient SPARQL-based updates instead of full graph regeneration
- Budget Tracking: Comprehensive tracking of LLM usage and triple generation metrics
- Ontology Versioning: Automatic semantic versioning with hash-based lineage tracking
OntoCast can be used for:
- Knowledge Graph Construction: Build domain-specific or general-purpose knowledge graphs from documents
- Semantic Search: Power search and retrieval with structured triples (see the query sketch after this list)
- GraphRAG: Enable retrieval-augmented generation over knowledge graphs (e.g., with LLMs)
- Ontology Management: Automate ontology creation, validation, and refinement
- Data Integration: Unify data from diverse sources into a semantic graph
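For example, once documents have been processed into a triple store, the extracted triples can be queried with plain SPARQL. The sketch below is illustrative only: it assumes the local Fuseki endpoint and credentials used in the configuration examples further down, and a hypothetical dataset path; it is not an OntoCast API.

```python
import requests

# Assumed Fuseki query endpoint (dataset path is hypothetical)
FUSEKI_QUERY_URL = "http://localhost:3032/test/sparql"

# Generic query over the extracted triples
query = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
"""

response = requests.post(
    FUSEKI_QUERY_URL,
    data={"query": query},
    headers={"Accept": "application/sparql-results+json"},
    auth=("admin", "password"),  # matches the example FUSEKI_AUTH below
)
response.raise_for_status()

for binding in response.json()["results"]["bindings"]:
    print(binding["s"]["value"], binding["p"]["value"], binding["o"]["value"])
```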
Install OntoCast:

```bash
uv add ontocast
# or
pip install ontocast
```

With document-processing support:

```bash
uv add "ontocast[doc-processing]"
# or
pip install "ontocast[doc-processing]"
```

Create a `.env` file with your configuration:
```bash
# LLM Configuration
LLM_PROVIDER=openai
LLM_API_KEY=your-api-key-here
LLM_MODEL_NAME=gpt-4o-mini
LLM_TEMPERATURE=0.1

# Server Configuration
PORT=8999
MAX_VISITS=3
RECURSION_LIMIT=1000
ESTIMATED_CHUNKS=30

# Path Configuration
ONTOCAST_WORKING_DIRECTORY=/path/to/working
ONTOCAST_ONTOLOGY_DIRECTORY=/path/to/ontologies
ONTOCAST_CACHE_DIR=/path/to/cache

# Optional: Triple Store Configuration
FUSEKI_URI=http://localhost:3032/test
FUSEKI_AUTH=admin:password
FUSEKI_DATASET=ontocast

# Optional: Skip ontology critique
SKIP_ONTOLOGY_DEVELOPMENT=false

# Optional: Maximum triples allowed in ontology graph (set empty for unlimited)
ONTOLOGY_MAX_TRIPLES=10000
```

Start the server:

```bash
ontocast \
  --env-path .env \
  --working-directory /path/to/working \
  --ontology-directory /path/to/ontologies
```

Process a document:

```bash
curl -X POST http://localhost:8999/process -F "file=@document.pdf"
```

The OntoCast server provides the following endpoints:
- `POST /process`: Process documents and extract semantic triples

  ```bash
  curl -X POST http://localhost:8999/process -F "file=@document.pdf"
  ```

- `POST /flush`: Flush/clean triple store data

  ```bash
  # Clean all datasets (Fuseki) or entire database (Neo4j)
  curl -X POST http://localhost:8999/flush
  # Clean specific Fuseki dataset
  curl -X POST "http://localhost:8999/flush?dataset=my_dataset"
  ```

  Note: For Fuseki, you can specify a `dataset` query parameter to clean a specific dataset. If omitted, all datasets are cleaned. For Neo4j, the `dataset` parameter is ignored and all data is deleted.

- `GET /health`: Health check endpoint
- `GET /info`: Service information endpoint
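The same endpoints can also be called from Python. A minimal sketch, assuming the server is running on the default port 8999 and that plain HTTP requests are sufficient:

```python
import requests

BASE_URL = "http://localhost:8999"  # default PORT

# Health check and service information
print(requests.get(f"{BASE_URL}/health").status_code)
print(requests.get(f"{BASE_URL}/info").text)

# Submit a document for processing (multipart upload, mirrors the curl call above)
with open("document.pdf", "rb") as f:
    response = requests.post(f"{BASE_URL}/process", files={"file": f})
response.raise_for_status()
print(response.text)
```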
OntoCast includes automatic LLM response caching to improve performance and reduce API costs. Caching is enabled by default and requires no configuration.
- Tests: `.test_cache/llm/` in the current working directory
- Windows: `%USERPROFILE%\AppData\Local\ontocast\llm\`
- Unix/Linux: `~/.cache/ontocast/llm/` (or `$XDG_CACHE_HOME/ontocast/llm/`)
- Faster Execution: Repeated queries return cached responses instantly
- Cost Reduction: Identical requests don't hit the LLM API
- Offline Capability: Tests can run without API access if responses are cached
- Transparent: No configuration required - works automatically
If you need a custom cache directory, set the `ONTOCAST_CACHE_DIR` environment variable; otherwise the location is chosen automatically:

```python
from ontocast.tool.llm import LLMTool

# Cache directory is managed automatically by Cacher
# (override it with the ONTOCAST_CACHE_DIR environment variable)
llm_tool = LLMTool.create(config=llm_config)
```

OntoCast uses a hierarchical configuration system built on Pydantic BaseSettings (a simplified sketch follows the table below):
| Variable | Description | Default | Required |
|---|---|---|---|
| `LLM_API_KEY` | API key for LLM provider | - | Yes |
| `LLM_PROVIDER` | LLM provider (openai, ollama) | openai | No |
| `LLM_MODEL_NAME` | Model name | gpt-4o-mini | No |
| `LLM_TEMPERATURE` | Temperature setting | 0.1 | No |
| `ONTOCAST_WORKING_DIRECTORY` | Working directory path | - | Yes |
| `ONTOCAST_ONTOLOGY_DIRECTORY` | Ontology files directory | - | No |
| `PORT` | Server port | 8999 | No |
| `MAX_VISITS` | Maximum visits per node | 3 | No |
| `SKIP_ONTOLOGY_DEVELOPMENT` | Skip ontology critique | false | No |
| `ONTOLOGY_MAX_TRIPLES` | Maximum triples allowed in ontology graph | 10000 | No |
| `SKIP_FACTS_RENDERING` | Skip facts rendering and go straight to aggregation | false | No |
| `ONTOCAST_CACHE_DIR` | Custom cache directory for LLM responses | Platform default | No |
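As an illustration of how a BaseSettings-based configuration picks up these variables, here is a simplified, hypothetical sketch; the class layout and field names are not OntoCast's actual configuration classes:

```python
from pydantic_settings import BaseSettings, SettingsConfigDict


class LLMSettings(BaseSettings):
    """Hypothetical sketch: fields mirror the LLM_* variables above."""

    model_config = SettingsConfigDict(env_file=".env", env_prefix="LLM_", extra="ignore")

    api_key: str                     # LLM_API_KEY (required; missing value raises a validation error)
    provider: str = "openai"         # LLM_PROVIDER
    model_name: str = "gpt-4o-mini"  # LLM_MODEL_NAME
    temperature: float = 0.1         # LLM_TEMPERATURE


# Values are resolved from the environment or the .env file
settings = LLMSettings()
print(settings.provider, settings.model_name)
```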
Triple stores are configured via environment variables:

```bash
# Fuseki (Preferred)
FUSEKI_URI=http://localhost:3032/test
FUSEKI_AUTH=admin:password
FUSEKI_DATASET=dataset_name

# Neo4j (Alternative)
NEO4J_URI=bolt://localhost:7689
NEO4J_AUTH=neo4j:password
```

Useful CLI options:

```bash
# Skip ontology critique step
ontocast --skip-ontology-critique

# Process only first N chunks (for testing)
ontocast --head-chunks 5
```
OntoCast supports multiple triple store backends with automatic fallback:
- Apache Fuseki (Recommended) - Native RDF with SPARQL support
- Neo4j with n10s - Graph database with RDF capabilities
- Filesystem (Fallback) - Local file-based storage
When multiple triple stores are configured, Fuseki is preferred over Neo4j.
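The fallback amounts to a simple priority check. Roughly, as an illustrative sketch rather than OntoCast's actual selection code:

```python
import os


def select_triple_store_backend() -> str:
    """Illustrative priority order: Fuseki > Neo4j > filesystem fallback."""
    if os.getenv("FUSEKI_URI"):
        return "fuseki"
    if os.getenv("NEO4J_URI"):
        return "neo4j"
    return "filesystem"


print(select_triple_store_backend())
```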
Fuseki:

```bash
cd docker/fuseki
cp .env.example .env
# Edit .env with your values
docker compose --env-file .env fuseki up -d
```

Neo4j:

```bash
cd docker/neo4j
cp .env.example .env
# Edit .env with your values
docker compose --env-file .env neo4j up -d
```

See Triple Store Setup for detailed instructions.
- Quick Start Guide - Get started quickly
- Configuration System - Detailed configuration guide
- Triple Store Setup - Triple store configuration
- User Guide - Core concepts and workflow
- API Reference - Detailed API documentation
- Automatic Versioning: Semantic version increment based on change analysis (MAJOR/MINOR/PATCH)
- Hash-Based Lineage: Git-style versioning with parent hashes for tracking ontology evolution
- Multiple Version Storage: Versions stored as separate named graphs in Fuseki triple stores
- Timestamp Tracking: `updated_at` field tracks when the ontology was last modified
- Smart Version Analysis: Analyzes ontology changes (classes, properties, instances) to determine the appropriate version bump
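Conceptually, the version bump and hash lineage work along these lines. This is an illustrative sketch only; the function names are hypothetical and the real change analysis is richer:

```python
import hashlib


def bump_version(version: str, change: str) -> str:
    """Increment a MAJOR.MINOR.PATCH version according to the kind of change."""
    major, minor, patch = (int(x) for x in version.split("."))
    if change == "major":       # e.g. removed or redefined classes/properties
        return f"{major + 1}.0.0"
    if change == "minor":       # e.g. new classes or properties added
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"  # small fixes


def lineage_hash(ontology_ttl: str, parent_hash: str | None) -> str:
    """Git-style content hash that chains to the parent version's hash."""
    payload = (parent_hash or "") + ontology_ttl
    return hashlib.sha256(payload.encode()).hexdigest()[:12]


print(bump_version("1.2.3", "minor"))  # -> 1.3.0
print(lineage_hash("@prefix ex: <http://example.org/> .", None))
```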
- Token Efficiency: LLM outputs structured SPARQL operations (insert/delete) instead of full TTL graphs
- Incremental Updates: Only changes are generated, dramatically reducing token usage
- Structured Operations: TripleOp operations with explicit prefix declarations for precise updates
- SPARQL Generation: Automatic conversion of operations to executable SPARQL queries
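To make this concrete, a structured operation of this kind could be rendered into an executable SPARQL update roughly as follows. The field names below are hypothetical, not the actual TripleOp schema:

```python
from dataclasses import dataclass


@dataclass
class TripleOpSketch:
    """Hypothetical stand-in for a structured graph-update operation."""
    action: str                 # "insert" or "delete"
    prefixes: dict[str, str]    # prefix -> namespace IRI
    triples: list[str]          # triple patterns in Turtle-like syntax


def to_sparql(op: TripleOpSketch) -> str:
    """Render the operation as an executable SPARQL update."""
    prefix_block = "\n".join(f"PREFIX {p}: <{iri}>" for p, iri in op.prefixes.items())
    keyword = "INSERT DATA" if op.action == "insert" else "DELETE DATA"
    body = "\n  ".join(op.triples)
    return f"{prefix_block}\n{keyword} {{\n  {body}\n}}"


op = TripleOpSketch(
    action="insert",
    prefixes={"ex": "http://example.org/"},
    triples=["ex:AcmeCorp ex:hasRevenue 1000000 ."],
)
print(to_sparql(op))
```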
- LLM Statistics: Tracks API calls, characters sent/received for cost monitoring
- Triple Metrics: Tracks ontology and facts triples generated per operation
- Summary Reports: Budget summaries logged at end of processing
- Integrated Tracking: Budget tracker integrated into AgentState for clean dependency injection
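A minimal mental model of such a tracker, as an illustrative sketch only; the real tracker is integrated into AgentState and its fields may differ:

```python
from dataclasses import dataclass


@dataclass
class BudgetSketch:
    """Illustrative accumulator for LLM usage and triple counts."""
    llm_calls: int = 0
    chars_sent: int = 0
    chars_received: int = 0
    ontology_triples: int = 0
    facts_triples: int = 0

    def record_llm_call(self, prompt: str, completion: str) -> None:
        self.llm_calls += 1
        self.chars_sent += len(prompt)
        self.chars_received += len(completion)

    def summary(self) -> str:
        return (f"{self.llm_calls} LLM calls, {self.chars_sent} chars sent, "
                f"{self.chars_received} chars received, "
                f"{self.ontology_triples + self.facts_triples} triples generated")


budget = BudgetSketch()
budget.record_llm_call("extract triples from ...", "ex:a ex:b ex:c .")
print(budget.summary())
```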
- Hierarchical Configuration: New `ToolConfig` and `ServerConfig` structure
- Environment Variables: Support for `.env` files and environment variables
- Type Safety: Full type safety with Python 3.12 union syntax
- API Key: Changed from `OPENAI_API_KEY` to `LLM_API_KEY` for consistency
- Dependency Injection: Removed global variables, implemented proper DI
- CLI Parameters: New `--skip-ontology-critique` and `--skip-facts-rendering` parameters
- RDFGraph Operations: Improved `__iadd__` method with proper prefix binding
- Triple Store Management: Better separation between filesystem and external stores
- Serialization Interface: Unified `serialize()` method for storing Ontology and RDFGraph objects
- Error Handling: Improved error handling and validation
See CHANGELOG.md for complete details.
```python
from ontocast.config import Config
from ontocast.toolbox import ToolBox

# Load configuration
config = Config()

# Initialize tools
tools = ToolBox(config)

# Process documents
# ... (use tools for processing)
```

```bash
# Start server with custom configuration
ontocast \
  --env-path .env \
  --working-directory /data/working \
  --ontology-directory /data/ontologies \
  --skip-ontology-critique \
  --head-chunks 10
```

We welcome contributions! Please see our Contributing Guide for details.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
- Documentation: docs/
- Issues: GitHub Issues
- Discussions: GitHub Discussions