
OntoCast

Agentic ontology-assisted framework for semantic triple extraction



Overview

OntoCast is a framework for extracting semantic triples (creating a Knowledge Graph) from documents using an agentic, ontology-driven approach. It combines ontology management, natural language processing, and knowledge graph serialization to turn unstructured text into structured, queryable data.


Key Features

  • Ontology-Guided Extraction: Ensures semantic consistency and co-evolves ontologies
  • Entity Disambiguation: Resolves references across document chunks
  • Multi-Format Support: Handles text, JSON, PDF, and Markdown
  • Semantic Chunking: Splits text based on semantic similarity
  • MCP Compatibility: Implements Model Context Protocol (MCP) endpoints
  • RDF Output: Produces standardized RDF/Turtle
  • Triple Store Integration: Supports Neo4j (n10s) and Apache Fuseki
  • Hierarchical Configuration: Type-safe configuration system with environment variable support
  • CLI Parameters: Flexible command-line interface with --skip-ontology-critique option
  • Automatic LLM Caching: Built-in response caching for improved performance and cost reduction
  • GraphUpdate Operations: Token-efficient SPARQL-based updates instead of full graph regeneration
  • Budget Tracking: Comprehensive tracking of LLM usage and triple generation metrics
  • Ontology Versioning: Automatic semantic versioning with hash-based lineage tracking

Applications

OntoCast can be used for:

  • Knowledge Graph Construction: Build domain-specific or general-purpose knowledge graphs from documents
  • Semantic Search: Power search and retrieval with structured triples
  • GraphRAG: Enable retrieval-augmented generation over knowledge graphs (e.g., with LLMs)
  • Ontology Management: Automate ontology creation, validation, and refinement
  • Data Integration: Unify data from diverse sources into a semantic graph

Installation

uv add ontocast
# or
pip install ontocast

Optional extras enable document processing (PDF, PPT, OCR, semantic chunking):

uv add "ontocast[doc-processing]"
# or
pip install "ontocast[doc-processing]"

Quick Start

1. Configuration

Create a .env file with your configuration:

# LLM Configuration
LLM_PROVIDER=openai
LLM_API_KEY=your-api-key-here
LLM_MODEL_NAME=gpt-4o-mini
LLM_TEMPERATURE=0.1

# Server Configuration
PORT=8999
MAX_VISITS=3
RECURSION_LIMIT=1000
ESTIMATED_CHUNKS=30

# Path Configuration
ONTOCAST_WORKING_DIRECTORY=/path/to/working
ONTOCAST_ONTOLOGY_DIRECTORY=/path/to/ontologies
ONTOCAST_CACHE_DIR=/path/to/cache

# Optional: Triple Store Configuration
FUSEKI_URI=http://localhost:3032/test
FUSEKI_AUTH=admin:password
FUSEKI_DATASET=ontocast

# Optional: Skip ontology critique
SKIP_ONTOLOGY_DEVELOPMENT=false
# Optional: Maximum triples allowed in ontology graph (set empty for unlimited)
ONTOLOGY_MAX_TRIPLES=10000

2. Start Server

ontocast \
    --env-path .env \
    --working-directory /path/to/working \
    --ontology-directory /path/to/ontologies

3. Process Documents

curl -X POST http://localhost:8999/process -F "file=@document.pdf"

4. API Endpoints

The OntoCast server provides the following endpoints:

  • POST /process: Process documents and extract semantic triples

    curl -X POST http://localhost:8999/process -F "file=@document.pdf"
  • POST /flush: Flush/clean triple store data

    # Clean all datasets (Fuseki) or entire database (Neo4j)
    curl -X POST http://localhost:8999/flush
    
    # Clean specific Fuseki dataset
    curl -X POST "http://localhost:8999/flush?dataset=my_dataset"

    Note: For Fuseki, you can specify a dataset query parameter to clean a specific dataset. If omitted, all datasets are cleaned. For Neo4j, the dataset parameter is ignored and all data is deleted.

  • GET /health: Health check endpoint

  • GET /info: Service information endpoint
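
For programmatic access, the POST /process call above maps directly onto Python. A minimal sketch using the requests library (the endpoint and the "file" form field are taken from the curl examples; everything else is illustrative):

import requests

# Upload a document for triple extraction; "file" matches the curl form field
with open("document.pdf", "rb") as f:
    response = requests.post("http://localhost:8999/process", files={"file": f})
response.raise_for_status()
print(response.text)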


LLM Caching

OntoCast includes automatic LLM response caching to improve performance and reduce API costs. Caching is enabled by default and requires no configuration.

Cache Locations

  • Tests: .test_cache/llm/ in the current working directory
  • Windows: %USERPROFILE%\AppData\Local\ontocast\llm\
  • Unix/Linux: ~/.cache/ontocast/llm/ (or $XDG_CACHE_HOME/ontocast/llm/)

Benefits

  • Faster Execution: Repeated queries return cached responses instantly
  • Cost Reduction: Identical requests don't hit the LLM API
  • Offline Capability: Tests can run without API access if responses are cached
  • Transparent: No configuration required - works automatically

Custom Cache Directory

The cache location is controlled by the ONTOCAST_CACHE_DIR environment variable (see the configuration table below); the tool itself takes no cache argument:

from ontocast.tool.llm import LLMTool

# The cache directory is resolved automatically by Cacher,
# honoring ONTOCAST_CACHE_DIR when set
llm_tool = LLMTool.create(config=llm_config)

Configuration System

OntoCast uses a hierarchical configuration system built on Pydantic BaseSettings:

Environment Variables

Variable | Description | Default | Required
LLM_API_KEY | API key for LLM provider | - | Yes
LLM_PROVIDER | LLM provider (openai, ollama) | openai | No
LLM_MODEL_NAME | Model name | gpt-4o-mini | No
LLM_TEMPERATURE | Temperature setting | 0.1 | No
ONTOCAST_WORKING_DIRECTORY | Working directory path | - | Yes
ONTOCAST_ONTOLOGY_DIRECTORY | Ontology files directory | - | No
PORT | Server port | 8999 | No
MAX_VISITS | Maximum visits per node | 3 | No
SKIP_ONTOLOGY_DEVELOPMENT | Skip ontology critique | false | No
ONTOLOGY_MAX_TRIPLES | Maximum triples allowed in ontology graph | 10000 | No
SKIP_FACTS_RENDERING | Skip facts rendering and go straight to aggregation | false | No
ONTOCAST_CACHE_DIR | Custom cache directory for LLM responses | Platform default | No
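
Before starting the server, the two required settings from the table can be sanity-checked with a few lines of Python (a standalone sketch, not part of OntoCast):

import os

# Only LLM_API_KEY and ONTOCAST_WORKING_DIRECTORY are required
required = ["LLM_API_KEY", "ONTOCAST_WORKING_DIRECTORY"]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing required environment variables: {missing}")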

Triple Store Configuration

# Fuseki (Preferred)
FUSEKI_URI=http://localhost:3032/test
FUSEKI_AUTH=admin:password
FUSEKI_DATASET=dataset_name

# Neo4j (Alternative)
NEO4J_URI=bolt://localhost:7689
NEO4J_AUTH=neo4j:password

CLI Parameters

# Skip ontology critique step
ontocast --skip-ontology-critique

# Process only first N chunks (for testing)
ontocast --head-chunks 5

Triple Store Setup

OntoCast supports multiple triple store backends with automatic fallback:

  1. Apache Fuseki (Recommended) - Native RDF with SPARQL support
  2. Neo4j with n10s - Graph database with RDF capabilities
  3. Filesystem (Fallback) - Local file-based storage

When multiple triple stores are configured, Fuseki is preferred over Neo4j.
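
The preference order amounts to a simple selection rule; a hedged sketch (the actual implementation may differ):

import os

def select_backend() -> str:
    """Pick a triple store backend in OntoCast's preference order."""
    if os.environ.get("FUSEKI_URI"):
        return "fuseki"      # native RDF with SPARQL support
    if os.environ.get("NEO4J_URI"):
        return "neo4j"       # requires the n10s plugin
    return "filesystem"      # local file-based fallback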

Quick Setup with Docker

Fuseki:

cd docker/fuseki
cp .env.example .env
# Edit .env with your values
docker compose --env-file .env up -d fuseki

Neo4j:

cd docker/neo4j
cp .env.example .env
# Edit .env with your values
docker compose --env-file .env up -d neo4j

See Triple Store Setup for detailed instructions.




Recent Changes

Ontology Management Improvements

  • Automatic Versioning: Semantic version increment based on change analysis (MAJOR/MINOR/PATCH)
  • Hash-Based Lineage: Git-style versioning with parent hashes for tracking ontology evolution
  • Multiple Version Storage: Versions stored as separate named graphs in Fuseki triple stores
  • Timestamp Tracking: updated_at field tracks when ontology was last modified
  • Smart Version Analysis: Analyzes ontology changes (classes, properties, instances) to determine the appropriate version bump (see the sketch after this list)
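
A hedged sketch of the version-bump decision described above (names and thresholds are illustrative, not OntoCast's actual API):

def bump_version(version: str, removed_terms: int,
                 added_classes: int, added_properties: int) -> str:
    """Increment a semantic version based on ontology change analysis."""
    major, minor, patch = (int(x) for x in version.split("."))
    if removed_terms:                      # breaking change -> MAJOR
        return f"{major + 1}.0.0"
    if added_classes or added_properties:  # compatible addition -> MINOR
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"  # everything else -> PATCH

print(bump_version("1.2.3", removed_terms=0, added_classes=1, added_properties=0))
# 1.3.0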

GraphUpdate System

  • Token Efficiency: LLM outputs structured SPARQL operations (insert/delete) instead of full TTL graphs
  • Incremental Updates: Only changes are generated, dramatically reducing token usage
  • Structured Operations: TripleOp operations with explicit prefix declarations for precise updates
  • SPARQL Generation: Automatic conversion of operations to executable SPARQL queries (see the sketch after this list)
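
To make the idea concrete, a minimal sketch of turning one structured operation into an executable SPARQL update (the field names are hypothetical; the real TripleOp may differ):

from dataclasses import dataclass

@dataclass
class TripleOp:              # hypothetical shape
    action: str              # "insert" or "delete"
    prefixes: dict[str, str]
    triples: list[str]       # "subject predicate object" strings

def to_sparql(op: TripleOp) -> str:
    header = "\n".join(f"PREFIX {p}: <{iri}>" for p, iri in op.prefixes.items())
    keyword = "INSERT DATA" if op.action == "insert" else "DELETE DATA"
    body = " .\n  ".join(op.triples)
    return f"{header}\n{keyword} {{\n  {body} .\n}}"

op = TripleOp("insert", {"ex": "http://example.org/"},
              ["ex:OntoCast ex:produces ex:Triples"])
print(to_sparql(op))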

Budget Tracking

  • LLM Statistics: Tracks API calls and characters sent/received for cost monitoring (see the sketch after this list)
  • Triple Metrics: Tracks ontology and facts triples generated per operation
  • Summary Reports: Budget summaries logged at end of processing
  • Integrated Tracking: Budget tracker integrated into AgentState for clean dependency injection
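
A hedged sketch of what such a tracker records (illustrative only; OntoCast's actual tracker may differ):

from dataclasses import dataclass

@dataclass
class BudgetTracker:         # hypothetical shape
    llm_calls: int = 0
    chars_sent: int = 0
    chars_received: int = 0

    def record_call(self, prompt: str, response: str) -> None:
        self.llm_calls += 1
        self.chars_sent += len(prompt)
        self.chars_received += len(response)

    def summary(self) -> str:
        return (f"{self.llm_calls} LLM calls, {self.chars_sent} chars sent, "
                f"{self.chars_received} chars received")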

Configuration System Overhaul

  • Hierarchical Configuration: New ToolConfig and ServerConfig structure
  • Environment Variables: Support for .env files and environment variables
  • Type Safety: Full type safety with Python 3.12 union syntax
  • API Key: Changed from OPENAI_API_KEY to LLM_API_KEY for consistency
  • Dependency Injection: Removed global variables, implemented proper DI

Enhanced Features

  • CLI Parameters: New --skip-ontology-critique and --skip-facts-rendering parameters
  • RDFGraph Operations: Improved __iadd__ method with proper prefix binding (see the sketch after this list)
  • Triple Store Management: Better separation between filesystem and external stores
  • Serialization Interface: Unified serialize() method for storing Ontology and RDFGraph objects
  • Error Handling: Improved error handling and validation
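
For the prefix-binding point above, a minimal rdflib-based sketch of what a prefix-preserving __iadd__ looks like (not OntoCast's actual implementation):

from rdflib import Graph

class RDFGraph(Graph):
    def __iadd__(self, other: Graph) -> "RDFGraph":
        # Carry over the other graph's prefix bindings before copying triples,
        # so serialization keeps its namespaces
        for prefix, ns in other.namespaces():
            self.bind(prefix, ns, override=False)
        for triple in other:
            self.add(triple)
        return self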

See CHANGELOG.md for complete details.


Examples

Basic Usage

from ontocast.config import Config
from ontocast.toolbox import ToolBox

# Load configuration
config = Config()

# Initialize tools
tools = ToolBox(config)

# Process documents
# ... (use tools for processing)

Server Usage

# Start server with custom configuration
ontocast \
    --env-path .env \
    --working-directory /data/working \
    --ontology-directory /data/ontologies \
    --skip-ontology-critique \
    --head-chunks 10

Contributing

We welcome contributions! Please see our Contributing Guide for details.


License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.


Support