
OntoCast

Agentic ontology-assisted framework for semantic triple extraction



Overview

OntoCast is a framework for extracting semantic triples (creating a Knowledge Graph) from documents using an agentic, ontology-driven approach. It combines ontology management, natural language processing, and knowledge graph serialization to turn unstructured text into structured, queryable data.


Key Features

  • Ontology-Guided Extraction: Ensures semantic consistency and co-evolves ontologies
  • Entity Disambiguation: Resolves references across document chunks
  • Multi-Format Support: Handles text, JSON, PDF, and Markdown
  • Semantic Chunking: Splits text based on semantic similarity
  • MCP Compatibility: Implements Model Context Protocol (MCP) endpoints
  • RDF Output: Produces standardized RDF/Turtle
  • Triple Store Integration: Supports Neo4j (n10s) and Apache Fuseki
  • Hierarchical Configuration: Type-safe configuration system with environment variable support
  • CLI Parameters: Flexible command-line interface with --skip-ontology-critique option
  • Automatic LLM Caching: Built-in response caching for improved performance and cost reduction
  • GraphUpdate Operations: Token-efficient SPARQL-based updates instead of full graph regeneration
  • Budget Tracking: Comprehensive tracking of LLM usage and triple generation metrics
  • Ontology Versioning: Automatic semantic versioning with hash-based lineage tracking

Applications

OntoCast can be used for:

  • Knowledge Graph Construction: Build domain-specific or general-purpose knowledge graphs from documents
  • Semantic Search: Power search and retrieval with structured triples
  • GraphRAG: Enable retrieval-augmented generation over knowledge graphs (e.g., with LLMs)
  • Ontology Management: Automate ontology creation, validation, and refinement
  • Data Integration: Unify data from diverse sources into a semantic graph

Installation

uv add ontocast
# or
pip install ontocast

Optional extras enable document processing (PDF, PPT, OCR, semantic chunking):

uv add "ontocast[doc-processing]"
# or
pip install "ontocast[doc-processing]"

Quick Start

1. Configuration

Create a .env file with your configuration:

# LLM Configuration
LLM_PROVIDER=openai
LLM_API_KEY=your-api-key-here
LLM_MODEL_NAME=gpt-4o-mini
LLM_TEMPERATURE=0.1

# Server Configuration
PORT=8999
MAX_VISITS=3
RECURSION_LIMIT=1000
ESTIMATED_CHUNKS=30

# Path Configuration
ONTOCAST_WORKING_DIRECTORY=/path/to/working
ONTOCAST_ONTOLOGY_DIRECTORY=/path/to/ontologies
ONTOCAST_CACHE_DIR=/path/to/cache

# Optional: Triple Store Configuration
FUSEKI_URI=http://localhost:3032/test
FUSEKI_AUTH=admin:password
FUSEKI_DATASET=ontocast

# Optional: Skip ontology critique
SKIP_ONTOLOGY_DEVELOPMENT=false
# Optional: Maximum triples allowed in ontology graph (set empty for unlimited)
ONTOLOGY_MAX_TRIPLES=10000

2. Start Server

ontocast \
    --env-path .env \
    --working-directory /path/to/working \
    --ontology-directory /path/to/ontologies

3. Process Documents

curl -X POST http://localhost:8999/process -F "file=@document.pdf"

4. API Endpoints

The OntoCast server provides the following endpoints:

  • POST /process: Process documents and extract semantic triples

    curl -X POST http://localhost:8999/process -F "file=@document.pdf"
  • POST /flush: Flush/clean triple store data

    # Clean all datasets (Fuseki) or entire database (Neo4j)
    curl -X POST http://localhost:8999/flush
    
    # Clean specific Fuseki dataset
    curl -X POST "http://localhost:8999/flush?dataset=my_dataset"

    Note: For Fuseki, you can specify a dataset query parameter to clean a specific dataset. If omitted, all datasets are cleaned. For Neo4j, the dataset parameter is ignored and all data is deleted.

  • GET /health: Health check endpoint

  • GET /info: Service information endpoint
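
For programmatic access, the POST /process call above maps directly onto Python. A minimal sketch using the requests library (the endpoint and the "file" form field are taken from the curl examples; everything else is illustrative):

import requests

# Upload a document for triple extraction; "file" matches the curl form field
with open("document.pdf", "rb") as f:
    response = requests.post("http://localhost:8999/process", files={"file": f})
response.raise_for_status()
print(response.text)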


LLM Caching

OntoCast includes automatic LLM response caching to improve performance and reduce API costs. Caching is enabled by default and requires no configuration.

Cache Locations

  • Tests: .test_cache/llm/ in the current working directory
  • Windows: %USERPROFILE%\AppData\Local\ontocast\llm\
  • Unix/Linux: ~/.cache/ontocast/llm/ (or $XDG_CACHE_HOME/ontocast/llm/)

Benefits

  • Faster Execution: Repeated queries return cached responses instantly
  • Cost Reduction: Identical requests don't hit the LLM API
  • Offline Capability: Tests can run without API access if responses are cached
  • Transparent: No configuration required - works automatically

Custom Cache Directory

The cache location is controlled by the ONTOCAST_CACHE_DIR environment variable (see the configuration table below); the tool itself takes no cache argument:

from ontocast.tool.llm import LLMTool

# The cache directory is resolved automatically by Cacher,
# honoring ONTOCAST_CACHE_DIR when set
llm_tool = LLMTool.create(config=llm_config)

Configuration System

OntoCast uses a hierarchical configuration system built on Pydantic BaseSettings:

Environment Variables

Variable | Description | Default | Required
LLM_API_KEY | API key for LLM provider | - | Yes
LLM_PROVIDER | LLM provider (openai, ollama) | openai | No
LLM_MODEL_NAME | Model name | gpt-4o-mini | No
LLM_TEMPERATURE | Temperature setting | 0.1 | No
ONTOCAST_WORKING_DIRECTORY | Working directory path | - | Yes
ONTOCAST_ONTOLOGY_DIRECTORY | Ontology files directory | - | No
PORT | Server port | 8999 | No
MAX_VISITS | Maximum visits per node | 3 | No
SKIP_ONTOLOGY_DEVELOPMENT | Skip ontology critique | false | No
ONTOLOGY_MAX_TRIPLES | Maximum triples allowed in ontology graph | 10000 | No
SKIP_FACTS_RENDERING | Skip facts rendering and go straight to aggregation | false | No
ONTOCAST_CACHE_DIR | Custom cache directory for LLM responses | Platform default | No
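
Before starting the server, the two required settings from the table can be sanity-checked with a few lines of Python (a standalone sketch, not part of OntoCast):

import os

# Only LLM_API_KEY and ONTOCAST_WORKING_DIRECTORY are required
required = ["LLM_API_KEY", "ONTOCAST_WORKING_DIRECTORY"]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing required environment variables: {missing}")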

Triple Store Configuration

# Fuseki (Preferred)
FUSEKI_URI=http://localhost:3032/test
FUSEKI_AUTH=admin:password
FUSEKI_DATASET=dataset_name

# Neo4j (Alternative)
NEO4J_URI=bolt://localhost:7689
NEO4J_AUTH=neo4j:password

CLI Parameters

# Skip ontology critique step
ontocast --skip-ontology-critique

# Process only first N chunks (for testing)
ontocast --head-chunks 5

Triple Store Setup

OntoCast supports multiple triple store backends with automatic fallback:

  1. Apache Fuseki (Recommended) - Native RDF with SPARQL support
  2. Neo4j with n10s - Graph database with RDF capabilities
  3. Filesystem (Fallback) - Local file-based storage

When multiple triple stores are configured, Fuseki is preferred over Neo4j.
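
The preference order amounts to a simple selection rule; a hedged sketch (the actual implementation may differ):

import os

def select_backend() -> str:
    """Pick a triple store backend in OntoCast's preference order."""
    if os.environ.get("FUSEKI_URI"):
        return "fuseki"      # native RDF with SPARQL support
    if os.environ.get("NEO4J_URI"):
        return "neo4j"       # requires the n10s plugin
    return "filesystem"      # local file-based fallback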

Quick Setup with Docker

Fuseki:

cd docker/fuseki
cp .env.example .env
# Edit .env with your values
docker compose --env-file .env up -d fuseki

Neo4j:

cd docker/neo4j
cp .env.example .env
# Edit .env with your values
docker compose --env-file .env up -d neo4j

See Triple Store Setup for detailed instructions.




Recent Changes

Ontology Management Improvements

  • Automatic Versioning: Semantic version increment based on change analysis (MAJOR/MINOR/PATCH)
  • Hash-Based Lineage: Git-style versioning with parent hashes for tracking ontology evolution
  • Multiple Version Storage: Versions stored as separate named graphs in Fuseki triple stores
  • Timestamp Tracking: updated_at field tracks when ontology was last modified
  • Smart Version Analysis: Analyzes ontology changes (classes, properties, instances) to determine the appropriate version bump (see the sketch after this list)
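
A hedged sketch of the version-bump decision described above (names and thresholds are illustrative, not OntoCast's actual API):

def bump_version(version: str, removed_terms: int,
                 added_classes: int, added_properties: int) -> str:
    """Increment a semantic version based on ontology change analysis."""
    major, minor, patch = (int(x) for x in version.split("."))
    if removed_terms:                      # breaking change -> MAJOR
        return f"{major + 1}.0.0"
    if added_classes or added_properties:  # compatible addition -> MINOR
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"  # everything else -> PATCH

print(bump_version("1.2.3", removed_terms=0, added_classes=1, added_properties=0))
# 1.3.0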

GraphUpdate System

  • Token Efficiency: LLM outputs structured SPARQL operations (insert/delete) instead of full TTL graphs
  • Incremental Updates: Only changes are generated, dramatically reducing token usage
  • Structured Operations: TripleOp operations with explicit prefix declarations for precise updates
  • SPARQL Generation: Automatic conversion of operations to executable SPARQL queries (see the sketch after this list)
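
To make the idea concrete, a minimal sketch of turning one structured operation into an executable SPARQL update (the field names are hypothetical; the real TripleOp may differ):

from dataclasses import dataclass

@dataclass
class TripleOp:              # hypothetical shape
    action: str              # "insert" or "delete"
    prefixes: dict[str, str]
    triples: list[str]       # "subject predicate object" strings

def to_sparql(op: TripleOp) -> str:
    header = "\n".join(f"PREFIX {p}: <{iri}>" for p, iri in op.prefixes.items())
    keyword = "INSERT DATA" if op.action == "insert" else "DELETE DATA"
    body = " .\n  ".join(op.triples)
    return f"{header}\n{keyword} {{\n  {body} .\n}}"

op = TripleOp("insert", {"ex": "http://example.org/"},
              ["ex:OntoCast ex:produces ex:Triples"])
print(to_sparql(op))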

Budget Tracking

  • LLM Statistics: Tracks API calls and characters sent/received for cost monitoring (see the sketch after this list)
  • Triple Metrics: Tracks ontology and facts triples generated per operation
  • Summary Reports: Budget summaries logged at end of processing
  • Integrated Tracking: Budget tracker integrated into AgentState for clean dependency injection
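
A hedged sketch of what such a tracker records (illustrative only; OntoCast's actual tracker may differ):

from dataclasses import dataclass

@dataclass
class BudgetTracker:         # hypothetical shape
    llm_calls: int = 0
    chars_sent: int = 0
    chars_received: int = 0

    def record_call(self, prompt: str, response: str) -> None:
        self.llm_calls += 1
        self.chars_sent += len(prompt)
        self.chars_received += len(response)

    def summary(self) -> str:
        return (f"{self.llm_calls} LLM calls, {self.chars_sent} chars sent, "
                f"{self.chars_received} chars received")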

Configuration System Overhaul

  • Hierarchical Configuration: New ToolConfig and ServerConfig structure
  • Environment Variables: Support for .env files and environment variables
  • Type Safety: Full type safety with Python 3.12 union syntax
  • API Key: Changed from OPENAI_API_KEY to LLM_API_KEY for consistency
  • Dependency Injection: Removed global variables, implemented proper DI

Enhanced Features

  • CLI Parameters: New --skip-ontology-critique and --skip-facts-rendering parameters
  • RDFGraph Operations: Improved __iadd__ method with proper prefix binding (see the sketch after this list)
  • Triple Store Management: Better separation between filesystem and external stores
  • Serialization Interface: Unified serialize() method for storing Ontology and RDFGraph objects
  • Error Handling: Improved error handling and validation
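
For the prefix-binding point above, a minimal rdflib-based sketch of what a prefix-preserving __iadd__ looks like (not OntoCast's actual implementation):

from rdflib import Graph

class RDFGraph(Graph):
    def __iadd__(self, other: Graph) -> "RDFGraph":
        # Carry over the other graph's prefix bindings before copying triples,
        # so serialization keeps its namespaces
        for prefix, ns in other.namespaces():
            self.bind(prefix, ns, override=False)
        for triple in other:
            self.add(triple)
        return self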

See CHANGELOG.md for complete details.


Examples

Basic Usage

from ontocast.config import Config
from ontocast.toolbox import ToolBox

# Load configuration
config = Config()

# Initialize tools
tools = ToolBox(config)

# Process documents
# ... (use tools for processing)

Server Usage

# Start server with custom configuration
ontocast \
    --env-path .env \
    --working-directory /data/working \
    --ontology-directory /data/ontologies \
    --skip-ontology-critique \
    --head-chunks 10

Contributing

We welcome contributions! Please see our Contributing Guide for details.


License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.


Support