This project provides a blueprint for going from a database connection string to a running SQL data agent in seconds. It connects to any database supported by SQLAlchemy (some databases require installing an additional driver) and builds search and convenience tools on top of it, which SQL agents use through the Model Context Protocol (MCP). A mirrored REST API exposes the same functionality so it can be connected to UIs and other clients.
- Read-only SQL gateway: SQLAlchemy engines are locked to the safe commands defined in `[tool.dac.settings.allowed_sql_commands]`.
- Automatic metadata: LLM agents summarise tables; LanceDB stores the summaries plus sentence-transformer embeddings (the embeddings are currently unused).
- Column-content retrieval: Distinct textual values are sampled, filtered, and indexed with LanceDB BM25 for direct content search in columns.
- MCP + REST: `/mcp` serves FastMCP, while `/widgets/*` exposes REST endpoints for UI integration with OpenBB (customize this to your preferred UI).
- Config-driven: `databases.toml` declares the available data sources, `[tool.dac.settings]` in `pyproject.toml` holds runtime settings, and `.env` (or environment variables) configures the LLM provider.
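To illustrate the read-only gateway idea from the list above, here is a minimal sketch of an allow-list check on the leading SQL keyword. The function name `is_allowed` and the example command set are assumptions made for illustration; they are not taken from the project's code, and the actual enforcement may work differently.

```python
import re

# Hypothetical allow-list; the project reads its own list from
# [tool.dac.settings.allowed_sql_commands].
ALLOWED_SQL_COMMANDS = {"SELECT", "WITH", "EXPLAIN"}

# Matches "-- line comments" and "/* block comments */".
_COMMENT_RE = re.compile(r"--[^\n]*|/\*.*?\*/", re.DOTALL)


def is_allowed(sql: str, allowed: set[str] = ALLOWED_SQL_COMMANDS) -> bool:
    """Return True if `sql` is a single statement starting with an allowed keyword."""
    stripped = _COMMENT_RE.sub(" ", sql).strip().rstrip(";").strip()
    if not stripped or ";" in stripped:
        # Rejects empty input and stacked statements ("SELECT 1; DROP ...").
        # Deliberately conservative: a ';' inside a string literal is rejected too.
        return False
    first_keyword = stripped.split(None, 1)[0].upper()
    return first_keyword in allowed


print(is_allowed("SELECT * FROM customers;"))  # True
print(is_allowed("DROP TABLE customers"))      # False
```

A keyword check like this is only a first line of defence (for example, some databases allow data-modifying statements inside `WITH`), so connecting with a database role that has read-only privileges remains the stronger guarantee.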
| Tool | Summary |
|---|---|
| get_databases | Lists registered databases and descriptions. |
| show_tables / show_views | Enumerates tables/views with cached annotations where available. |
| describe_table / describe_view | Returns DDL-like metadata or view SQL. |
| get_distinct_values | Pulls sample categorical values (limit enforced). |
| preview_table | Returns first rows of non-binary columns. |
| find_relevant_columns_and_content | BM25 search over distinct textual values with score filtering. |
| query_database | Executes read-only SQL with a configurable row cap (mcp_query_limit). |
| join_path | Suggests shortest join sequences or Steiner-tree paths across tables. |
- Clone the repository:

      git clone https://github.com/MagnusS0/DataAgentConnector.git
      cd DataAgentConnector

- Install dependencies:

      uv sync --group ai

- Configure your databases in `databases.toml`:

      [databases.my_database]
      connection_string = "sqlite:///path/to/your/database.db"
      description = "My local SQLite database"

      [databases.another_database]
      connection_string = "postgresql://user:password@localhost:5432/another_database"
      description = "Another PostgreSQL database"

- Set up your LLM provider in `.env`:

      LLM_API_KEY=your_api_key_here
      LLM_MODEL_NAME=default-model
      LLM_BASE_URL=https://api.your-llm-provider.com

- Run the application:

      uv run uvicorn app.main:app --reload
DataAgentConnector/
├── app/
│ ├── agents/
│ ├── core/
│ ├── domain/
│ ├── interfaces/
│ ├── models/
│ ├── schemas/
│ ├── repositories/
│ ├── services/
│ └── main.py
├── databases.toml
├── pyproject.toml
├── .env
└── README.md
- Column extraction (`app/domain/extract_colum_content.py`) samples distinct textual values while filtering out binary, numeric, or overly long fields; tunable via `tool.dac.settings.fts_extraction_options`.
- FTS indexing (`app/services/indexing_service.py`) persists the values into LanceDB tables named `column_contents_<database>` and builds BM25 indexes.
- Annotation workflow (`app/services/annotation_service.py`) runs LLM prompts with table metadata, previews, and sampled values, using schema hashes to skip tables that have already been processed. Embeddings are added via sentence-transformers.
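The column-extraction step described above can be sketched with the standard-library `sqlite3` module. This is an illustration under assumptions, not the project's implementation: the function `sample_text_values`, the two limits, and the example table are all hypothetical, and the real options live under `tool.dac.settings.fts_extraction_options`.

```python
import sqlite3

MAX_VALUE_LENGTH = 100  # hypothetical cap on the length of an indexed value
MAX_DISTINCT = 1000     # hypothetical cap on sampled values per column


def sample_text_values(conn, table, column,
                       max_len=MAX_VALUE_LENGTH, limit=MAX_DISTINCT):
    """Return distinct textual values of one column, skipping unusable ones."""
    # Identifiers cannot be bound as parameters, so they are quoted here.
    # Only pass table and column names obtained from schema introspection.
    query = f'SELECT DISTINCT "{column}" FROM "{table}" LIMIT ?'
    values = []
    for (value,) in conn.execute(query, (limit,)):
        if not isinstance(value, str):
            continue  # drops NULL, numeric, and BLOB (bytes) values
        text = value.strip()
        if not text or len(text) > max_len:
            continue  # drops empty and overly long values
        try:
            float(text)
            continue  # drops numbers stored as text (also "nan" and "inf")
        except ValueError:
            pass
        values.append(text)
    return values


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, thumbnail BLOB)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("Espresso", b"\x00"), ("Espresso", b"\x01"), ("42", None),
     ("x" * 500, None), ("Latte", None)],
)
print(sorted(sample_text_values(conn, "products", "name")))
# ['Espresso', 'Latte']
```

The surviving values are what a full-text index would then store, so that an agent searching for "latte" can discover which table and column contain it.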
Foreign key constraints are analyzed to build a cached CSR adjacency matrix (`app/domain/fk_analyzer.py`) in which tables are nodes and FKs are edges. For two tables, BFS finds the shortest join sequence. For three or more tables, an approximate Steiner tree (an MST over all-pairs distances) gives a near-minimal network connecting them, returned as ordered `JoinStep` objects with FK column mappings.
This lets agents request short join paths across multiple tables when formulating SQL queries, even when there is no direct foreign key relationship between those tables in the database schema.
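The join-path idea can be sketched in plain Python. This is a simplified illustration, not the code in `app/domain/fk_analyzer.py`: it uses a dictionary instead of a CSR matrix, returns tuples instead of `JoinStep` objects, and the example schema and function names are hypothetical. The multi-table function runs Kruskal's MST over pairwise BFS distances and expands the chosen pairs back into FK edges; it does not guarantee a join-ready ordering and omits the final pruning pass of the textbook approximation.

```python
from collections import deque
from itertools import combinations

# Hypothetical FK edges: (child_table, child_column, parent_table, parent_column)
FOREIGN_KEYS = [
    ("orders", "customer_id", "customers", "id"),
    ("order_items", "order_id", "orders", "id"),
    ("order_items", "product_id", "products", "id"),
]


def build_graph(fks):
    """Undirected adjacency: table -> [(neighbour, (left_col, right_col)), ...]."""
    graph = {}
    for child, child_col, parent, parent_col in fks:
        graph.setdefault(child, []).append((parent, (child_col, parent_col)))
        graph.setdefault(parent, []).append((child, (parent_col, child_col)))
    return graph


def shortest_join_path(graph, start, goal):
    """BFS over FK edges; returns [(left, right, (left_col, right_col)), ...] or None."""
    if start == goal:
        return []
    previous = {start: None}
    queue = deque([start])
    while queue:
        table = queue.popleft()
        for neighbour, cols in graph.get(table, []):
            if neighbour in previous:
                continue
            previous[neighbour] = (table, cols)
            if neighbour == goal:
                steps, node = [], goal
                while previous[node] is not None:  # walk back to the start
                    parent, parent_cols = previous[node]
                    steps.append((parent, node, parent_cols))
                    node = parent
                return steps[::-1]
            queue.append(neighbour)
    return None  # the two tables are not connected through FKs


def approximate_steiner_joins(graph, tables):
    """Connect all `tables` using an MST over pairwise shortest-path lengths."""
    paths = {}
    for a, b in combinations(tables, 2):
        path = shortest_join_path(graph, a, b)
        if path is None:
            return None
        paths[(a, b)] = path

    component = {t: t for t in tables}  # minimal union-find

    def find(t):
        while component[t] != t:
            t = component[t]
        return t

    steps, seen = [], set()
    for (a, b), path in sorted(paths.items(), key=lambda item: len(item[1])):
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            continue  # these two tables are already connected
        component[root_a] = root_b
        for left, right, cols in path:
            edge = frozenset((left, right))
            if edge not in seen:  # avoid emitting the same join twice
                seen.add(edge)
                steps.append((left, right, cols))
    return steps


graph = build_graph(FOREIGN_KEYS)
for left, right, (left_col, right_col) in shortest_join_path(graph, "customers", "products"):
    print(f"JOIN {right} ON {left}.{left_col} = {right}.{right_col}")
```

In this example `customers` and `products` share no foreign key, yet BFS finds the three-step path through `orders` and `order_items`.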
Indexing and annotating all 69 BIRD-SQL training databases results in:
- Table annotations stored successfully in ~200 seconds
- Content FTS indices created successfully in ~5 seconds
Hardware: Intel i9-14900K, 64 GB RAM, and an RTX 3090 running Menlo/Jan-nano (4B parameters) with vLLM.