feat: Implement DBT Context System (ANS-412)#26
Conversation
- Add DBTContextSystem class implementing ContextProvider protocol
- Parse manifest.json for models, columns, documentation, materialization
- Format context as LLM-readable text with clear sections
- Handle empty descriptions, missing files, invalid JSON gracefully
- Cache manifest for performance across multiple queries
- Include rich metadata (model count, DBT version, project path)
- Support intervention system (works with remove_percentage, shuffle, etc.)
- Add comprehensive test suite (6 tests) with mock manifest data
- Include usage example in examples/dbt_context_example.py
- Export from main contexts module for easy import
Design uses 'load' terminology - users load their DBT projects into Semiosis
for context-aware agent evaluation and SIT experiments.
Usage:
dbt_context = DBTContextSystem('./my_dbt_project')
context, metadata = dbt_context.get_context('SELECT * FROM users')
reduced = remove_percentage(dbt_context, 0.3) # Apply intervention
WalkthroughAdds a DBTContextSystem provider that loads/parses DBT Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes
Possibly related issues
Possibly related PRs
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (11)
semiosis/contexts/providers/dbt_context.py (4)
51-51: Rename unused loop variable to_node_id.The loop variable
node_idis unused. Per Python convention, prefix it with an underscore to indicate it's intentionally ignored.- for node_id, node in self.manifest.get("nodes", {}).items(): + for _node_id, node in self.manifest.get("nodes", {}).items():
76-77: Add return type hint to private method.Per coding guidelines, comprehensive type hints are required. The
_load_manifestmethod is missing a return type annotation.- def _load_manifest(self): + def _load_manifest(self) -> None:
86-92: Chain exceptions to preserve original traceback.When re-raising exceptions, use
from eto chain them. This preserves the original traceback and helps with debugging.try: with open(manifest_path) as f: self.manifest = json.load(f) except json.JSONDecodeError as e: - raise ValueError(f"Invalid JSON in manifest.json: {e}") + raise ValueError(f"Invalid JSON in manifest.json: {e}") from e except Exception as e: - raise RuntimeError(f"Failed to load manifest.json: {e}") + raise RuntimeError(f"Failed to load manifest.json: {e}") from e
94-95: Expand docstring to Google-style format.Per coding guidelines, use Google-style docstrings for all public methods. While
_format_modelis private, adding Args/Returns documentation improves maintainability.def _format_model(self, model: Dict[str, Any]) -> str: - """Format a DBT model for LLM context.""" + """Format a DBT model for LLM context. + + Args: + model: Dictionary containing DBT model metadata from manifest. + + Returns: + Formatted string representation of the model for LLM consumption. + """ parts = []examples/dbt_context_example.py (2)
1-10: File has shebang but may not be executable.The shebang on line 1 suggests this file should be directly executable, but the file permissions may not be set accordingly.
Either make the file executable:
chmod +x examples/dbt_context_example.pyOr remove the shebang if direct execution isn't intended:
-#!/usr/bin/env python3 """ Example usage of DBT Context System.
49-54: Consider logging the exception type for debugging.While catching a generic
Exceptionis acceptable in example code, adding the exception type could help users diagnose issues more easily.except FileNotFoundError: print("Error: No manifest.json found.") print("Please run 'dbt compile' in your DBT project first.") except Exception as e: - print(f"Error: {e}") + print(f"Error ({type(e).__name__}): {e}")tests/contexts/test_dbt_context.py (5)
74-79: Add explicit encoding when writing files.Using
open()without an explicit encoding relies on the system default, which can vary across platforms.# Write manifest.json manifest_path = target_dir / "manifest.json" - with open(manifest_path, "w") as f: + with open(manifest_path, "w", encoding="utf-8") as f: json.dump(mock_manifest, f)This applies to similar
open()calls intest_invalid_json_error(line 127) andtest_no_models_in_manifest(line 145) as well.
102-108: Prefix unused variable with underscore.The
metadatavariable is unpacked but never used.def test_empty_descriptions_handled(self, temp_dbt_project: str): """Test that empty descriptions are handled gracefully.""" dbt_context = DBTContextSystem(temp_dbt_project) - context, metadata = dbt_context.get_context("SELECT amount FROM orders") + context, _metadata = dbt_context.get_context("SELECT amount FROM orders") # Empty description should show "No description" assert "amount: No description" in context
110-116: Escape the regex metacharacter in the match pattern.The
.in"No manifest.json found"is a regex metacharacter that matches any character. Use a raw string with escaped dot for precise matching.def test_missing_manifest_error(self): """Test error when manifest.json is missing.""" with tempfile.TemporaryDirectory() as temp_dir: dbt_context = DBTContextSystem(temp_dir) - with pytest.raises(FileNotFoundError, match="No manifest.json found"): + with pytest.raises(FileNotFoundError, match=r"No manifest\.json found"): dbt_context.get_context("SELECT * FROM users")
118-133: Escape regex metacharacter in match pattern.Same issue as above—the
.in"Invalid JSON"could match any character (though less likely to cause issues here).For consistency, consider using a raw string:
- with pytest.raises(ValueError, match="Invalid JSON"): + with pytest.raises(ValueError, match=r"Invalid JSON"):
155-168: Prefix unused metadata variables with underscore.Both
metadata1andmetadata2are unpacked but never used.def test_manifest_caching(self, temp_dbt_project: str): """Test that manifest is loaded once and cached.""" dbt_context = DBTContextSystem(temp_dbt_project) # First call loads manifest - context1, metadata1 = dbt_context.get_context("query1") + context1, _metadata1 = dbt_context.get_context("query1") manifest_after_first = dbt_context.manifest # Second call should use cached manifest - context2, metadata2 = dbt_context.get_context("query2") + context2, _metadata2 = dbt_context.get_context("query2") manifest_after_second = dbt_context.manifest
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
examples/dbt_context_example.py(1 hunks)semiosis/contexts/README.md(4 hunks)semiosis/contexts/__init__.py(1 hunks)semiosis/contexts/protocol.py(1 hunks)semiosis/contexts/providers/__init__.py(1 hunks)semiosis/contexts/providers/dbt_context.py(1 hunks)tests/contexts/test_dbt_context.py(1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Use Google-style docstrings for all public methods and classes
Comprehensive type hints required throughout the codebase
Use Black for code formatting with 88 character line length
Use isort with Black profile for import sorting
Use flake8 for linting Python code
Use mypy for type checking in Python
Use deprecation warnings before removing functionality
Files:
semiosis/contexts/providers/__init__.pyexamples/dbt_context_example.pysemiosis/contexts/providers/dbt_context.pytests/contexts/test_dbt_context.pysemiosis/contexts/__init__.pysemiosis/contexts/protocol.py
tests/**/test_*.py
📄 CodeRabbit inference engine (CLAUDE.md)
tests/**/test_*.py: Test files should usetest_*.pynaming convention and be located intests/directory
Mark tests requiring API keys with@pytest.mark.requires_apidecorator
Tests should gracefully skip if API keys are unavailable
Provide fixture data for reproducible tests
Files:
tests/contexts/test_dbt_context.py
🧬 Code graph analysis (4)
semiosis/contexts/providers/__init__.py (1)
semiosis/contexts/providers/dbt_context.py (1)
DBTContextSystem(12-123)
examples/dbt_context_example.py (3)
semiosis/contexts/providers/dbt_context.py (1)
get_context(30-74)semiosis/contexts/interventions.py (1)
remove_percentage(49-76)semiosis/contexts/protocol.py (1)
get_context(27-45)
semiosis/contexts/providers/dbt_context.py (1)
semiosis/contexts/protocol.py (1)
get_context(27-45)
semiosis/contexts/__init__.py (1)
semiosis/contexts/providers/dbt_context.py (1)
DBTContextSystem(12-123)
🪛 Ruff (0.14.6)
examples/dbt_context_example.py
1-1: Shebang is present but file is not executable
(EXE001)
52-52: Do not catch blind exception: Exception
(BLE001)
semiosis/contexts/providers/dbt_context.py
30-30: Unused method argument: query
(ARG002)
51-51: Loop control variable node_id not used within loop body
Rename unused node_id to _node_id
(B007)
81-84: Avoid specifying long messages outside the exception class
(TRY003)
90-90: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
90-90: Avoid specifying long messages outside the exception class
(TRY003)
91-91: Do not catch blind exception: Exception
(BLE001)
92-92: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
92-92: Avoid specifying long messages outside the exception class
(TRY003)
tests/contexts/test_dbt_context.py
105-105: Unpacked variable metadata is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
115-115: Pattern passed to match= contains metacharacters but is neither escaped nor raw
(RUF043)
160-160: Unpacked variable metadata1 is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
164-164: Unpacked variable metadata2 is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
🔇 Additional comments (9)
semiosis/contexts/providers/dbt_context.py (1)
30-30: Unusedqueryparameter is acceptable for protocol compliance.The
queryparameter is unused but required by theContextProviderprotocol. Consider adding a comment or prefixing with underscore if future query-based filtering is not planned.semiosis/contexts/providers/__init__.py (1)
1-5: LGTM!Clean package initialization with proper
__all__export. The structure follows Python conventions and correctly exposesDBTContextSystemfor public use.semiosis/contexts/protocol.py (1)
15-16: LGTM!The terminology update from "extract" to "load" is consistent with the PR's design philosophy and aligns with the README and other documentation updates.
semiosis/contexts/__init__.py (1)
16-20: LGTM!Clean addition of
DBTContextSystemto the public API. The import and export follow the existing patterns in the file.semiosis/contexts/README.md (1)
3-3: LGTM!The "load" terminology is consistent with the protocol documentation and PR design philosophy.
examples/dbt_context_example.py (1)
12-47: LGTM!The example demonstrates a clear progression from basic usage to intervention application, with informative output at each step. Good use of metadata fields and context previews.
tests/contexts/test_dbt_context.py (3)
1-10: LGTM!Imports are well-organized and include all necessary modules for the test suite.
16-64: Well-structured fixtures with comprehensive mock data.The
mock_manifestfixture includes good edge cases like empty descriptions and non-model nodes (test resources) that should be filtered out.
81-100: LGTM!Thorough assertions covering both metadata fields and context content. Good verification of model details, materializations, and tags.
- Remove redundant if self.manifest guard after _load_manifest() - Add assertion with comment explaining _load_manifest() contract - Update README to reflect that DBT implementation is now available - Remove internal issue references from user-facing documentation Addresses PR review feedback on code clarity and documentation accuracy.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
semiosis/contexts/providers/dbt_context.py (1)
50-55: Remove redundant assertion and rename unused loop variable.As noted in the previous review, the assertion on line 50 is unnecessary because
_load_manifest()either successfully setsself.manifestor raises an exception. Additionally, the loop variablenode_idon line 51 is unused and should be prefixed with an underscore.Apply this diff:
- # Load models and documentation - assert self.manifest is not None # _load_manifest() either sets or raises - for node_id, node in self.manifest.get("nodes", {}).items(): + # Load models and documentation + for _node_id, node in self.manifest.get("nodes", {}).items(): if node.get("resource_type") == "model": model_count += 1 context_parts.append(self._format_model(node))Based on past review comments addressing the redundant guard check.
🧹 Nitpick comments (2)
semiosis/contexts/providers/dbt_context.py (2)
30-39: Document why the query parameter is unused.The
queryparameter is accepted to comply with theContextProviderprotocol but is not used in the implementation—all DBT models are returned regardless of the query content. This design choice should be documented in the docstring to clarify that DBT context is query-agnostic.Consider adding a note to the docstring:
def get_context(self, query: str) -> Tuple[str, Dict[str, Any]]: """ Get DBT context for a query. Args: - query: The query/task that needs context + query: The query/task that needs context (unused - DBT context is query-agnostic) Returns: Tuple of (context_string, metadata_dict) """
66-70: Simplify redundant manifest check in metadata construction.The conditional check
if self.manifeston line 68 is redundant. After_load_manifest()executes successfully,self.manifestis guaranteed to be a dictionary, so this defensive check is unnecessary.Apply this diff:
"manifest_version": ( - self.manifest.get("metadata", {}).get("dbt_version") - if self.manifest - else None + self.manifest.get("metadata", {}).get("dbt_version") ),
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
semiosis/contexts/README.md(4 hunks)semiosis/contexts/providers/dbt_context.py(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- semiosis/contexts/README.md
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Use Google-style docstrings for all public methods and classes
Comprehensive type hints required throughout the codebase
Use Black for code formatting with 88 character line length
Use isort with Black profile for import sorting
Use flake8 for linting Python code
Use mypy for type checking in Python
Use deprecation warnings before removing functionality
Files:
semiosis/contexts/providers/dbt_context.py
🧬 Code graph analysis (1)
semiosis/contexts/providers/dbt_context.py (1)
semiosis/contexts/protocol.py (1)
get_context(27-45)
🪛 Ruff (0.14.6)
semiosis/contexts/providers/dbt_context.py
30-30: Unused method argument: query
(ARG002)
51-51: Loop control variable node_id not used within loop body
Rename unused node_id to _node_id
(B007)
81-84: Avoid specifying long messages outside the exception class
(TRY003)
90-90: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
90-90: Avoid specifying long messages outside the exception class
(TRY003)
91-91: Do not catch blind exception: Exception
(BLE001)
92-92: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
92-92: Avoid specifying long messages outside the exception class
(TRY003)
- Change 'raise ValueError(...)' to 'raise ValueError(...) from e' - Change 'raise RuntimeError(...)' to 'raise RuntimeError(...) from e' - Preserves original JSONDecodeError and other exceptions in tracebacks - Improves debugging experience for users Addresses PR review feedback on exception handling best practices.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
semiosis/contexts/providers/dbt_context.py (1)
30-72: Tidy unusedqueryparameter, loop variable, and manifest_version branch.
queryis currently unused but required by the protocol; this will keep tripping ARG002.node_idisn’t used in the loop body.- After the
assert self.manifest is not None, theif self.manifestbranch formanifest_versionis redundant.You can keep behavior identical while satisfying linters and simplifying metadata:
- def get_context(self, query: str) -> Tuple[str, Dict[str, Any]]: + def get_context(self, query: str) -> Tuple[str, Dict[str, Any]]: # noqa: ARG002 @@ - for node_id, node in self.manifest.get("nodes", {}).items(): + for _, node in self.manifest.get("nodes", {}).items(): @@ - "manifest_version": ( - self.manifest.get("metadata", {}).get("dbt_version") - if self.manifest - else None - ), + "manifest_version": self.manifest.get("metadata", {}).get("dbt_version"),Optionally, consider a
# TODO:about usingqueryto restrict relevant models later.Based on static analysis hints.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
semiosis/contexts/providers/dbt_context.py(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Use Google-style docstrings for all public methods and classes
Comprehensive type hints required throughout the codebase
Use Black for code formatting with 88 character line length
Use isort with Black profile for import sorting
Use flake8 for linting Python code
Use mypy for type checking in Python
Use deprecation warnings before removing functionality
Files:
semiosis/contexts/providers/dbt_context.py
🧬 Code graph analysis (1)
semiosis/contexts/providers/dbt_context.py (1)
semiosis/contexts/protocol.py (1)
get_context(27-45)
🪛 Ruff (0.14.6)
semiosis/contexts/providers/dbt_context.py
30-30: Unused method argument: query
(ARG002)
51-51: Loop control variable node_id not used within loop body
Rename unused node_id to _node_id
(B007)
81-84: Avoid specifying long messages outside the exception class
(TRY003)
90-90: Avoid specifying long messages outside the exception class
(TRY003)
92-92: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (2)
semiosis/contexts/providers/dbt_context.py (2)
76-93: Manifest loading and error mapping look solid.The manifest loader cleanly distinguishes:
- Missing file →
FileNotFoundErrorwith actionable guidance.- Invalid JSON →
ValueErrorchained fromJSONDecodeError.- Other I/O issues →
RuntimeErrorwith preserved traceback.This is a good boundary for callers and is consistent with the rest of the API.
94-123: Model formatting is clear and LLM-friendly.The
_format_modelhelper produces concise, structured blocks (name, description, materialization, columns with fallback text, tags). This is easy for both humans and LLMs to scan and degrades gracefully when descriptions/tags are missing.
| to provide context for LLM agents. | ||
| """ | ||
|
|
||
| def __init__(self, project_path: str): |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's check the actual content of the dbt_context.py file around the specified lines
head -n 120 semiosis/contexts/providers/dbt_context.py | tail -n +15Repository: AnswerLayer/semiosis
Length of output: 3680
🏁 Script executed:
# Also check for mypy or flake8 configuration files
find . -maxdepth 2 -name "pyproject.toml" -o -name "setup.cfg" -o -name ".flake8" -o -name "mypy.ini" | head -20Repository: AnswerLayer/semiosis
Length of output: 150
🏁 Script executed:
# Check if there are type checking configurations in the repo
find . -maxdepth 2 -type f \( -name "*.toml" -o -name "*.cfg" \) | xargs grep -l "mypy\|flake8" 2>/dev/null | head -10Repository: AnswerLayer/semiosis
Length of output: 150
Add explicit return type hints for __init__ and _load_manifest.
Both methods are missing return type annotations required by the coding guidelines. Add -> None to these methods:
- def __init__(self, project_path: str):
+ def __init__(self, project_path: str) -> None:
@@
- def _load_manifest(self):
+ def _load_manifest(self) -> None:_format_model already has the correct return type hint (-> str) and requires no changes.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In semiosis/contexts/providers/dbt_context.py around line 20, the constructors
and manifest loader are missing return type annotations per project guidelines;
update the signatures of def __init__(self, project_path: str) and def
_load_manifest(self, ...) to include an explicit return type of -> None (i.e.,
def __init__(...) -> None: and def _load_manifest(...) -> None:), keep existing
parameter types unchanged and do not modify _format_model which already has ->
str.
Design uses 'load' terminology - users load their DBT projects into Semiosis for context-aware agent evaluation and SIT experiments.
Usage:
dbt_context = DBTContextSystem('./my_dbt_project')
context, metadata = dbt_context.get_context('SELECT * FROM users')
reduced = remove_percentage(dbt_context, 0.3) # Apply intervention
Pull Request
What does this PR do?
Brief description of the changes and why they're needed.
Related Issue
Lightweight Checklist
pytest), or rationale provided if not applicableblack,isort,flake8)mypy)Additional Notes (optional)
Any breaking changes, performance considerations, or other context for reviewers.
Summary by CodeRabbit
New Features
Documentation
Tests
✏️ Tip: You can customize this high-level summary in your review settings.