feat(cli): add stats command to display graph statistics #255

AndyBodnar · 2026-01-10T18:13:45Z

Summary

Adds a new graph-code stats command that displays statistics about the indexed codebase knowledge graph.

Changes

Added stats command to cli.py
Added STATS to CLICommandName enum in cli_help.py
Added CMD_STATS help text
Added constants for table column headers and error messages

Example Output

╭─ Node Statistics ────────────────────╮
│ Node Type      │ Count              │
├────────────────┼────────────────────┤
│ Function       │ 1,234              │
│ Class          │ 156                │
│ Module         │ 89                 │
├────────────────┼────────────────────┤
│ Total Nodes    │ 1,479              │
╰──────────────────────────────────────╯

╭─ Relationship Statistics ───────────╮
│ Relationship Type   │ Count         │
├─────────────────────┼───────────────┤
│ CALLS               │ 3,456         │
│ CONTAINS            │ 1,245         │
│ IMPORTS             │ 567           │
├─────────────────────┼───────────────┤
│ Total Relationships │ 5,268         │
╰─────────────────────────────────────╯

Test plan

Python syntax valid
Run graph-code stats with Memgraph connected
Verify node counts match expected values
Verify relationship counts display correctly

Fixes #248

Adds a new `graph-code stats` command that displays statistics about the indexed codebase graph, including:
- Node counts by type (Function, Class, Module, etc.)
- Relationship counts by type (CALLS, CONTAINS, INHERITS, etc.)
- Total node and relationship counts
- Formatted output using Rich tables

Fixes vitali87#248

gemini-code-assist · 2026-01-10T18:14:01Z

Summary of Changes

Hello @AndyBodnar, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new command-line interface feature that allows users to quickly gain insights into the structure and content of their indexed codebase knowledge graph. By providing a clear, tabulated summary of node and relationship statistics, it enhances the usability of the tool, enabling better understanding and debugging of the graph data.

Highlights

New stats CLI Command: A new stats command has been introduced to the graph-code command-line interface, allowing users to display statistics about their indexed codebase knowledge graph.
Detailed Graph Statistics: The command provides a summary of node counts by label and relationship counts by type, offering a clear overview of the graph's composition.
Formatted Output with rich: Statistics are presented in well-formatted, color-coded tables using the rich library, enhancing readability and user experience. Total counts for both nodes and relationships are included.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces a new stats command to the CLI, which is a great addition for visibility into the knowledge graph. The implementation is straightforward, but I've identified a few areas for improvement concerning robustness, adherence to API contracts, and code structure. My main suggestions involve making the Cypher queries more robust, using public methods instead of internal ones, removing an unused CLI parameter, and considering refactoring duplicated code for better maintainability, while acknowledging the trade-off with readability. These changes will make the new command more reliable and easier to maintain in the future.

codebase_rag/cli.py

greptile-apps · 2026-01-10T18:18:48Z

Greptile Overview

Greptile Summary

This PR adds a new graph-code stats command that displays statistics about the indexed codebase knowledge graph, showing node counts by type and relationship counts by type in formatted Rich tables.

What Changed

cli.py: Added 82-line stats() command function that queries Memgraph for node and relationship statistics
cli_help.py: Added STATS to CLICommandName enum and corresponding help text
constants.py: Added 8 new constants for stats command (table titles, column headers, error messages)
logs.py: Added STATS_ERROR log message template

Integration with Codebase

The implementation follows the existing CLI command pattern:

Uses Typer for argument parsing with --batch-size option
Leverages connect_memgraph() context manager from main.py
Uses Rich library for table formatting (consistent with other commands)
Follows error handling pattern with try/except and typer.Exit(1)

Issues Identified

Critical Issues (Must Fix)

Multi-label node handling: The Cypher query labels(n)[0] only captures the first label of each node. In graph databases, nodes can have multiple labels (e.g., :Function and :Exported), but this query would only count each node under its primary label, potentially misrepresenting the graph structure.
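
To illustrate the concern, here is a minimal sketch in plain Python. It uses made-up in-memory nodes instead of a real Memgraph query, and both the sample labels and the `:`-joined display key are assumptions chosen for illustration.

```python
from collections import Counter

# Hypothetical nodes, each represented only by its list of labels.
nodes = [
    ["Function"],
    ["Function", "Exported"],
    ["Class"],
    ["Function", "Exported"],
]

# Equivalent of grouping by labels(n)[0]: secondary labels are dropped.
first_label_counts = Counter(labels[0] for labels in nodes)

# Grouping by the full label list keeps multi-label nodes distinct.
full_label_counts = Counter(":".join(labels) for labels in nodes)

print(dict(first_label_counts))
print(dict(full_label_counts))
```

Both groupings add up to the same total, so only the per-type breakdown differs between them.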

Style/Best Practice Issues (Should Fix)

Inline comments without (H) marker: Lines 398, 402, 410, 433 contain inline comments that violate the project's strict comment policy
Hardcoded Cypher queries: The two Cypher queries should be defined as constants in cypher_queries.py following the "Single Source of Truth" principle
Private method usage: The code calls ingestor._execute_query() instead of the public fetch_all() method, breaking encapsulation

Positive Aspects

Proper error handling with exception catching and logging
Constants properly organized in constants.py and logs.py
Consistent with existing CLI command structure
Good use of Rich library for formatted output
Help text and enum values properly added

Confidence Score: 2/5

This PR has one critical logic issue with multi-label node handling and multiple style violations that should be addressed before merging
Score of 2 reflects: (1) One logic issue that could produce incorrect statistics for nodes with multiple labels - this is a functional correctness problem; (2) Multiple style violations including inline comments, hardcoded queries, and private method usage that deviate from the project's strict coding standards; (3) The implementation is otherwise functionally complete with proper error handling. While the feature works, the multi-label issue means the statistics may not accurately represent the graph, and the style violations need fixing to maintain code quality.
Pay close attention to codebase_rag/cli.py, particularly the Cypher query on line 400 which handles multi-label nodes incorrectly

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| codebase_rag/cli.py | 2/5 | Added new `stats` command with multiple style violations (inline comments without (H) marker, hardcoded Cypher queries, use of private method) and one logic issue (multi-label node handling) |
| codebase_rag/cli_help.py | 5/5 | Added STATS enum value and help text following existing patterns correctly, no issues found |
| codebase_rag/constants.py | 5/5 | Added stats-related constants (table headers, titles, error messages) following project conventions correctly, no issues found |
| codebase_rag/logs.py | 5/5 | Added STATS_ERROR log message following existing pattern correctly, no issues found |

Sequence Diagram

sequenceDiagram
    participant User
    participant CLI as cli.py:stats()
    participant Settings as config.settings
    participant Main as main.py
    participant Ingestor as MemgraphIngestor
    participant Memgraph as Memgraph DB
    participant Console as Rich Console

    User->>CLI: graph-code stats [--batch-size N]
    CLI->>Console: Print "Connecting to Memgraph..."
    CLI->>Settings: resolve_batch_size(batch_size)
    Settings-->>CLI: effective_batch_size
    CLI->>Main: connect_memgraph(effective_batch_size)
    Main->>Ingestor: __init__(host, port, batch_size)
    Main->>Ingestor: __enter__()
    Ingestor->>Memgraph: connect()
    Memgraph-->>Ingestor: connection established
    Ingestor-->>Main: ingestor instance
    Main-->>CLI: ingestor
    
    CLI->>Ingestor: _execute_query("MATCH (n) RETURN labels(n)[0]...")
    Ingestor->>Memgraph: Execute Cypher query (node counts)
    Memgraph-->>Ingestor: node statistics results
    Ingestor-->>CLI: node_results
    
    CLI->>Ingestor: _execute_query("MATCH ()-[r]->() RETURN type(r)...")
    Ingestor->>Memgraph: Execute Cypher query (relationship counts)
    Memgraph-->>Ingestor: relationship statistics results
    Ingestor-->>CLI: rel_results
    
    CLI->>CLI: Calculate total_nodes and total_rels
    CLI->>Console: Print node statistics table
    CLI->>Console: Print relationship statistics table
    
    CLI->>Ingestor: __exit__()
    Ingestor->>Memgraph: close connection
    
    alt Success
        CLI->>User: Display statistics tables
    else Exception
        CLI->>Console: Print error message
        CLI->>User: Exit with code 1
    end

greptile-apps

6 files reviewed, 6 comments

codebase_rag/cli.py

vitali87 · 2026-01-10T19:01:29Z

@AndyBodnar please clear all comments flagged by the bots. I will review only after they are all resolved.

AndyBodnar · 2026-01-10T19:19:40Z

I've addressed all the bot review comments in the latest commit:

Multi-label node handling - Fixed by returning all labels(n) and joining them with : for display
Hardcoded Cypher queries - Moved to cypher_queries.py as CYPHER_STATS_NODE_COUNTS and CYPHER_STATS_RELATIONSHIP_COUNTS constants
Private method usage - Changed from _execute_query() to the public fetch_all() method
Inline comments - Removed the inline comments
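
For readers following along, the shape of these changes can be sketched roughly as below. The two constant names are the ones mentioned above; the query text, the `fetch_all` signature, and the `FakeIngestor` and `load_stats` names are assumptions made so the sketch runs without Memgraph, not the actual code in this PR.

```python
from typing import Protocol

# Constant names come from this PR; the query text is an assumption.
CYPHER_STATS_NODE_COUNTS = (
    "MATCH (n) RETURN labels(n) AS labels, count(n) AS count ORDER BY count DESC"
)
CYPHER_STATS_RELATIONSHIP_COUNTS = (
    "MATCH ()-[r]->() RETURN type(r) AS type, count(r) AS count ORDER BY count DESC"
)


class SupportsFetchAll(Protocol):
    """Assumed shape of the public query method on the ingestor."""

    def fetch_all(self, query: str) -> list[dict]: ...


class FakeIngestor:
    """Stand-in for the real ingestor so the sketch runs without a database."""

    def fetch_all(self, query: str) -> list[dict]:
        if query == CYPHER_STATS_NODE_COUNTS:
            return [{"labels": ["Function", "Exported"], "count": 2}]
        return [{"type": "CALLS", "count": 5}]


def load_stats(ingestor: SupportsFetchAll) -> tuple[list[dict], list[dict]]:
    # Only the public method is used; no private _execute_query() call.
    node_rows = ingestor.fetch_all(CYPHER_STATS_NODE_COUNTS)
    rel_rows = ingestor.fetch_all(CYPHER_STATS_RELATIONSHIP_COUNTS)
    return node_rows, rel_rows


node_rows, rel_rows = load_stats(FakeIngestor())
for row in node_rows:
    # All labels are kept and joined with ':' for display.
    print(":".join(row["labels"]), row["count"])
```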

Ready for review when you have time!

AndyBodnar · 2026-01-12T01:48:05Z

All the bot comments have been resolved. Ready for your review whenever you have time.

vitali87 · 2026-01-16T23:42:12Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces a useful stats command to display graph statistics. The implementation is straightforward and uses rich to present the data nicely. I've added a few suggestions to improve code clarity in the data processing loops and to make the error handling more robust. Overall, a good addition to the CLI.

codebase_rag/cli.py

AndyBodnar · 2026-01-17T05:21:45Z

Cleaned up the remaining items:

Removed the batch_size parameter from the stats command since it only does reads
Removed the inline comment that was flagged by the linter

The earlier commit already moved the Cypher queries to cypher_queries.py and switched to fetch_all() instead of the private method. Should be good to go now.

AndyBodnar · 2026-01-17T17:12:04Z

Thanks for catching those! Yeah the broad Exception catch is lazy on my part, I'll narrow it down to the specific mgclient errors. And good call on the redundant int/string conversions, will clean those up too. Pushing the fixes shortly.
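
A rough sketch of what narrowing the catch could look like is below. The exact mgclient exception classes are not named in this thread, so `DatabaseError` here is a hypothetical stand-in, and `run_stats` and the two fetch functions are illustrative names only.

```python
class DatabaseError(Exception):
    """Hypothetical stand-in for a driver-specific error class."""


def run_stats(fetch) -> int:
    """Return a process exit code: 0 on success, 1 on a database failure."""
    try:
        rows = fetch()
    except DatabaseError as exc:
        # Only the expected database failure is turned into exit code 1.
        print(f"Failed to get graph statistics: {exc}")
        return 1
    print(f"{len(rows)} rows")
    return 0


def failing_fetch():
    raise DatabaseError("connection refused")


def buggy_fetch():
    raise TypeError("programming error, should not be hidden")


print(run_stats(lambda: []))
print(run_stats(failing_fetch))
```

With the narrow `except`, a call such as `run_stats(buggy_fetch)` lets the `TypeError` propagate instead of reporting it as a database problem.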

vitali87 · 2026-01-18T14:42:24Z

Thank you, @AndyBodnar. Excellent job! Let me run it and verify the functionality; if all is good, I will approve and merge 🙌

AndyBodnar · 2026-01-18T22:05:19Z

Did everything get cleaned up properly?

vitali87 · 2026-01-19T10:59:40Z

Code Review: Type Safety, Linting, and Test Coverage Required

I've tested this PR thoroughly and found several issues that need to be addressed before merging.

Important: Running the pre-commit hook is mandatory before submitting changes. This would have caught the linting and type check errors below. Please ensure you run `uv run ruff check --fix . && uv run ruff format . && uv run ty check codebase_rag/` before pushing commits.

1. Import Ordering (Linting Error)

File: codebase_rag/cli.py

The imports are not properly sorted. Running `uv run ruff check codebase_rag/cli.py` reports:

I001 [*] Import block is un-sorted or un-formatted

Fix: Run `uv run ruff check --fix codebase_rag/cli.py` or manually reorder the imports so that local imports from .cypher_queries come before .main (alphabetical order within the local import group).

2. Type Safety Issues with `ResultRow` (Type Check Errors)

File: codebase_rag/cli.py

Running `uv run ty check codebase_rag/cli.py` reports 3 errors. The root cause is that ResultRow is typed as:

type ResultRow = dict[str, ResultValue]
type ResultValue = ResultScalar | list[ResultScalar] | dict[str, ResultScalar]
type ResultScalar = str | int | float | bool | None

When calling row.get("count", 0), the return type is ResultValue, which includes None, list, and dict variants that cannot be passed directly to int() or used with str.join().

Error 1 & 2: `int()` conversion on lines 397-398

total_nodes = sum(int(row.get("count", 0)) for row in node_results)
total_rels = sum(int(row.get("count", 0)) for row in rel_results)

Problem: row.get("count", 0) returns ResultValue which can be None, list, or dict — types that int() cannot accept.

Error 3: `str.join()` on line 410

labels = row.get("labels", [])
label = ":".join(labels) if labels else "Unknown"

Problem: row.get("labels", []) returns ResultValue, not list[str], so str.join() cannot accept it.

Fix: Add helper functions with proper type narrowing and import ResultRow:

from .types_defs import ResultRow


def _get_count(row: ResultRow) -> int:
    val = row.get("count", 0)
    return int(val) if isinstance(val, (int, float)) else 0


def _get_labels(row: ResultRow) -> list[str]:
    val = row.get("labels", [])
    if isinstance(val, list):
        return [str(v) for v in val if v is not None]
    return []

Then update the stats function to use these helpers:

total_nodes = sum(_get_count(row) for row in node_results)
total_rels = sum(_get_count(row) for row in rel_results)

# In the node loop:
labels = _get_labels(row)
count = _get_count(row)

# In the relationship loop:
count = _get_count(row)

3. Missing Test Coverage

The PR adds a new CLI command but includes no tests. At minimum, the following should be tested:

Unit Tests (mock the database)

_get_count() helper with various ResultRow inputs (int, float, None, list, dict)
_get_labels() helper with various inputs
Stats command output formatting with mock data
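
As one possible shape for the mocked-database tests, the sketch below uses `unittest.mock.MagicMock` in place of the ingestor. The `collect_totals` helper and the placeholder query strings are hypothetical and stand in for whatever the stats command does internally.

```python
from unittest.mock import MagicMock


def collect_totals(ingestor) -> tuple[int, int]:
    """Hypothetical helper: sum the `count` column of both stats queries."""
    node_rows = ingestor.fetch_all("NODE_QUERY")
    rel_rows = ingestor.fetch_all("REL_QUERY")
    total_nodes = sum(row["count"] for row in node_rows)
    total_rels = sum(row["count"] for row in rel_rows)
    return total_nodes, total_rels


def test_totals_with_mocked_database() -> None:
    ingestor = MagicMock()
    # side_effect yields one canned result per call, in call order.
    ingestor.fetch_all.side_effect = [
        [{"labels": ["Function"], "count": 3}, {"labels": ["Class"], "count": 1}],
        [{"type": "CALLS", "count": 5}],
    ]
    assert collect_totals(ingestor) == (4, 5)
    assert ingestor.fetch_all.call_count == 2


def test_empty_database_gives_zero_totals() -> None:
    ingestor = MagicMock()
    ingestor.fetch_all.return_value = []
    assert collect_totals(ingestor) == (0, 0)
```

Because no real connection is opened, tests like these can run without Memgraph and leave the database-backed scenarios to the integration tests.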

Integration Tests (using testcontainers/Memgraph)

Stats on empty database returns zero totals
Stats after indexing a small fixture codebase shows correct counts
Stats with connection failure returns proper error and exit code 1

Suggested test file: codebase_rag/tests/test_cli_stats.py

Example test structure:

from codebase_rag.cli import _get_count, _get_labels
from codebase_rag.types_defs import ResultRow


class TestGetCount:
    def test_int_value(self) -> None:
        row: ResultRow = {"count": 42}
        assert _get_count(row) == 42

    def test_float_value(self) -> None:
        row: ResultRow = {"count": 3.7}
        assert _get_count(row) == 3

    def test_none_value(self) -> None:
        row: ResultRow = {"count": None}
        assert _get_count(row) == 0

    def test_missing_key(self) -> None:
        row: ResultRow = {}
        assert _get_count(row) == 0

    def test_invalid_type_returns_zero(self) -> None:
        row: ResultRow = {"count": ["not", "an", "int"]}
        assert _get_count(row) == 0


class TestGetLabels:
    def test_list_of_strings(self) -> None:
        row: ResultRow = {"labels": ["Function", "Method"]}
        assert _get_labels(row) == ["Function", "Method"]

    def test_empty_list(self) -> None:
        row: ResultRow = {"labels": []}
        assert _get_labels(row) == []

    def test_missing_key(self) -> None:
        row: ResultRow = {}
        assert _get_labels(row) == []

    def test_none_values_filtered(self) -> None:
        row: ResultRow = {"labels": ["Function", None, "Class"]}
        assert _get_labels(row) == ["Function", "Class"]

    def test_non_list_returns_empty(self) -> None:
        row: ResultRow = {"labels": "not a list"}
        assert _get_labels(row) == []

Testing Results (After Fixes)

After applying the fixes above, all checks pass:

| Check | Status |
| --- | --- |
| `uv run ruff check` | ✅ |
| `uv run ty check` | ✅ |
| `uv run pytest` | ✅ 2781 passed |

Manual Testing

| Scenario | Result |
| --- | --- |
| Empty database | ✅ Shows tables with 0 totals |
| Populated database | ✅ Correct node/relationship counts displayed |
| Memgraph unavailable | ✅ Graceful error, exit code 1 |
| `--help` flag | ✅ Proper help text |

Summary

Run pre-commit hooks — This is mandatory and would catch these issues automatically
Fix import ordering with ruff check --fix
Add _get_count() and _get_labels() helper functions with proper type narrowing
Import ResultRow from .types_defs
Update the stats function to use the helper functions
Add unit tests for the helper functions and stats command behavior

The code works at runtime because Memgraph returns the correct types, but it doesn't pass our static type checks since ResultRow is typed broadly. Please address the items above to satisfy linting and type checking requirements. If you have any questions or comments, let me know.

vitali87 · 2026-01-19T11:01:34Z

@AndyBodnar see above regarding my test report.

github-project-automation bot added this to @vitali87's graph code Jan 10, 2026

gemini-code-assist bot reviewed Jan 10, 2026

greptile-apps bot reviewed Jan 10, 2026

fix: address bot review comments for stats command

019b237

- Move Cypher queries to cypher_queries.py as constants
- Use public fetch_all() instead of private _execute_query()
- Fix multi-label handling by returning all labels and joining with ':'

gemini-code-assist bot reviewed Jan 16, 2026

Clean up stats command based on review feedback

4995868

Removed the batch_size parameter since stats only reads from the database - write buffering doesn't apply here. Also removed an inline comment that was flagged for violating the project's comment policy.

fix exception handling and clean up type conversions

f07efdd
