@AndyBodnar

Summary

Adds a new graph-code stats command that displays statistics about the indexed codebase knowledge graph.

Changes

  • Added stats command to cli.py
  • Added STATS to CLICommandName enum in cli_help.py
  • Added CMD_STATS help text
  • Added constants for table column headers and error messages

Example Output

╭─ Node Statistics ────────────────────╮
│ Node Type      │ Count              │
├────────────────┼────────────────────┤
│ Function       │ 1,234              │
│ Class          │ 156                │
│ Module         │ 89                 │
├────────────────┼────────────────────┤
│ Total Nodes    │ 1,479              │
╰──────────────────────────────────────╯

╭─ Relationship Statistics ────────────╮
│ Relationship Type   │ Count          │
├─────────────────────┼────────────────┤
│ CALLS               │ 3,456          │
│ CONTAINS            │ 1,245          │
│ IMPORTS             │ 567            │
├─────────────────────┼────────────────┤
│ Total Relationships │ 5,268          │
╰──────────────────────────────────────╯
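As a rough illustration of how the totals and thousands separators in the example output could be computed, here is a minimal plain-Python sketch. The PR itself renders Rich tables; the `format_rows` helper and the sample counts are hypothetical and only mirror the figures shown above.

```python
def format_rows(counts: dict[str, int], total_label: str) -> list[str]:
    """Return one 'name  count' line per entry, followed by a total line."""
    # Pad every name to the widest label so the counts line up.
    width = max(len(name) for name in [*counts, total_label])
    lines = [f"{name:<{width}}  {count:,}" for name, count in counts.items()]
    lines.append(f"{total_label:<{width}}  {sum(counts.values()):,}")
    return lines


# Sample figures copied from the example output (not real query results).
node_counts = {"Function": 1234, "Class": 156, "Module": 89}
for line in format_rows(node_counts, "Total Nodes"):
    print(line)
```

The `:,` format specifier inserts the thousands separators, and the total is derived from the per-type counts rather than queried separately.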

Test plan

  • Python syntax valid
  • Run graph-code stats with Memgraph connected
  • Verify node counts match expected values
  • Verify relationship counts display correctly

Fixes #248

Adds a new `graph-code stats` command that displays statistics about
the indexed codebase graph including:
- Node counts by type (Function, Class, Module, etc.)
- Relationship counts by type (CALLS, CONTAINS, INHERITS, etc.)
- Total node and relationship counts
- Formatted output using Rich tables

Fixes vitali87#248
@gemini-code-assist
Contributor

Summary of Changes

Hello @AndyBodnar, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new command-line interface feature that allows users to quickly gain insights into the structure and content of their indexed codebase knowledge graph. By providing a clear, tabulated summary of node and relationship statistics, it enhances the usability of the tool, enabling better understanding and debugging of the graph data.

Highlights

  • New stats CLI Command: A new stats command has been introduced to the graph-code command-line interface, allowing users to display statistics about their indexed codebase knowledge graph.
  • Detailed Graph Statistics: The command provides a summary of node counts by label and relationship counts by type, offering a clear overview of the graph's composition.
  • Formatted Output with rich: Statistics are presented in well-formatted, color-coded tables using the rich library, enhancing readability and user experience. Total counts for both nodes and relationships are included.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a new stats command to the CLI, which is a great addition for visibility into the knowledge graph. The implementation is straightforward, but I've identified a few areas for improvement concerning robustness, adherence to API contracts, and code structure. My main suggestions involve making the Cypher queries more robust, using public methods instead of internal ones, removing an unused CLI parameter, and considering refactoring duplicated code for better maintainability, while acknowledging the trade-off with readability. These changes will make the new command more reliable and easier to maintain in the future.

@greptile-apps
Contributor

greptile-apps bot commented Jan 10, 2026

Greptile Overview

Greptile Summary

This PR adds a new graph-code stats command that displays statistics about the indexed codebase knowledge graph, showing node counts by type and relationship counts by type in formatted Rich tables.

What Changed

  • cli.py: Added 82-line stats() command function that queries Memgraph for node and relationship statistics
  • cli_help.py: Added STATS to CLICommandName enum and corresponding help text
  • constants.py: Added 8 new constants for stats command (table titles, column headers, error messages)
  • logs.py: Added STATS_ERROR log message template

Integration with Codebase

The implementation follows the existing CLI command pattern:

  • Uses Typer for argument parsing with --batch-size option
  • Leverages connect_memgraph() context manager from main.py
  • Uses Rich library for table formatting (consistent with other commands)
  • Follows error handling pattern with try/except and typer.Exit(1)

Issues Identified

Critical Issues (Must Fix)

  1. Multi-label node handling: The Cypher query labels(n)[0] only captures the first label of each node. In graph databases, nodes can have multiple labels (e.g., :Function and :Exported), but this query would only count each node under its primary label, potentially misrepresenting the graph structure.
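The difference between the grouping strategies can be seen on a small in-memory example. This is only a sketch: the `nodes` list is a hypothetical stand-in for graph nodes, not data from the PR, and the Python expressions imitate what the Cypher grouping would do.

```python
from collections import Counter

# Hypothetical stand-in for graph nodes: each node is represented by its labels.
nodes = [
    ["Function", "Exported"],
    ["Function"],
    ["Class", "Exported"],
]

# Equivalent of grouping by labels(n)[0]: only the first label is counted,
# so "Exported" never appears in the statistics.
first_label_counts = Counter(labels[0] for labels in nodes if labels)

# Grouping by the full label list, joined with ":" for display.
full_label_counts = Counter(":".join(labels) for labels in nodes if labels)

# Counting every label separately: a node with two labels contributes twice,
# so these counts no longer sum to the number of nodes.
per_label_counts = Counter(label for labels in nodes for label in labels)

print(first_label_counts)
print(full_label_counts)
print(per_label_counts)
```

Grouping by the full label list keeps the total equal to the node count while still exposing secondary labels. Counting per label is more granular but makes the "Total Nodes" row ambiguous.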

Style/Best Practice Issues (Should Fix)

  1. Inline comments without (H) marker: Lines 398, 402, 410, 433 contain inline comments that violate the project's strict comment policy
  2. Hardcoded Cypher queries: The two Cypher queries should be defined as constants in cypher_queries.py following the "Single Source of Truth" principle
  3. Private method usage: The code calls ingestor._execute_query() instead of the public fetch_all() method, breaking encapsulation

Positive Aspects

  • Proper error handling with exception catching and logging
  • Constants properly organized in constants.py and logs.py
  • Consistent with existing CLI command structure
  • Good use of Rich library for formatted output
  • Help text and enum values properly added

Confidence Score: 2/5

  • This PR has one critical logic issue with multi-label node handling and multiple style violations that should be addressed before merging
  • Score of 2 reflects: (1) One logic issue that could produce incorrect statistics for nodes with multiple labels - this is a functional correctness problem; (2) Multiple style violations including inline comments, hardcoded queries, and private method usage that deviate from the project's strict coding standards; (3) The implementation is otherwise functionally complete with proper error handling. While the feature works, the multi-label issue means the statistics may not accurately represent the graph, and the style violations need fixing to maintain code quality.
  • Pay close attention to codebase_rag/cli.py, particularly the Cypher query on line 400 which handles multi-label nodes incorrectly

Important Files Changed

File Analysis

| Filename | Score | Overview |
|---|---|---|
| codebase_rag/cli.py | 2/5 | Added new stats command with multiple style violations (inline comments without (H) marker, hardcoded Cypher queries, use of private method) and one logic issue (multi-label node handling) |
| codebase_rag/cli_help.py | 5/5 | Added STATS enum value and help text following existing patterns correctly, no issues found |
| codebase_rag/constants.py | 5/5 | Added stats-related constants (table headers, titles, error messages) following project conventions correctly, no issues found |
| codebase_rag/logs.py | 5/5 | Added STATS_ERROR log message following existing pattern correctly, no issues found |

Sequence Diagram

sequenceDiagram
    participant User
    participant CLI as cli.py:stats()
    participant Settings as config.settings
    participant Main as main.py
    participant Ingestor as MemgraphIngestor
    participant Memgraph as Memgraph DB
    participant Console as Rich Console

    User->>CLI: graph-code stats [--batch-size N]
    CLI->>Console: Print "Connecting to Memgraph..."
    CLI->>Settings: resolve_batch_size(batch_size)
    Settings-->>CLI: effective_batch_size
    CLI->>Main: connect_memgraph(effective_batch_size)
    Main->>Ingestor: __init__(host, port, batch_size)
    Main->>Ingestor: __enter__()
    Ingestor->>Memgraph: connect()
    Memgraph-->>Ingestor: connection established
    Ingestor-->>Main: ingestor instance
    Main-->>CLI: ingestor
    
    CLI->>Ingestor: _execute_query("MATCH (n) RETURN labels(n)[0]...")
    Ingestor->>Memgraph: Execute Cypher query (node counts)
    Memgraph-->>Ingestor: node statistics results
    Ingestor-->>CLI: node_results
    
    CLI->>Ingestor: _execute_query("MATCH ()-[r]->() RETURN type(r)...")
    Ingestor->>Memgraph: Execute Cypher query (relationship counts)
    Memgraph-->>Ingestor: relationship statistics results
    Ingestor-->>CLI: rel_results
    
    CLI->>CLI: Calculate total_nodes and total_rels
    CLI->>Console: Print node statistics table
    CLI->>Console: Print relationship statistics table
    
    CLI->>Ingestor: __exit__()
    Ingestor->>Memgraph: close connection
    
    alt Success
        CLI->>User: Display statistics tables
    else Exception
        CLI->>Console: Print error message
        CLI->>User: Exit with code 1
    end

Contributor

@greptile-apps greptile-apps bot left a comment

6 files reviewed, 6 comments

@vitali87
Owner

@AndyBodnar please clear all comments flagged by the bots. I will review only after they are all resolved

- Move Cypher queries to cypher_queries.py as constants
- Use public fetch_all() instead of private _execute_query()
- Fix multi-label handling by returning all labels and joining with ':'
@AndyBodnar
Author

I've addressed all the bot review comments in the latest commit:

  1. Multi-label node handling - Fixed by returning all labels(n) and joining them with : for display
  2. Hardcoded Cypher queries - Moved to cypher_queries.py as CYPHER_STATS_NODE_COUNTS and CYPHER_STATS_RELATIONSHIP_COUNTS constants
  3. Private method usage - Changed from _execute_query() to the public fetch_all() method
  4. Inline comments - Removed the inline comments
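A minimal sketch of how fixes 1 to 3 could fit together is shown below. The constant names come from the comment above, but the query text is an assumption (the actual contents of cypher_queries.py are not shown in this thread), and `FakeIngestor` is a hypothetical stand-in with a simplified `fetch_all` signature, not the project's MemgraphIngestor.

```python
# Assumed query text; the actual constants in cypher_queries.py may differ.
CYPHER_STATS_NODE_COUNTS = (
    "MATCH (n) RETURN labels(n) AS labels, count(n) AS count ORDER BY count DESC"
)
CYPHER_STATS_RELATIONSHIP_COUNTS = (
    "MATCH ()-[r]->() RETURN type(r) AS type, count(r) AS count ORDER BY count DESC"
)


class FakeIngestor:
    """Hypothetical stand-in for the real ingestor; returns canned rows."""

    def __init__(self, canned: dict[str, list[dict]]) -> None:
        self._canned = canned

    def _execute_query(self, query: str) -> list[dict]:
        # Private detail that callers should not depend on.
        return self._canned.get(query, [])

    def fetch_all(self, query: str) -> list[dict]:
        # Public entry point for read-only callers such as a stats command.
        return self._execute_query(query)


ingestor = FakeIngestor(
    {
        CYPHER_STATS_NODE_COUNTS: [{"labels": ["Function", "Exported"], "count": 3}],
        CYPHER_STATS_RELATIONSHIP_COUNTS: [{"type": "CALLS", "count": 7}],
    }
)

for row in ingestor.fetch_all(CYPHER_STATS_NODE_COUNTS):
    # Join all labels with ":" so multi-label nodes are displayed in full.
    print(":".join(row["labels"]) or "Unknown", row["count"])
```

Keeping the queries as named constants gives the command a single place to change them, and going through `fetch_all()` means the command does not break if the private helper is renamed.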

Ready for review when you have time!

@AndyBodnar
Author

All the bot comments have been resolved. Ready for your review whenever you have time.

@vitali87
Owner

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a useful stats command to display graph statistics. The implementation is straightforward and uses rich to present the data nicely. I've added a few suggestions to improve code clarity in the data processing loops and to make the error handling more robust. Overall, a good addition to the CLI.

Removed the batch_size parameter since stats only reads from the
database - write buffering doesn't apply here. Also removed an
inline comment that was flagged for violating the project's
comment policy.
@AndyBodnar
Author

Cleaned up the remaining items:

  • Removed the batch_size parameter from the stats command since it only does reads
  • Removed the inline comment that was flagged by the linter

The earlier commit already moved the Cypher queries to cypher_queries.py and switched to fetch_all() instead of the private method. Should be good to go now.

@AndyBodnar
Author

Thanks for catching those! Yeah the broad Exception catch is lazy on my part, I'll narrow it down to the specific mgclient errors. And good call on the redundant int/string conversions, will clean those up too. Pushing the fixes shortly.
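For context, narrowing the handler could look roughly like the sketch below. `DatabaseError` here is a hypothetical stand-in defined locally for illustration; it is not claimed to be the actual mgclient exception class, and `run_stats` is not the function from the PR.

```python
class DatabaseError(Exception):
    """Hypothetical stand-in for a driver-specific error type."""


def run_stats(fetch) -> int:
    """Return an exit code: 0 on success, 1 on a database error."""
    try:
        rows = fetch()
    except DatabaseError as exc:
        # Only database failures are handled here; unrelated bugs
        # (TypeError, KeyError, ...) still surface with a traceback.
        print(f"Failed to fetch stats: {exc}")
        return 1
    print(f"Fetched {len(rows)} rows")
    return 0


def failing_fetch():
    raise DatabaseError("connection refused")


print(run_stats(lambda: [{"count": 1}]))
print(run_stats(failing_fetch))
```

With a bare `except Exception`, a programming error inside the command would be reported as a connection problem; catching only the database error keeps those two failure modes distinct.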

@vitali87
Owner

Thank you, @AndyBodnar. Excellent job! Let me run and verify the functionality; if all is good, I will approve and merge 🙌

@AndyBodnar
Author

Did everything clean up efficiently?

@vitali87
Owner

Code Review: Type Safety, Linting, and Test Coverage Required

I've tested this PR thoroughly and found several issues that need to be addressed before merging.

Important: Running the pre-commit hook is mandatory before submitting changes. This would have caught the linting and type check errors below. Please ensure you run uv run ruff check --fix . && uv run ruff format . && uv run ty check codebase_rag/ before pushing commits.


1. Import Ordering (Linting Error)

File: codebase_rag/cli.py

The imports are not properly sorted. Running uv run ruff check codebase_rag/cli.py reports:

I001 [*] Import block is un-sorted or un-formatted

Fix: Run uv run ruff check --fix codebase_rag/cli.py, or manually reorder the imports so that the local import from .cypher_queries comes before .main (alphabetical order within the local import group).


2. Type Safety Issues with ResultRow (Type Check Errors)

File: codebase_rag/cli.py

Running uv run ty check codebase_rag/cli.py reports 3 errors. The root cause is that ResultRow is typed as:

type ResultRow = dict[str, ResultValue]
type ResultValue = ResultScalar | list[ResultScalar] | dict[str, ResultScalar]
type ResultScalar = str | int | float | bool | None

When calling row.get("count", 0), the return type is ResultValue, which includes None, list, and dict variants that cannot be passed directly to int() or used with str.join().

Errors 1 and 2: int() conversion on lines 397-398

total_nodes = sum(int(row.get("count", 0)) for row in node_results)
total_rels = sum(int(row.get("count", 0)) for row in rel_results)

Problem: row.get("count", 0) returns ResultValue which can be None, list, or dict — types that int() cannot accept.

Error 3: str.join() on line 410

labels = row.get("labels", [])
label = ":".join(labels) if labels else "Unknown"

Problem: row.get("labels", []) returns ResultValue, not list[str], so str.join() cannot accept it.

Fix: Add helper functions with proper type narrowing and import ResultRow:

from .types_defs import ResultRow


def _get_count(row: ResultRow) -> int:
    val = row.get("count", 0)
    return int(val) if isinstance(val, (int, float)) else 0


def _get_labels(row: ResultRow) -> list[str]:
    val = row.get("labels", [])
    if isinstance(val, list):
        return [str(v) for v in val if v is not None]
    return []

Then update the stats function to use these helpers:

total_nodes = sum(_get_count(row) for row in node_results)
total_rels = sum(_get_count(row) for row in rel_results)

# In the node loop:
labels = _get_labels(row)
count = _get_count(row)

# In the relationship loop:
count = _get_count(row)

3. Missing Test Coverage

The PR adds a new CLI command but includes no tests. At minimum, the following should be tested:

Unit Tests (mock the database)

  • _get_count() helper with various ResultRow inputs (int, float, None, list, dict)
  • _get_labels() helper with various inputs
  • Stats command output formatting with mock data
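A mocked-database unit test along these lines could be sketched as follows. `compute_totals` and the placeholder query strings are hypothetical and exist only to make the example self-contained; the real stats command may structure this differently.

```python
from unittest.mock import MagicMock


def compute_totals(ingestor) -> tuple[int, int]:
    """Hypothetical helper: total node and relationship counts from two queries."""
    node_rows = ingestor.fetch_all("NODE_COUNTS_QUERY")
    rel_rows = ingestor.fetch_all("REL_COUNTS_QUERY")
    return (
        sum(row["count"] for row in node_rows),
        sum(row["count"] for row in rel_rows),
    )


def test_totals_with_mock_ingestor() -> None:
    ingestor = MagicMock()
    # First call returns node rows, second call returns relationship rows.
    ingestor.fetch_all.side_effect = [
        [{"labels": ["Function"], "count": 2}, {"labels": ["Class"], "count": 1}],
        [{"type": "CALLS", "count": 5}],
    ]
    assert compute_totals(ingestor) == (3, 5)
    assert ingestor.fetch_all.call_count == 2


def test_totals_on_empty_graph() -> None:
    ingestor = MagicMock()
    ingestor.fetch_all.return_value = []
    assert compute_totals(ingestor) == (0, 0)


# Direct calls so the sketch also runs without pytest.
test_totals_with_mock_ingestor()
test_totals_on_empty_graph()
```

Using `side_effect` with a list lets one mock answer the two queries in order, so no database or container is needed for the formatting and totals logic.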

Integration Tests (using testcontainers/Memgraph)

  • Stats on empty database returns zero totals
  • Stats after indexing a small fixture codebase shows correct counts
  • Stats with connection failure returns proper error and exit code 1

Suggested test file: codebase_rag/tests/test_cli_stats.py

Example test structure:

from codebase_rag.cli import _get_count, _get_labels
from codebase_rag.types_defs import ResultRow


class TestGetCount:
    def test_int_value(self) -> None:
        row: ResultRow = {"count": 42}
        assert _get_count(row) == 42

    def test_float_value(self) -> None:
        row: ResultRow = {"count": 3.7}
        assert _get_count(row) == 3

    def test_none_value(self) -> None:
        row: ResultRow = {"count": None}
        assert _get_count(row) == 0

    def test_missing_key(self) -> None:
        row: ResultRow = {}
        assert _get_count(row) == 0

    def test_invalid_type_returns_zero(self) -> None:
        row: ResultRow = {"count": ["not", "an", "int"]}
        assert _get_count(row) == 0


class TestGetLabels:
    def test_list_of_strings(self) -> None:
        row: ResultRow = {"labels": ["Function", "Method"]}
        assert _get_labels(row) == ["Function", "Method"]

    def test_empty_list(self) -> None:
        row: ResultRow = {"labels": []}
        assert _get_labels(row) == []

    def test_missing_key(self) -> None:
        row: ResultRow = {}
        assert _get_labels(row) == []

    def test_none_values_filtered(self) -> None:
        row: ResultRow = {"labels": ["Function", None, "Class"]}
        assert _get_labels(row) == ["Function", "Class"]

    def test_non_list_returns_empty(self) -> None:
        row: ResultRow = {"labels": "not a list"}
        assert _get_labels(row) == []

Testing Results (After Fixes)

After applying the fixes above, all checks pass:

| Check | Status |
|---|---|
| uv run ruff check | |
| uv run ty check | |
| uv run pytest | ✅ 2781 passed |

Manual Testing

| Scenario | Result |
|---|---|
| Empty database | ✅ Shows tables with 0 totals |
| Populated database | ✅ Correct node/relationship counts displayed |
| Memgraph unavailable | ✅ Graceful error, exit code 1 |
| --help flag | ✅ Proper help text |

Summary

  1. Run pre-commit hooks — This is mandatory and would catch these issues automatically
  2. Fix import ordering with ruff check --fix
  3. Add _get_count() and _get_labels() helper functions with proper type narrowing
  4. Import ResultRow from .types_defs
  5. Update the stats function to use the helper functions
  6. Add unit tests for the helper functions and stats command behavior

The code works at runtime because Memgraph returns the correct types, but it doesn't pass our static type checks since ResultRow is typed broadly. Please address the items above to satisfy linting and type checking requirements. If you have any questions or comments, let me know.

@vitali87
Owner

@AndyBodnar see above regarding my test report.

Development

Successfully merging this pull request may close these issues.

Add graph statistics command to show node and relationship counts

2 participants