Add heterogeneous cluster support #448

Open
saurabhcnf wants to merge 7 commits into master from add-heterogeneous-cluster-support

Conversation

@saurabhcnf
Member

Description

Adds a node_type abstraction that lets tests declare node requirements, enabling heterogeneous cluster support where different tests can request different types of nodes (e.g., "small", "large"). Nodes are partitioned into separate pools based on node_type, and a test can request a specific type of node by passing node_type through the @cluster annotation.

New Capability

Tests can now specify node_type in the @cluster decorator:

@cluster(num_nodes=5, node_type="large")
def test_performance(self):
    ...

Backward Compatible

node_type=None (the default) matches any available node, so existing tests and cluster configs continue to work unchanged.
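
For illustration, a test that omits node_type and a test that requests a typed node can coexist in one suite. The class name and test bodies below are hypothetical; only the decorator usage follows this PR:

from ducktape.mark.resource import cluster
from ducktape.tests.test import Test


class ExampleTest(Test):
    # Existing style: no node_type, so any available node satisfies the request.
    @cluster(num_nodes=3)
    def test_any_nodes(self):
        ...

    # New style: only nodes labeled "large" in the cluster config are allocated.
    @cluster(num_nodes=2, node_type="large")
    def test_large_nodes(self):
        ...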

Cluster JSON Format

{
  "nodes": [
    {"node_type": "large", "ssh_config": {...}},
    {"node_type": "small", "ssh_config": {...}}
  ]
}

Testing

Backward compatibility test

https://semaphore.ci.confluent.io/workflows/c0f7aaa3-b8ba-4b52-a76b-ac37a89ac709

Heterogeneous cluster test

https://semaphore.ci.confluent.io/workflows/1c50cdfa-288a-4c76-ab42-5fc31d52291d?pipeline_id=7617567a-d7b4-4911-b6ac-45efd8d6c68e

Issue

https://confluentinc.atlassian.net/browse/CPTF-1412

@confluent-cla-assistant

🎉 All Contributor License Agreements have been signed. Ready to merge.
Please push an empty commit if you would like to re-run the checks to verify CLA status for all contributors.

@saurabhcnf saurabhcnf marked this pull request as ready for review February 2, 2026 06:42
@saurabhcnf saurabhcnf requested a review from a team as a code owner February 2, 2026 06:42
Copilot AI review requested due to automatic review settings February 2, 2026 06:42

Copilot AI left a comment

Pull request overview

This PR introduces node type awareness into ducktape's cluster specification and allocation logic so tests can request heterogeneous clusters (e.g., "small", "large" nodes) while preserving backward compatibility when no type is specified.

Changes:

  • Extend NodeSpec, ClusterSpec, RemoteAccount, and NodeContainer to carry and use a node_type label for matching and allocation.
  • Wire node_type through JSON cluster configs, test context metadata, and the @cluster marker, plus add tests covering NodeSpec, ClusterSpec.simple_linux, and NodeContainer behavior with types.
  • Maintain backward-compatible behavior when node_type is None (treated as "match any") and via an os_to_nodes compatibility view over the new (os, node_type) grouping.
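
As a rough illustration of that grouping plus the os-only compatibility view (a simplified sketch, not the PR's actual NodeContainer code; the helper names are made up):

from collections import defaultdict
from typing import Dict, List, Optional, Tuple

NodeGroupKey = Tuple[str, Optional[str]]  # (operating_system, node_type)


def group_by_os_and_type(nodes) -> Dict[NodeGroupKey, List]:
    """Bucket nodes by (os, node_type); untyped nodes land under node_type=None."""
    groups: Dict[NodeGroupKey, List] = defaultdict(list)
    for node in nodes:
        groups[(node.operating_system, node.node_type)].append(node)
    return dict(groups)


def os_to_nodes(groups: Dict[NodeGroupKey, List]) -> Dict[str, List]:
    """Backward-compatible view keyed only by os, collapsing node_type."""
    view: Dict[str, List] = defaultdict(list)
    for (operating_system, _node_type), bucket in groups.items():
        view[operating_system].extend(bucket)
    return dict(view)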

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 5 comments.

Show a summary per file
File | Description
tests/runner/fake_remote_account.py | Propagates node_type into fake accounts so tests can simulate typed nodes.
tests/cluster/check_node_container.py | Adds helpers and tests validating NodeContainer grouping and allocation behavior with node_type.
tests/cluster/check_cluster_spec.py | Adds tests for ClusterSpec.simple_linux and NodeSpec, including node type, matching, and string representation.
ducktape/tests/test_context.py | Extends expected cluster spec derivation to honor node_type alongside num_nodes.
ducktape/mark/resource.py | Documents node_type as a recognized @cluster hint and shows example usage.
ducktape/mark/consts.py | Introduces the CLUSTER_NODE_TYPE_KEYWORD constant used in the test context.
ducktape/cluster/remoteaccount.py | Adds node_type to RemoteAccount to carry node classification from cluster configuration.
ducktape/cluster/node_spec.py | Extends NodeSpec with node_type, matching semantics, equality, hashing, and an updated string representation.
ducktape/cluster/node_container.py | Refactors NodeContainer to group nodes by (os, node_type) and updates remove/can_remove logic for type-aware allocation.
ducktape/cluster/json.py | Reads the optional node_type from the JSON cluster config and passes it to remote account construction.
ducktape/cluster/cluster_spec.py | Updates ClusterSpec.simple_linux and from_nodes to propagate node_type into specs.
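
To make the "None matches any" rule concrete, the matching semantics described above can be sketched as follows (an illustrative simplification, not the NodeSpec code from this PR):

from typing import Optional


def node_type_matches(requested: Optional[str], available: Optional[str]) -> bool:
    """A request with node_type=None accepts any node; otherwise the types must match exactly."""
    return requested is None or requested == available


assert node_type_matches(None, "large")         # untyped request takes any node
assert node_type_matches("large", "large")      # typed request matches the same type
assert not node_type_matches("large", "small")  # typed request never takes a different type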

@saurabhcnf saurabhcnf requested a review from Copilot February 3, 2026 17:24

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

Comment on lines +166 to 168
node_type: Optional[str] = None,
logger: Optional[logging.Logger] = None,
ssh_exception_checks: List[Callable] = [],

Copilot AI Feb 3, 2026

Adding node_type between externally_routable_ip and logger changes the positional signature of RemoteAccount.__init__, so existing callers that pass logger or ssh_exception_checks positionally will now have their arguments bound incorrectly. To keep this change backward compatible, consider appending node_type after the existing parameters (or making it keyword-only via a * separator) so all prior positional call sites continue to work as before.

Suggested change
-node_type: Optional[str] = None,
-logger: Optional[logging.Logger] = None,
-ssh_exception_checks: List[Callable] = [],
+logger: Optional[logging.Logger] = None,
+ssh_exception_checks: List[Callable] = [],
+*,
+node_type: Optional[str] = None,

Comment on lines 111 to 121
ssh_config = RemoteAccountSSHConfig(**ninfo.get("ssh_config", {}))

# Extract node_type from JSON (optional field)
node_type = ninfo.get("node_type")

remote_account = make_remote_account_func(
ssh_config,
ninfo.get("externally_routable_ip"),
node_type=node_type,
ssh_exception_checks=kwargs.get("ssh_exception_checks"),
)

Copilot AI Feb 3, 2026

If make_remote_account_func is a user-supplied factory, unconditionally passing a new node_type keyword argument will break existing implementations that only accept (ssh_config, externally_routable_ip, ssh_exception_checks). To preserve compatibility, either (a) keep the call signature unchanged and handle node_type inside your own factory, or (b) conditionally include the node_type keyword only when the target function's signature supports it (e.g., via inspect.signature) and clearly document the new optional parameter.
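
A hedged sketch of option (b), probing the factory's signature before passing the new keyword (call_factory and its argument handling are illustrative, not code from this PR):

import inspect


def call_factory(make_remote_account_func, ssh_config, externally_routable_ip,
                 node_type=None, ssh_exception_checks=None):
    """Pass node_type only when the factory's signature accepts it (or takes **kwargs)."""
    params = inspect.signature(make_remote_account_func).parameters
    accepts_node_type = "node_type" in params or any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    kwargs = {"ssh_exception_checks": ssh_exception_checks}
    if accepts_node_type:
        kwargs["node_type"] = node_type
    return make_remote_account_func(ssh_config, externally_routable_ip, **kwargs)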

Comment on lines +174 to +186
def _group_spec_by_key(self, cluster_spec: ClusterSpec) -> Dict[NodeGroupKey, List["NodeSpec"]]:
"""
Group the NodeSpecs in a ClusterSpec by (os, node_type) key.

:param cluster_spec: The cluster spec to group
:return: Dictionary mapping (os, node_type) to list of NodeSpecs
"""
result: Dict[NodeGroupKey, List["NodeSpec"]] = {}
for node_spec in cluster_spec.nodes.elements():
key = (node_spec.operating_system, node_spec.node_type)
result.setdefault(key, []).append(node_spec)
return result

Copilot AI Feb 3, 2026

The helper _group_spec_by_key is currently unused; grouping for specs is handled via cluster_spec.nodes.grouped_by_os_and_type() instead. To avoid dead code and potential confusion, either remove this method or refactor call sites to use it consistently for spec grouping.

Suggested change
-def _group_spec_by_key(self, cluster_spec: ClusterSpec) -> Dict[NodeGroupKey, List["NodeSpec"]]:
-    """
-    Group the NodeSpecs in a ClusterSpec by (os, node_type) key.
-
-    :param cluster_spec: The cluster spec to group
-    :return: Dictionary mapping (os, node_type) to list of NodeSpecs
-    """
-    result: Dict[NodeGroupKey, List["NodeSpec"]] = {}
-    for node_spec in cluster_spec.nodes.elements():
-        key = (node_spec.operating_system, node_spec.node_type)
-        result.setdefault(key, []).append(node_spec)
-    return result

Comment on lines 52 to 54
- ``num_nodes`` provide hint about how many nodes the test will consume
- ``node_type`` provide hint about what type of nodes the test needs (e.g., "large", "small")
- ``cluster_spec`` provide hint about how many nodes of each type the test will consume

Copilot AI Feb 3, 2026

The phrase 'provide hint' is grammatically off; it should be 'provides a hint' to read correctly (and it would be good to update the existing entries to match for consistency).

Suggested change
-- ``num_nodes`` provide hint about how many nodes the test will consume
-- ``node_type`` provide hint about what type of nodes the test needs (e.g., "large", "small")
-- ``cluster_spec`` provide hint about how many nodes of each type the test will consume
+- ``num_nodes`` provides a hint about how many nodes the test will consume
+- ``node_type`` provides a hint about what type of nodes the test needs (e.g., "large", "small")
+- ``cluster_spec`` provides a hint about how many nodes of each type the test will consume
