Skip to content

Conversation

@vidas
Copy link
Member

@vidas vidas commented Oct 27, 2025

Check kernel errors (return values).

Fail and complain loudly if kernel execution fails. Skips empty contexts (you don't want ~1000 debug blocks if one thread failed)(do you?).

@glguida I just remembered that I have this commit in my "working branch" - only checks for user errors, but could be a starting point to understand other kernel launching issues. Didn't find stream callbacks necessary in this case.

@glguida
Copy link
Member

glguida commented Oct 27, 2025

So this IIUC -- away from code -- implies error on execution of kernel.

For runtime errors when launching the kernel by the MM, we need this:

https://github.com/aifoundry-org/et-platform/blob/022221990ca713f946eaa4379817cb59bb852be4/esperanto-tools-libs/src/Runtime.cpp#L170

At the moment, these kind of runtime errors are ignored. When the callback is set, we can generate an exception, which will be catched by the try block.

In general this patch looks fine, but we need to arm the runtime callback to generate exceptions too.

glguida
glguida previously approved these changes Oct 27, 2025
Copy link
Member

@glguida glguida left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, now tested. This doesn't stop execution though, so it seems like we need the callback.

// Skip uninitialized contexts (debug fill pattern 0xcdcdcdcdcdcdcdcd)
if (ctx.type_ == 4 && ctx.hartId_ != 0xcdcdcdcdcdcdcdcdULL) {
int64_t kernel_return_code = ctx.userDefinedError_;
GGML_LOG_ERROR("ET: Kernel '%s' returned error code %lld on hart %lld (shire %lld)\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be very useful to print the device number here, too.

@glguida glguida self-requested a review October 27, 2025 10:39
@glguida glguida dismissed their stale review October 27, 2025 10:45

While this code is correct, I wonder if the Stream Error callback with exception is actually the right mechanism, instead of manually parsing the error.

@vidas vidas force-pushed the check-kernel-errors branch from e684bec to d88b94c Compare November 10, 2025 12:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants