-
Notifications
You must be signed in to change notification settings - Fork 397
Telemetry-safe error reporting for native extensions #5076
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Telemetry-safe error reporting for native extensions #5076
Conversation
This gets us closer to allowing these errors to be sent to telemetry.
Add ruby_helpers.h include to 8 C files that use datadog_profiling_error_class and datadog_profiling_internal_error_class but were missing the header declaration. This fixes the compilation error: error: 'datadog_profiling_error_class' undeclared Files fixed: - clock_id_from_pthread.c - collectors_gc_profiling_helper.c - collectors_stack.c - collectors_thread_context.c - encoded_profile.c - libdatadog_helpers.c - private_vm_api_access.c - unsafe_api_calls_check.c
Move ruby_helpers.h include after private VM headers to avoid conflicts. This file requires private VM headers to be included first before any public Ruby headers, but ruby_helpers.h includes datadog_ruby_common.h which includes ruby.h, causing header ordering conflicts. Fixes compilation error: 'expected ')' before '==' token in RHASH_EMPTY_P'
Cannot include ruby_helpers.h in this file as it pulls in public Ruby headers (via datadog_ruby_common.h) that conflict with private VM headers. Instead, declare the exception class globals as extern, following the pattern already established in this file for other declarations. This fully resolves the header ordering compilation error.
Method was renamed from safe_exception_message to constant_exception_message but the RBS signature file was not updated, causing Steep type errors.
The error method must be public but was accidentally made private when constant_exception_message was added. Moving it before the private keyword restores its public visibility. Fixes test failure: NoMethodError: private method 'error' called
Serialization errors contain dynamic libdatadog content, so they should raise ProfilingInternalError (not ProfilingError or RuntimeError). Updated both the Ruby wrapper code and the test expectation to use ProfilingInternalError consistently. Fixes test failure expecting ProfilingError but getting RuntimeError.
BenchmarksBenchmark execution time: 2025-12-01 23:29:10 Comparing candidate commit f92ebb4 in PR branch Found 0 performance improvements and 0 performance regressions! Performance is the same for 44 metrics, 2 unstable metrics. |
…remediation-custom-profiler-code' into marcotc/error-logs-remediation-custom-profiler-code
b2d87f4 to
6a6bdfb
Compare
Signed-off-by: Marco Costa <marco.costa@datadoghq.com>
|
✅ Tests 🎉 All green!❄️ No new flaky tests detected 🎯 Code Coverage 🔗 Commit SHA: 976a297 | Docs | Datadog PR Page | Was this helpful? Give us feedback! |
|
The failure in the benchmarks... is because the "use the test code from this branch with master" doesn't work if master isn't API-compatible with this branch. TBH what I've done in the past is just disable the benchmark and re-enable it later, but it's kinda meh workaround. Suggestions welcome 😅 |
ivoanjo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking great! The one reason I didn't press the approve button is due to the raise_error in datadog_ruby_common.c -- that one doesn't quite look correct.
| void datadog_ruby_common_init(VALUE datadog_module) { | ||
| // No longer needed - using Ruby's built-in exception classes | ||
| (void)datadog_module; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest removing this + its callsites; it's a bit of complexity (including comments in "profiling.c" saying this needs to be called first) and I'm not sure it's worth leaving around.
| void raise_error(VALUE error_class, const char *fmt, ...) { | ||
| va_list args; | ||
| va_start(args, fmt); | ||
| VALUE message = rb_vsprintf(fmt, args); | ||
| va_end(args); | ||
| rb_raise(error_class, "%"PRIsVALUE, message); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, I'm confused by this... Is this a leftover from an earlier version?
That is, I thought the correct version of raise_error was the #define raise_error we have in ruby_helpers.c (the one that grabs the format string separately, etc); this one seems to just be a weird direct passthrough to rb_raise.
Am I missing something? 🤔 👀
| it "is an instance of RuntimeError" do | ||
| expect { raise_native_runtime_error }.to raise_exception(RuntimeError) | ||
| end | ||
| end | ||
|
|
||
| context "when raising ArgumentError" do | ||
| subject(:raise_native_argument_error) do | ||
| described_class::Testing._native_grab_gvl_and_raise(::ArgumentError, "argument error test", nil, true) | ||
| end | ||
|
|
||
| it "raises an ArgumentError" do | ||
| expect { raise_native_argument_error }.to raise_error(::ArgumentError) do |error| | ||
| expect(error.message).to eq("argument error test") | ||
| expect(error.instance_variable_get(:@telemetry_message)).to eq("argument error test") | ||
| end | ||
| end | ||
|
|
||
| it "is an instance of ArgumentError" do | ||
| expect { raise_native_argument_error }.to raise_exception(ArgumentError) | ||
| end | ||
| end | ||
|
|
||
| context "when raising TypeError" do | ||
| subject(:raise_native_type_error) do | ||
| described_class::Testing._native_grab_gvl_and_raise(::TypeError, "type error test", nil, true) | ||
| end | ||
|
|
||
| it "raises a TypeError" do | ||
| expect { raise_native_type_error }.to raise_error(::TypeError) do |error| | ||
| expect(error.message).to eq("type error test") | ||
| expect(error.instance_variable_get(:@telemetry_message)).to eq("type error test") | ||
| end | ||
| end | ||
|
|
||
| it "is an instance of TypeError" do | ||
| expect { raise_native_type_error }.to raise_exception(TypeError) | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: The "is an instance of" tests can be removed, since the raise_error(...) validations in the "it raises ..." testcases already include that.
| #define ENFORCE_SUCCESS_HELPER(expression, have_gvl) \ | ||
| { int result_syserr_errno = expression; if (RB_UNLIKELY(result_syserr_errno)) raise_syserr(result_syserr_errno, have_gvl, ADD_QUOTES(expression), __FILE__, __LINE__, __func__); } | ||
| { int result_syserr_errno = expression; if (RB_UNLIKELY(result_syserr_errno)) raise_enforce_syserr(result_syserr_errno, have_gvl, ADD_QUOTES(expression), __FILE__, __LINE__, __func__); } | ||
|
|
||
| #define RUBY_NUM_OR_NIL(val, condition, conv) ((val condition) ? conv(val) : Qnil) | ||
| #define RUBY_AVG_OR_NIL(total, count) ((count == 0) ? Qnil : DBL2NUM(((double) total) / count)) | ||
|
|
||
| // Called by ENFORCE_SUCCESS_HELPER; should not be used directly | ||
| NORETURN(void raise_syserr( | ||
| NORETURN(void raise_enforce_syserr( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: To match the private naming we used above, maybe these should become private_raise_syserr?
| // Make sure to *not* invoke Ruby code as this function can run in unsafe contexts. | ||
| // NOTE: Raising the exception acquires the GVL (unsafe), but it also aborts the current execution flow. | ||
| // @see debug_enter_unsafe_context |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // Make sure to *not* invoke Ruby code as this function can run in unsafe contexts. | |
| // NOTE: Raising the exception acquires the GVL (unsafe), but it also aborts the current execution flow. | |
| // @see debug_enter_unsafe_context | |
| // NOTE: Raising an exception always invokes Ruby code so it requires the GVL and is not compatible with "debug_enter_unsafe_context". | |
| // @see debug_enter_unsafe_context |
(I'll squash the commits on reviews finish)
What does this PR do?
This PR ensure errors raised from native
ext/code have enough context to be valuable in telemetry.Motivation:
Because error information sent to telemetry cannot have arbitrary, dynamic data, we must ensure that we are only sending values that are known at gem build time.
Because our fallback is to only report the exception class, this PR adds a
telemetry_messageto native exceptions, so that their error information is not completely lost.Change log entry
Yes. Telemetry: Added static error reporting for native extensions.Additional Notes:
There's a follow up PR to reduce code duplication for the
datadog_ruby_common.c/hfiles: #5088How to test the change?
Easiest and fastest test it:
bundle exec rake clean compile && bundle exec rake spec:profilingThere are more products supported by
libdatadog, but at that point, just run CI :)