@ngngwr (Collaborator) commented Dec 1, 2025

Issues

  • My PR addresses the following Helix issues and references them in the PR description:

Fixes a production NullPointerException in WagedRebalancer that occurs when a resource has been deleted from the cluster but still exists in the cached assignment metadata store.

Description

`Exception while executing DEFAULT pipeline for cluster uic-hs-33. Will not continue to next pipeline
java.lang.NullPointerException
at jdk.internal.reflect.GeneratedConstructorAccessor244.newInstance(null:-1)
at jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:499)
at java.lang.reflect.Constructor.newInstance(Constructor.java:480)
at java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:564)
at java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:591)
at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:689)
at java.util.stream.ForEachOps$ForEachOp.evaluateParallel(ForEachOps.java:159)
at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateParallel(ForEachOps.java:173)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596)
at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:765)
at org.apache.helix.controller.rebalancer.waged.WagedRebalancer.computeNewIdealStates(WagedRebalancer.java:277)
at org.apache.helix.controller.stages.BestPossibleStateCalcStage.computeResourceBestPossibleStateWithWagedRebalancer(BestPossibleStateCalcStage.java:445)
at org.apache.helix.controller.stages.BestPossibleStateCalcStage.compute(BestPossibleStateCalcStage.java:289)
at org.apache.helix.controller.stages.BestPossibleStateCalcStage.process(BestPossibleStateCalcStage.java:94)
at org.apache.helix.controller.pipeline.Pipeline.handle(Pipeline.java:75)
at org.apache.helix.controller.GenericHelixController.handleEvent(GenericHelixController.java:905)
at org.apache.helix.controller.GenericHelixController$ClusterEventProcessor.run(GenericHelixController.java:1556)
java.lang.NullPointerException`

Root Cause:

  1. A resource is deleted from the cluster (IdealState removed from ZooKeeper)
  2. The assignment metadata store still contains stale assignment data for the deleted resource
  3. When rebalance fails and falls back to cached assignment, newIdealStates contains IdealStates for deleted resources
  4. The resourceMap (built fresh from ZooKeeper) does not contain the deleted resource
  5. In the parallel stream at line 290, resourceMap.get(resourceName) returns null
  6. The null resource is passed to computeBestPossiblePartitionState(), causing NPE
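The failure mode in the steps above can be sketched in isolation. This is a minimal illustration, not the actual Helix code: plain `Map<String, String>` values stand in for `IdealState` and `Resource`, and the class and method names (`StaleResourceGuardSketch`, `computeUnguarded`, `computeGuarded`) are hypothetical. It only shows how a lookup miss inside a parallel stream surfaces as an NPE, and how a null check on the lookup result would avoid it.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative only: plain maps stand in for Helix's IdealState and Resource types.
class StaleResourceGuardSketch {

  // Mirrors the failing pattern: the lookup result is used without a null check.
  static Set<String> computeUnguarded(Map<String, String> cachedIdealStates,
      Map<String, String> resourceMap) {
    Set<String> computed = ConcurrentHashMap.newKeySet();
    cachedIdealStates.keySet().parallelStream().forEach(resourceName -> {
      String resource = resourceMap.get(resourceName); // null for a deleted resource
      resource.length(); // stand-in for using the resource; throws NPE when null
      computed.add(resourceName);
    });
    return computed;
  }

  // Same loop, but entries whose resource no longer exists are skipped.
  static Set<String> computeGuarded(Map<String, String> cachedIdealStates,
      Map<String, String> resourceMap) {
    Set<String> computed = ConcurrentHashMap.newKeySet();
    cachedIdealStates.keySet().parallelStream().forEach(resourceName -> {
      String resource = resourceMap.get(resourceName);
      if (resource == null) {
        return; // stale cached entry for a deleted resource: skip it
      }
      computed.add(resourceName);
    });
    return computed;
  }

  public static void main(String[] args) {
    // "db_deleted" exists only in the cached assignment, not in the fresh resource map.
    Map<String, String> resourceMap = Map.of("db_a", "resource-a");
    Map<String, String> cached = Map.of("db_a", "ideal-a", "db_deleted", "ideal-deleted");

    try {
      computeUnguarded(cached, resourceMap);
    } catch (NullPointerException e) {
      System.out.println("unguarded loop failed with NullPointerException");
    }
    System.out.println("guarded loop computed: " + computeGuarded(cached, resourceMap));
  }
}
```

Whether skipping silently is the right behavior for the real rebalancer (as opposed to logging or cleaning up the stale entry) is a design question this sketch does not answer.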

Tests

  • The following tests are written for this issue:

(List the names of added unit/integration tests)

  • The following is the result of the "mvn test" command on the appropriate module:

(If CI test fails due to known issue, please specify the issue and test PR locally. Then copy & paste the result of "mvn test" to here.)

mvn test -Dtest=TestWagedRebalancer -pl helix-core

    // the target assignment.
    newIdealStates.values().parallelStream().forEach(idealState -> {
      String resourceName = idealState.getResourceName();
      Resource resource = resourceMap.get(resourceName);

Can you share the code path through which we are getting these assignments? I see that AssignmentManager.getBestPossibleAssignment() already filters out keys that are not in the resource map.

Also, can we add filtering when creating/getting the cached assignments? That way we can also update the assignments and may avoid NPEs elsewhere.
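One way to read the suggestion above is to drop stale entries at the point where the cached assignment is read, so every downstream consumer sees only live resources. The following is a rough sketch under that assumption; `CachedAssignmentFilterSketch` and `retainLiveResources` are hypothetical names, and the generic map stands in for whatever type the assignment metadata store actually returns.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Illustrative only: not part of the Helix API.
class CachedAssignmentFilterSketch {

  // Returns a copy of the cached assignment restricted to resources that still exist.
  // The input map is left untouched.
  static <V> Map<String, V> retainLiveResources(Map<String, V> cachedAssignment,
      Set<String> liveResourceNames) {
    Map<String, V> filtered = new HashMap<>(cachedAssignment);
    // keySet() is a live view, so retainAll removes the stale entries from the copy.
    filtered.keySet().retainAll(liveResourceNames);
    return filtered;
  }

  public static void main(String[] args) {
    Map<String, String> cached = new HashMap<>();
    cached.put("db_a", "assignment-a");
    cached.put("db_deleted", "assignment-deleted");

    Map<String, String> filtered = retainLiveResources(cached, Set.of("db_a"));
    System.out.println("live assignments: " + filtered.keySet());
  }
}
```

Filtering at the source keeps the null check out of each consumer, at the cost of copying the map on every read; whether the filtered result should also be written back to the store is left open here.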

@proud-parselmouth (Collaborator) left a comment

Added some comments.
