Description
Describe the bug
Apache Ambari Metrics uses Helix for cluster management tasks. We recently tried to upgrade the Helix dependency from 0.6.6 to 1.3.2 / 1.4.3. With the newer Helix versions, the Metrics Collector fails to start when the Hadoop cluster is deployed in Kerberos-enabled mode.
Based on my investigation, the issue appears to come from the change in Helix core ZooKeeper initialisation: it fails to create the ZooKeeper client, and a service shutdown is triggered with the error in the trace below. I have checked and confirmed that ZooKeeper is reachable and that the znode ambari-metrics-cluster is present with node information.
2025-09-17 10:54:29,633 WARN org.apache.helix.manager.zk.ZKHelixManager: zkClient to testnode01.mycluster.org:2181 is not connected, wait for 10000ms.
2025-09-17 10:54:39,635 ERROR org.apache.helix.manager.zk.ZKHelixManager: zkClient is not connected after waiting 10000ms., clusterName: ambari-metrics-cluster, zkAddress: testnode01.mycluster.org:2181
ERROR org.apache.helix.manager.zk.ZKHelixManager: fail to createClient. retry 1
org.apache.helix.HelixException: HelixManager is not connected within retry timeout for cluster ambari-metrics-cluster
at org.apache.helix.manager.zk.ZKHelixManager.checkConnected(ZKHelixManager.java:417)
at org.apache.helix.manager.zk.ZKHelixManager.getConfigAccessor(ZKHelixManager.java:687)
at org.apache.helix.manager.zk.ParticipantManager.<init>(ParticipantManager.java:118)
at org.apache.helix.manager.zk.ZKHelixManager.handleNewSessionAsParticipant(ZKHelixManager.java:1440)
at org.apache.helix.manager.zk.ZKHelixManager.handleNewSession(ZKHelixManager.java:1390)
at org.apache.helix.manager.zk.ZKHelixManager.createClient(ZKHelixManager.java:782)
at org.apache.helix.manager.zk.ZKHelixManager.connect(ZKHelixManager.java:817)
at org.apache.ambari.metrics.core.timeline.availability.AggregationTaskRunner.initialize(AggregationTaskRunner.java:135)
at org.apache.ambari.metrics.core.timeline.availability.MetricCollectorHAController.startAggregators(MetricCollectorHAController.java:205)
at org.apache.ambari.metrics.core.timeline.availability.MetricCollectorHAController.initializeHAController(MetricCollectorHAController.java:184)
at org.apache.ambari.metrics.core.timeline.HBaseTimelineMetricsService.initializeSubsystem(HBaseTimelineMetricsService.java:133)
at org.apache.ambari.metrics.core.timeline.HBaseTimelineMetricsService.serviceInit(HBaseTimelineMetricsService.java:102)
I am trying to understand what changed on the library side and what the appropriate fix is. So far I have tried the following approaches, none of which worked:
- Setting the ZooKeeper timeouts through system properties (-D arguments), including the Helix ZooKeeper session and connection timeouts.
- Rewriting the ZKHelixManager initialisation to pass RealmAwareZkClient, RealmAwareZkClientConfig, CloudConfig and HelixManagerProperty instances built with the required parameters.
Could you please help and guide us towards an appropriate fix for this issue?
For reference: Apache Ambari Metrics Helix upgrade (apache/ambari-metrics#173) and JDK-17 Support (apache/ambari-metrics#134).
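For anyone trying to reproduce the timeout attempt, here is a minimal sketch of how the timeouts could be set before the Helix manager is created. The class name `HelixZkTimeoutSketch`, its helper methods, and the values are hypothetical. The property names `zk.session.timeout` and `zk.connection.timeout` are assumptions that should be checked against `SystemPropertyKeys` in the Helix version in use. The sketch only builds and applies the properties; it does not call Helix.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class HelixZkTimeoutSketch {
    // Assumed property names; verify against SystemPropertyKeys in your Helix version.
    static final String SESSION_TIMEOUT_KEY = "zk.session.timeout";
    static final String CONNECTION_TIMEOUT_KEY = "zk.connection.timeout";

    // Build the timeout properties in a stable order, rejecting non-positive values.
    static Map<String, String> timeoutProperties(int sessionMs, int connectionMs) {
        if (sessionMs <= 0 || connectionMs <= 0) {
            throw new IllegalArgumentException("timeouts must be positive");
        }
        Map<String, String> props = new LinkedHashMap<>();
        props.put(SESSION_TIMEOUT_KEY, Integer.toString(sessionMs));
        props.put(CONNECTION_TIMEOUT_KEY, Integer.toString(connectionMs));
        return props;
    }

    // Apply the properties to the running JVM; this must happen before
    // the Helix manager is constructed for the values to be picked up.
    static void apply(Map<String, String> props) {
        props.forEach(System::setProperty);
    }

    // Render the same properties as -D arguments for a launcher script.
    static String asJvmArgs(Map<String, String> props) {
        return props.entrySet().stream()
                .map(e -> "-D" + e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Illustrative values only, not recommendations.
        Map<String, String> props = timeoutProperties(60000, 120000);
        apply(props);
        System.out.println(asJvmArgs(props));
    }
}
```

Keeping the property names in one place makes it easy to confirm that the programmatic settings and the -D arguments passed to the collector are identical, which rules out a typo as the reason the timeouts had no effect.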
cc: @jackjlli / @Jackie-Jiang
To Reproduce
Steps to reproduce the behavior.
Expected behavior
A clear and concise description of what you expected to happen.
Additional context
Add any other context about the problem here.