Ambari metrics facing issues with helix-core version 1.3.2 / 1.4.3 #3071

@vishalsuvagia

Describe the bug

Apache Ambari Metrics uses Helix for cluster management tasks. We recently tried to upgrade the Helix dependency from 0.6.6 to 1.3.2 / 1.4.3. With the newer Helix version, the Metrics Collector fails to start when the Hadoop cluster is deployed in Kerberos-enabled mode.

Based on my investigation, the failure appears to come from a change in how helix-core initialises its ZooKeeper client: the client is never created, and a service shutdown is triggered with the error shown in the trace below. I have checked and confirmed that ZooKeeper is reachable and that the znode ambari-metrics-cluster is present with node information.

2025-09-17 10:54:29,633 WARN org.apache.helix.manager.zk.ZKHelixManager: zkClient to testnode01.mycluster.org:2181 is not connected, wait for 10000ms.
2025-09-17 10:54:39,635 ERROR org.apache.helix.manager.zk.ZKHelixManager: zkClient is not connected after waiting 10000ms., > clusterName: ambari-metrics-cluster, zkAddress: testnode01.mycluster.org:2181
ERROR org.apache.helix.manager.zk.ZKHelixManager: fail to createClient. retry 1
org.apache.helix.HelixException: HelixManager is not connected within retry timeout for cluster ambari-metrics-cluster
at org.apache.helix.manager.zk.ZKHelixManager.checkConnected(ZKHelixManager.java:417)
at org.apache.helix.manager.zk.ZKHelixManager.getConfigAccessor(ZKHelixManager.java:687)
at org.apache.helix.manager.zk.ParticipantManager.<init>(ParticipantManager.java:118)
at org.apache.helix.manager.zk.ZKHelixManager.handleNewSessionAsParticipant(ZKHelixManager.java:1440)
at org.apache.helix.manager.zk.ZKHelixManager.handleNewSession(ZKHelixManager.java:1390)
at org.apache.helix.manager.zk.ZKHelixManager.createClient(ZKHelixManager.java:782)
at org.apache.helix.manager.zk.ZKHelixManager.connect(ZKHelixManager.java:817)
at org.apache.ambari.metrics.core.timeline.availability.AggregationTaskRunner.initialize(AggregationTaskRunner.java:135)
at org.apache.ambari.metrics.core.timeline.availability.MetricCollectorHAController.startAggregators(MetricCollectorHAController.java:205)
at org.apache.ambari.metrics.core.timeline.availability.MetricCollectorHAController.initializeHAController(MetricCollectorHAController.java:184)
at org.apache.ambari.metrics.core.timeline.HBaseTimelineMetricsService.initializeSubsystem(HBaseTimelineMetricsService.java:133)
at org.apache.ambari.metrics.core.timeline.HBaseTimelineMetricsService.serviceInit(HBaseTimelineMetricsService.java:102)

I am trying to understand what changed in the library's behaviour and what the appropriate fix is. So far I have tried a few approaches, none of which worked: setting the ZooKeeper timeout through system properties and -D arguments; setting the helix.zk session and connection timeouts; and rewriting the ZKHelixManager initialisation to pass RealmAwareZkClient, RealmAwareZkClientConfig, CloudConfig and HelixManagerProperty instances built with the required parameters. Please help and guide me towards an appropriate fix for this issue.
For reference: Apache Ambari Metrics Helix upgrade (apache/ambari-metrics#173) and JDK-17 Support (apache/ambari-metrics#134).
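
Since the timeouts were passed as system properties, one quick check is whether those values actually reach the Collector JVM. Below is a minimal, stdlib-only sketch that prints the properties of interest. The class name is hypothetical, and the three Helix property names are assumptions recalled from Helix's SystemPropertyKeys; they should be verified against the Helix version in use. The zookeeper.sasl.* and JAAS entries are included only because the failure is specific to Kerberos mode; that SASL is involved is an assumption, not something the trace confirms.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

/** Hypothetical helper: reports JVM properties that may influence the ZooKeeper client. */
final class ZkClientPropertyReport {

    // Property names are assumptions; verify them against the Helix/ZooKeeper versions in use.
    static final List<String> KEYS = List.of(
        "zk.session.timeout",                    // Helix ZK session timeout in ms (assumed name)
        "zk.connection.timeout",                 // Helix ZK connection timeout in ms (assumed name)
        "helixmanager.waitForConnectedTimeout",  // Helix wait-for-connected timeout in ms (assumed name)
        "java.security.auth.login.config",       // path to the JAAS configuration file
        "zookeeper.sasl.client",                 // ZooKeeper client: enable or disable SASL
        "zookeeper.sasl.clientconfig",           // ZooKeeper client: JAAS section name
        "zookeeper.sasl.client.username");       // ZooKeeper client: primary of the server principal

    /** Returns each key mapped to its value, or "<unset>" when the property is absent. */
    static Map<String, String> report(Properties props) {
        Map<String, String> out = new LinkedHashMap<>();
        for (String key : KEYS) {
            out.put(key, props.getProperty(key, "<unset>"));
        }
        return out;
    }

    public static void main(String[] args) {
        // Run with the same -D arguments as the Metrics Collector to compare.
        report(System.getProperties())
            .forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Calling report(System.getProperties()) just before the HelixManager connect call, and logging the result, would show whether the -D arguments are visible at the point where the client is created.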

cc: @jackjlli / @Jackie-Jiang
