fix(server): fix the scheduler and the scheduler selection logic #2937
Tsukilc wants to merge 35 commits into apache:master from Tsukilc:fix/scheduler
Conversation
…into fix/scheduler
Codecov Report: ❌ Patch coverage is …
Additional details and impacted files

```diff
@@             Coverage Diff             @@
##             master    #2937       +/-   ##
=============================================
- Coverage     39.38%   29.23%     -10.16%
+ Complexity      456      264        -192
=============================================
  Files           812      801         -11
  Lines         68660    67341       -1319
  Branches       8968     8734        -234
=============================================
- Hits          27044    19689       -7355
- Misses        38824    45423       +6599
+ Partials       2792     2229        -563
```

☔ View full report in Codecov by Sentry.
… into fix/scheduler
47ec466 to 68b906a
Pull request overview
This PR disables the master-worker scheduling logic in StandardTaskScheduler, transitioning to a simplified task execution model. The scheduler type is now auto-determined based on storage backend (distributed for hstore, local otherwise), and server.id is auto-generated if not specified.
Changes:
- Removed SCHEDULER_TYPE configuration option and master-role scheduling logic from TaskManager and StandardTaskScheduler
- Removed server.id and server.role configuration requirements from rest-server.properties
- Added auto-generation of server.id using UUID when not explicitly configured
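For reviewers skimming the change list, the standalone sketch below condenses the two behaviors described above, based on the fragments quoted later in this thread (the isHstore()-based ternary in StandardHugeGraph and the UUID fallback in GraphManager). The class and method names here are illustrative only, not the PR's actual structure.

```java
import java.util.UUID;

// Illustrative sketch only; see StandardHugeGraph.schedulerType() and
// GraphManager in the PR for the real code paths.
public class SchedulerSelectionSketch {

    // Mirrors the quoted selection rule: distributed scheduler for the
    // hstore backend, local scheduler for every other backend.
    static String schedulerType(String backend) {
        return "hstore".equalsIgnoreCase(backend) ? "distributed" : "local";
    }

    // Mirrors the quoted fallback: auto-generate server.id when it is not
    // explicitly configured (empty default).
    static String serverId(String configuredId) {
        if (configuredId == null || configuredId.isEmpty()) {
            return "server-" + UUID.randomUUID().toString().substring(0, 8);
        }
        return configuredId;
    }

    public static void main(String[] args) {
        System.out.println(schedulerType("hstore"));   // distributed
        System.out.println(schedulerType("rocksdb"));  // local
        System.out.println(serverId(""));              // e.g. server-1a2b3c4d
    }
}
```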
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 7 comments.
Summary per file:
| File | Description |
|---|---|
| CoreOptions.java (hugegraph-struct) | Removed SCHEDULER_TYPE configuration option |
| CoreOptions.java (hugegraph-core) | Removed SCHEDULER_TYPE configuration option |
| TaskCoreTest.java | Updated tests with conditional assertions for DistributedTaskScheduler and code formatting changes |
| rest-server.properties | Removed server.id and server.role configuration entries |
| hugegraph.properties | Removed task.scheduler_type configuration entry |
| hstore.properties.template | Removed task.scheduler_type configuration entry |
| TaskManager.java | Removed scheduling thread pool and master/worker role-based scheduling logic |
| StandardTaskScheduler.java | Simplified to remove multi-node task distribution, server filtering, and load balancing |
| ServerInfoManager.java | Removed pickWorkerNode, updateServerInfos, and load management methods |
| HugeServerInfo.java | Removed suitableFor method for task-server matching |
| StandardRoleListener.java | Removed enableRoleElection call during initialization |
| GlobalMasterInfo.java | Updated TODO comment |
| StandardHugeGraph.java | Changed schedulerType to be dynamically determined based on backend type |
| GraphManager.java | Added auto-generation of server.id and removed validation requirements |
| ServerOptions.java | Changed SERVER_ID default to empty string and updated description |
| hugegraph.properties.template | Removed task.scheduler_type from cluster test configuration |
| server2-conf/hugegraph.properties | Removed task.scheduler_type from Docker configuration |
hugegraph-server/hugegraph-test/src/main/java/org/apache/hugegraph/core/TaskCoreTest.java (outdated; resolved)
```java
HugeTask<?> memTask = this.tasks.get(task.id());
if (memTask != null) {
    boolean cancelled = memTask.cancel(true);
    LOG.info("Task '{}' cancel result: {}", task.id(), cancelled);
```
The cancel method no longer saves the task to storage when cancelling a task that's in memory. If the task is running in memory and gets cancelled via memTask.cancel(true), the cancelled status won't be persisted to storage until the task naturally completes. This could cause issues if the server restarts before the task finishes - the task would be restored and re-executed. Consider saving the task status to storage after successfully calling memTask.cancel(true).
Suggested change:
```java
LOG.info("Task '{}' cancel result: {}", task.id(), cancelled);
if (cancelled) {
    this.save(memTask);
}
```
this.callable.cancelled() will take care of invoking the task scheduler to persist the status.
...aph-server/hugegraph-core/src/main/java/org/apache/hugegraph/task/StandardTaskScheduler.java (resolved)
...aph-server/hugegraph-core/src/main/java/org/apache/hugegraph/task/StandardTaskScheduler.java (resolved)
...aph-server/hugegraph-core/src/main/java/org/apache/hugegraph/task/StandardTaskScheduler.java (resolved)
hugegraph-server/hugegraph-test/src/main/java/org/apache/hugegraph/core/TaskCoreTest.java (outdated; resolved)
5416720 to f8fc58a
...-server/hugegraph-core/src/main/java/org/apache/hugegraph/task/DistributedTaskScheduler.java (resolved)
```diff
@@ -284,14 +295,41 @@ protected <V> void initTaskParams(HugeTask<V> task) {
 }
```
In the DistributedTaskScheduler.delete() method (lines 286-305), the deletion logic differs significantly from the original implementation.
Original logic:
- force=false: set the status to DELETING and return null
- force=true: delete directly from the database
New logic:
```java
if (!force) {
    if (!task.completed() && task.status() != TaskStatus.DELETING) {
        throw new IllegalArgumentException(
                String.format("Can't delete incomplete task '%s' in status %s, " +
                              "Please try to cancel the task first",
                              id, task.status()));
    }
}
return this.deleteFromDB(id);
```
Issues:
- The logic that marks tasks as DELETING has been removed, which may break code that relies on the periodic cleanup
- A non-force delete now removes completed tasks directly instead of first marking them as DELETING
- The behavior may be inconsistent with the StandardTaskScheduler implementation
Suggestion: reconsider the deletion flow to stay compatible with the original behavior, or state this behavior change explicitly in the PR description.
```java
LOG.info("cancel task({}) error, task has completed", task.id());
    task.overwriteStatus(TaskStatus.CANCELLING);
    }
}
```
The TODO comment at line 334 is too vague:
```java
//todo: serverInfoManager section should be removed in the future.
return this.serverManager().close();
//return true;
```
Issues:
- It does not explain why serverInfoManager should be removed
- It does not state when or under what preconditions the removal should happen
- Commented-out code should be deleted rather than kept
Suggested change:
```java
// TODO(issue-XXX): Remove serverInfoManager.close() after migrating to
// pure single-node architecture. Currently kept for backward compatibility.
return this.serverManager().close();
```
Code review summary: thank you for submitting this important architecture-simplification PR! I have reviewed all the changes in detail; besides the comments already posted on specific code lines, the following key issues need attention:
Pull request overview
Copilot reviewed 20 out of 20 changed files in this pull request and generated 12 comments.
hugegraph-server/hugegraph-test/src/main/java/org/apache/hugegraph/core/TaskCoreTest.java (resolved)
hugegraph-server/hugegraph-test/src/main/java/org/apache/hugegraph/core/TaskCoreTest.java (resolved)
hugegraph-server/hugegraph-test/src/main/java/org/apache/hugegraph/core/TaskCoreTest.java (resolved)
...-server/hugegraph-core/src/main/java/org/apache/hugegraph/task/DistributedTaskScheduler.java (resolved)
...-server/hugegraph-core/src/main/java/org/apache/hugegraph/task/DistributedTaskScheduler.java (resolved)
```java
@Override
public String schedulerType() {
    return StandardHugeGraph.this.schedulerType;
    // Use distributed scheduler for hstore backend, otherwise use local
```
The schedulerType() method now determines the scheduler type based on whether the backend is hstore. However, this logic change is undocumented. Consider adding a comment explaining why hstore backends require distributed scheduling while other backends use local scheduling, as this is an important architectural decision.
Suggested change:
```java
/*
 * HStore is a distributed backend: data and tasks may be handled by
 * multiple graph servers that must coordinate scheduling and state.
 * For this reason we require a distributed task scheduler when the
 * backend is hstore so that jobs can be balanced and recovered
 * across nodes. For other backends, the graph is served by a single
 * server instance and tasks are executed locally, so a local
 * in-process scheduler is sufficient and avoids the overhead of
 * distributed coordination.
 */
```
...-server/hugegraph-core/src/main/java/org/apache/hugegraph/task/DistributedTaskScheduler.java (resolved)
...-server/hugegraph-core/src/main/java/org/apache/hugegraph/task/DistributedTaskScheduler.java (resolved)
...-server/hugegraph-core/src/main/java/org/apache/hugegraph/task/DistributedTaskScheduler.java (resolved)
...aph-server/hugegraph-core/src/main/java/org/apache/hugegraph/task/StandardTaskScheduler.java (resolved)
…hugegraph into fix/scheduler
This reverts commit 28e0390.
Pull request overview
Copilot reviewed 20 out of 20 changed files in this pull request and generated 7 comments.
hugegraph-server/hugegraph-api/src/main/java/org/apache/hugegraph/core/GraphManager.java (resolved)
```java
// Transfer `pd.peers` from server config to graph config
// Only inject if not already configured in graph config
if (!config.containsKey("pd.peers")) {
```
The pd.peers configuration is being transferred from server config to graph config (lines 1571-1574), but this happens unconditionally for all graphs during loadGraph. This means even graphs that don't need PD will get this configuration injected. Consider checking if the graph actually uses PD (e.g., by checking if it's an hstore backend) before injecting this configuration, or document why all graphs need this configuration.
Suggested change:
```java
// Transfer `pd.peers` from server config to graph config when PD is used
// Only inject if not already configured in graph config and backend uses PD
String backend = config.get(CoreOptions.BACKEND);
boolean backendUsesPd = "hstore".equalsIgnoreCase(backend);
if (backendUsesPd && !config.containsKey("pd.peers")) {
```
...-server/hugegraph-core/src/main/java/org/apache/hugegraph/task/DistributedTaskScheduler.java (resolved)
...-server/hugegraph-core/src/main/java/org/apache/hugegraph/task/DistributedTaskScheduler.java (resolved)
hugegraph-server/hugegraph-api/src/main/java/org/apache/hugegraph/core/GraphManager.java (resolved)
...-server/hugegraph-core/src/main/java/org/apache/hugegraph/task/DistributedTaskScheduler.java (resolved)
hugegraph-server/hugegraph-api/src/main/java/org/apache/hugegraph/core/GraphManager.java (resolved)
```java
// Task not running locally, update status to CANCELLING
// for cronSchedule() or other nodes to handle
TaskStatus currentStatus = task.status();
if (!this.updateStatus(task.id(), currentStatus, TaskStatus.CANCELLING)) {
```
Using currentStatus for the CAS update has an obvious race: between the caller reading the task and executing cancel(), the task may transition from QUEUED to RUNNING, so updateStatus(..., currentStatus, CANCELLING) fails and the method returns directly, silently dropping the cancel request. Suggestion: when the CAS fails, reload the latest status and do one fallback update (move to CANCELLING as long as the task has not completed).
Suggested change:
```java
if (!this.updateStatus(task.id(), currentStatus, TaskStatus.CANCELLING)) {
    HugeTask<?> latest = this.taskWithoutResult(task.id());
    if (!latest.completed() && !latest.cancelling()) {
        this.updateStatus(task.id(), null, TaskStatus.CANCELLING);
        task.overwriteStatus(TaskStatus.CANCELLING);
    }
} else {
    task.overwriteStatus(TaskStatus.CANCELLING);
}
```
```java
    return task;
}
task.status(TaskStatus.QUEUED);
```
The task is set to QUEUED and submitted, but task.server is no longer written. Combined with restoreTasks() no longer filtering by server, in multi-instance deployments that share a backend (such as MySQL/PostgreSQL/Cassandra), every node may restore and execute the same batch of pending tasks after a restart, causing duplicate execution. Suggest keeping the owner binding, or at least allowing the restore phase to isolate tasks by owner.
Suggested change:
```java
task.status(TaskStatus.QUEUED);
task.server(this.serverManager().selfNodeId());
this.save(task);
return this.submitTask(task);
```
```properties
# task config
task.scheduler_type=local
task.schedule_period=10
```
task.scheduler_type is removed here (server2-conf), but server1-conf / server3-conf in the same directory still keep this option. Since this PR also removes CoreOptions.SCHEDULER_TYPE, the sample configurations become inconsistent and will produce redundant-config warnings at startup. Suggest cleaning up the other two docker config files as well to avoid misleading operators.
```java
// Process schema task
// Handle gremlin task
// Handle OLAP calculation tasks
```
In DistributedTaskScheduler.java:268-270:
```java
if (this.closed.get()) {
    return;
}
```
Issue: although the closed check was added, this.graph.started() and this.graph.closed() may already have changed by the time they are evaluated afterwards, which is a TOCTOU (time-of-check to time-of-use) problem.
Suggested change:
```java
if (this.closed.get() || !this.graph.started() || this.graph.closed()) {
    return;
}
```
Merging all of the state checks into a single if statement reduces the race window.
```java
} else {
    return this.deleteFromDB(id);
// Already in DELETING status, delete directly from DB
// Completed tasks can also be deleted directly
```
In DistributedTaskScheduler.java:355-366:
```java
if (!force && !task.completed()) {
    this.updateStatus(id, null, TaskStatus.DELETING);
    return null;
    // The code below can never be reached
}
```
Issues:
- The comment "Already in DELETING status" on line 360 sits after return null and can never be reached
- The logic is unclear: "incomplete tasks" and "tasks already in DELETING status" should be handled as distinct cases
Suggested change:
```java
HugeTask<?> task = this.taskWithoutResult(id);
if (!force && !task.completed()) {
    // Non-force mode: mark incomplete tasks as DELETING for async cleanup
    this.updateStatus(id, null, TaskStatus.DELETING);
    return null;
}
// Force mode OR task is completed OR already DELETING: delete directly
return this.deleteFromDB(id);
```
```java
} catch (ExecutionException | InterruptedException e) {
    LOG.warn("Exception while waiting for cron task to complete", e);
}
```
In DistributedTaskScheduler.java:399-410:
```java
cronFuture.get(schedulePeriod + 5, TimeUnit.SECONDS);
```
Issues:
- If the cron task times out, the subsequent taskDbExecutor cleanup may run while the cron task still holds a transaction
- The comment says "ensure all transactions are closed", but that cannot be guaranteed on timeout
- schedulePeriod can be large (e.g. 60 seconds), which makes shutdown wait too long
Suggested change:
```java
// Wait for cron task with reasonable timeout
long waitTime = Math.min(schedulePeriod + 5, 15); // Cap at 15 seconds
try {
    cronFuture.get(waitTime, TimeUnit.SECONDS);
} catch (CancellationException e) {
    LOG.debug("Cron task was cancelled");
} catch (TimeoutException e) {
    LOG.warn("Cron task did not complete in {}s, proceeding with shutdown", waitTime);
    // Force interrupt the cron task thread if possible
    cronFuture.cancel(true);
} catch (ExecutionException | InterruptedException e) {
    LOG.warn("Exception while waiting for cron task to complete", e);
}
```
```java
//todo: serverInfoManager section should be removed in the future.
return this.serverManager().close();
//return true;
```
🧹 The TODO comment should be tracked by an issue
In DistributedTaskScheduler.java:422-424:
```java
//todo: serverInfoManager section should be removed in the future.
return this.serverManager().close();
//return true;
```
Suggestion: link the TODO to a GitHub issue so it can be tracked:
```java
// TODO(#issue-number): Remove serverInfoManager completely after full single-node migration
return this.serverManager().close();
```
and delete the commented-out code //return true;.
```java
// for cronSchedule() or other nodes to handle
TaskStatus currentStatus = task.status();
if (!this.updateStatus(task.id(), currentStatus, TaskStatus.CANCELLING)) {
    LOG.info("Failed to cancel task '{}', status may have changed from {}",
```
In DistributedTaskScheduler.java:323-330:
```java
if (!this.updateStatus(task.id(), currentStatus, TaskStatus.CANCELLING)) {
    LOG.info("Failed to cancel task '{}', status may have changed from {}",
             task.id(), currentStatus);
} else {
    task.overwriteStatus(TaskStatus.CANCELLING);
}
```
Issues:
- When updateStatus returns false (the database update fails), the in-memory task object's status is not synchronized
- The caller may believe the cancel succeeded even though the status in the database did not change
- Error handling is missing: the latest status should be re-read, or an exception thrown
Suggested change:
```java
TaskStatus currentStatus = task.status();
if (!this.updateStatus(task.id(), currentStatus, TaskStatus.CANCELLING)) {
    // Status changed concurrently, reload from DB
    HugeTask<?> latestTask = this.taskWithoutResult(task.id());
    LOG.info("Failed to cancel task '{}': status changed from {} to {}",
             task.id(), currentStatus, latestTask.status());
    task.overwriteStatus(latestTask.status());
} else {
    task.overwriteStatus(TaskStatus.CANCELLING);
}
```
imbajin left a comment
Code Review Summary
I have completed a detailed code review of PR #2937. The key findings are below.
‼️ Core Logic Issues (High Priority)
1. Auto-generated server.id may break task recovery
Location: GraphManager.java:72-76
When server.id is empty, a random UUID is generated:
```java
server = "server-" + UUID.randomUUID().toString().substring(0, 8);
```
Issues:
- After a restart the server gets a new server.id, so tasks previously assigned to that node can no longer be recovered
- StandardTaskScheduler.restoreTasks() no longer filters with selfServer.equals(task.server()) and now restores all tasks, which is incompatible with distributed deployments
Suggestion: if multi-node scheduling is being removed for good, remove the server field from HugeTask as well; otherwise derive a stable server.id from a persisted identifier (see the sketch below).
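One possible shape of the "stable persisted identifier" alternative is sketched below. This is not code from the PR; the helper class, method name, and file-based storage are assumptions chosen purely for illustration. The idea is to generate the UUID once and cache it on disk so a restarted server keeps the same server.id.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

// Hypothetical helper: persists the generated id so a restarted server
// keeps the same server.id and can recover tasks assigned to it.
final class StableServerId {

    static String loadOrCreate(Path idFile) throws IOException {
        if (Files.exists(idFile)) {
            // Reuse the id written on a previous startup
            return Files.readString(idFile, StandardCharsets.UTF_8).trim();
        }
        // First startup: generate once, then persist for later restarts
        String id = "server-" + UUID.randomUUID().toString().substring(0, 8);
        Path parent = idFile.getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        Files.writeString(idFile, id, StandardCharsets.UTF_8);
        return id;
    }
}
```

The same effect could be achieved with any durable store the server already owns; the point is only that the identity must survive restarts if tasks remain bound to a server.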
2. Inappropriate exception type
Location: GraphManager.java:1131
The old code threw NotSupportException to signal a failed type cast; the new code throws NotFoundException, even though the graph may exist and only the type fails to match.
Suggestion: keep using NotSupportException, or use a clearer exception message stating that this is a type mismatch.
3. Scheduler-type selection logic is incomplete
Location: StandardHugeGraph.java:1177-1178
```java
return StandardHugeGraph.this.isHstore() ? "distributed" : "local";
```
Issues:
- The decision is based only on the backend type and does not consider multi-node RocksDB clusters coordinated via PD
- GraphManager.usePD() is already available and should be preferred for this decision
Suggestion: change it to return graphManager.usePD() ? "distributed" : "local"; (a hedged sketch follows).
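For clarity, a minimal hedged sketch of the usePD()-based selection suggested above; how graphManager (or an equivalent PD-availability flag) would be wired into StandardHugeGraph is assumed here and is not shown in this PR.

```java
// Hypothetical sketch: decide the scheduler type from "is this deployment
// coordinated by PD" rather than from "is the backend hstore", so that
// PD-coordinated multi-node clusters also get the distributed scheduler.
final class SchedulerTypeByPd {

    private final boolean pdCoordinated;

    SchedulerTypeByPd(boolean pdCoordinated) {
        this.pdCoordinated = pdCoordinated;
    }

    String schedulerType() {
        return this.pdCoordinated ? "distributed" : "local";
    }
}
```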
⚠️ Important Issues (Medium Priority)
4. Incorrect task-deletion logic
Location: DistributedTaskScheduler.java:355-366
The comment "Already in DELETING status" on line 360 sits after return null and can never be reached. The logic is unclear; incomplete tasks and tasks that are already in DELETING status should be handled as distinct cases.
5. Inconsistent task state after a failed status update
Location: DistributedTaskScheduler.cancel()
When updateStatus returns false (the database update fails), the in-memory task object's status is not synchronized. The latest status should be re-read, or an exception thrown.
6. Wait timeout may leak transactions
Location: DistributedTaskScheduler.close():399-410
If the cron task times out, the subsequent cleanup may run while the cron task still holds a transaction. schedulePeriod can be large, so the maximum wait time should be capped.
7. Relaxed test assertions may mask underlying problems
Location: TaskCoreTest.java:1290-1296
The test used to expect the status CANCELLING; it now accepts either CANCELLING or CANCELLED. This nondeterminism can make race conditions hard to debug.
🧹 Code Quality Suggestions (Low Priority)
8. Config injection may override user settings
Location: GraphManager.java:1105-1111
pdPeers should be validated before being injected:
```java
if (!config.containsKey("pd.peers")) {
    String pdPeers = this.conf.get(ServerOptions.PD_PEERS);
    if (StringUtils.isNotEmpty(pdPeers)) {
        config.addProperty("pd.peers", pdPeers);
    }
}
```
9. TODO comments should be tracked by issues
Link TODOs to GitHub issues so they can be tracked, and delete commented-out code.
Summary
The core change in this PR simplifies task scheduling from a multi-node model to a single-node model, but it carries the following key risks:
- Backward compatibility: a randomly generated server.id breaks task recovery after a restart
- Scheduler selection: it should be decided by the presence of PD rather than by the backend type
- State consistency: several status updates lack synchronization handling after failure
- Test coverage: assertions were relaxed and may mask concurrency problems
I suggest focusing on core-logic issues 1-3 before merging.
New Features
Server IDs now support automatic generation, eliminating the need for manual configuration.
Refactoring
Significantly simplified the task scheduling architecture by adopting a single-node scheduling path by default, removing multi-node scheduling and role-election–related controls.
Streamlined server information and scheduling management logic by eliminating redundant multi-node–related workflows.
Configuration Changes
Removed explicit scheduling type configuration; the scheduling mode is now determined by the runtime environment.
Bug Fixes
Adjusted the handling logic for task cancellation/deletion and missing tasks, improving exception handling and state management.
Tests
Updated test cases to accommodate the new scheduling behavior and serialization differences.