-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Open
Description
Describe the bug
onLeaderStop is invoked after the state machine task has executed.
2025-09-17 22:56:18.606 ERROR --- [ServerHandlerThread_1_44_500] [org.apache.seata.core.exception.AbstractExceptionHandler] [exceptionHandleTemplate] [10.133.113.213:8091:5278906257036056860]: Catch TransactionException while do RPC, request: GlobalBeginRequest{transactionName='default', timeout=300000}
==>
org.apache.seata.core.exception.TransactionException: seata raft state machine exception: Is not leader.
at org.apache.seata.server.storage.raft.session.RaftSessionManager.lambda$onBegin$0(RaftSessionManager.java:82)
at com.alipay.sofa.jraft.core.NodeImpl.lambda$executeApplyingTasks$3(NodeImpl.java:1378)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)
最早在这个时间点提交task到状态机时,旧leader已经被拒绝了,closure中的status isOk是false的
@Override
public void onBegin(GlobalSession globalSession) throws TransactionException {
globalSession.checkSize();
CompletableFuture<Boolean> completableFuture = new CompletableFuture<>();
Closure closure = status -> {
if (status.isOk()) {
try {
super.addGlobalSession(globalSession);
completableFuture.complete(true);
} catch (TransactionException e) {
completableFuture.completeExceptionally(e);
}
} else {
try {
completableFuture.completeExceptionally(new TransactionException(
TransactionExceptionCode.NotRaftLeader,
"seata raft state machine exception: " + status.getErrorMsg()));
} finally {
try {
removeGlobalSession(globalSession);
} catch (TransactionException e) {
completableFuture.completeExceptionally(e);
}
}
}
};
GlobalTransactionDTO globalTransactionDTO = new GlobalTransactionDTO();
SessionConverter.convertGlobalTransactionDO(globalTransactionDTO, globalSession);
RaftGlobalSessionSyncMsg raftSyncMsg = new RaftGlobalSessionSyncMsg(ADD_GLOBAL_SESSION, globalTransactionDTO);
RaftTaskUtil.createTask(closure, raftSyncMsg, completableFuture);
}
但在所有的日志中无法检索到onLeaderStop的输出
@Override
public void onLeaderStop(final Status status) {
this.leaderTerm.set(-1);
LOGGER.info("groupId: {}, onLeaderStop: status={}.", group, status);
}
@Override
public void onStopFollowing(final LeaderChangeContext ctx) {
LOGGER.info("groupId: {}, onStopFollowing: {}.", group, ctx);
}
@Override
public void onStartFollowing(final LeaderChangeContext ctx) {
LOGGER.info("groupId: {}, onStartFollowing: {}.", group, ctx);
this.currentTerm.set(ctx.getTerm());
CompletableFuture.runAsync(() -> syncCurrentNodeInfo(ctx.getLeaderId()), RESYNC_METADATA_POOL);
}
在56分37秒的时候日志发生了变化,也就是新leader选举成功了,所以closure为null,状态机中执行了反序列化,通过错误日志能很好分别,seata raft state machine exception: Is not leader.是在closure中执行的,说明当前还认为自己是leader,提交了task到状态机。而第二段异常明显堆栈里已经是状态机开始执行的,运行的线程也不相同。但是在这个期间并没有出现onLeaderStop,导致依靠leaderTerm进行判断的逻辑就出现问题,从而导致状态机中逻辑错乱,节点无法恢复,只能靠重启来进行回复。
2025-09-17 22:56:37.058 ERROR --- [ForkJoinPool.commonPool-worker-222] [org.apache.seata.server.coordinator.DefaultCoordinator] [lambda$handleAsyncCommitting$4] [10.133.113.213:8091:5278906257034548660]: Failed to async committing [10.133.113.213:8091:5278906257034548660] NotRaftLeader seata raft state machine exception: Is not leader.
==>
org.apache.seata.core.exception.TransactionException: seata raft state machine exception: Is not leader.
at org.apache.seata.server.storage.raft.session.RaftSessionManager.lambda$onStatusChange$2(RaftSessionManager.java:131)
at com.alipay.sofa.jraft.core.NodeImpl.lambda$executeApplyingTasks$3(NodeImpl.java:1378)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)
<==
2025-09-17 22:56:37.058 ERROR --- [JRaft-FSMCaller-Disruptor-0] [org.apache.seata.server.cluster.raft.RaftStateMachine] [onExecuteRaft] []: Message synchronization failure: Cannot invoke "org.apache.seata.server.cluster.raft.RaftServer.getNode()" because the return value of "org.apache.seata.server.cluster.raft.RaftServerManager.getRaftServer(String)" is null, msgType: ADD_BRANCH_SESSION
==>
java.lang.NullPointerException: Cannot invoke "org.apache.seata.server.cluster.raft.RaftServer.getNode()" because the return value of "org.apache.seata.server.cluster.raft.RaftServerManager.getRaftServer(String)" is null
at org.apache.seata.server.cluster.raft.util.RaftTaskUtil.createTask(RaftTaskUtil.java:51)
at org.apache.seata.server.storage.raft.lock.RaftLockManager.releaseLock(RaftLockManager.java:81)
at org.apache.seata.server.session.BranchSession.unlock(BranchSession.java:308)
at org.apache.seata.server.storage.file.lock.FileLocker.acquireLock(FileLocker.java:104)
at org.apache.seata.server.lock.AbstractLockManager.acquireLock(AbstractLockManager.java:65)
at org.apache.seata.server.session.BranchSession.lock(BranchSession.java:300)
at org.apache.seata.server.session.BranchSession.lock(BranchSession.java:295)
at org.apache.seata.server.cluster.raft.execute.branch.AddBranchSessionExecute.execute(AddBranchSessionExecute.java:49)
at org.apache.seata.server.cluster.raft.execute.branch.AddBranchSessionExecute.execute(AddBranchSessionExecute.java:31)
at org.apache.seata.server.cluster.raft.RaftStateMachine.onExecuteRaft(RaftStateMachine.java:333)
at org.apache.seata.server.cluster.raft.RaftStateMachine.onApply(RaftStateMachine.java:174)
at com.alipay.sofa.jraft.core.FSMCallerImpl.doApplyTasks(FSMCallerImpl.java:597)
at com.alipay.sofa.jraft.core.FSMCallerImpl.doCommitted(FSMCallerImpl.java:561)
at com.alipay.sofa.jraft.core.FSMCallerImpl.runApplyTask(FSMCallerImpl.java:467)
at com.alipay.sofa.jraft.core.FSMCallerImpl.access$100(FSMCallerImpl.java:73)
at com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:150)
at com.alipay.sofa.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:142)
at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:137)
at java.base/java.lang.Thread.run(Thread.java:1583)
<==
由于当时压测时产生了fullgc长达秒级,超过了选举的心态间隔。并且当时leader的cpu也已经被打满。
猜测原因是fullgc/cpu打满后,新leader产生了,在这个fullgc结束的间隔中,新leader接收了大量请求并提交至状态机,老leader fullgc完成后收到了大量任务在状态机中进行回放,但是可能由于onLeaderStop在这些task之后才被触发,导致了老leader依然认为自己是leader,从而导致状态机失败,节点直接异常无法恢复。
Expected behavior
Actual behavior
Steps to reproduce
Minimal yet complete reproducer code (or GitHub URL to code)
Environment
- SOFAJRaft version: 1.3.14
- JVM version (e.g.
java -version): jdk8 - OS version (e.g.
uname -a): linux - Maven version:
- IDE version:
Metadata
Metadata
Assignees
Labels
No labels