
HDDS-14646. SCM should not close Ratis pipelines on Finalize #9779

Open

sodonnel wants to merge 12 commits into apache:HDDS-14496-zdu from sodonnel:close-pipelines

Conversation

@sodonnel
Contributor

What changes were proposed in this pull request?

When SCM finalizes an upgrade, it should no longer close all the Ratis pipelines on the datanodes. Instead, they should be kept open. The SCM finalize command now needs to wait for all healthy datanodes to report matching SLV (software layout version) and MLV (metadata layout version) to know they have finalized. Only when all datanodes are finalized should SCM complete the finalization process.

This change mostly removes the existing close-pipeline code and anything else that depended on it. The only new code added is the wait for datanodes to report SLV == MLV, since previously the creation of new pipelines was the trigger to complete the process.
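
For context, here is a minimal sketch (not the PR's actual code) of the gate this describes: SCM treats finalization as complete only once every healthy datanode reports a metadata layout version equal to its software layout version. The class name and map layout below are hypothetical stand-ins for the reported state.

import java.util.Map;

// Illustrative sketch only. Keys are datanode IDs; values hold the reported
// layout versions: value[0] = metadata layout version (MLV),
// value[1] = software layout version (SLV).
final class FinalizeGateSketch {
  static boolean allHealthyDatanodesFinalized(Map<String, int[]> healthyNodeLayouts) {
    // A datanode counts as finalized once MLV == SLV.
    return healthyNodeLayouts.values().stream().allMatch(v -> v[0] == v[1]);
  }
}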

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-14646

How was this patch tested?

Existing tests, some of which had to be adapted slightly.

@github-actions bot added the zdu label (Pull requests for Zero Downtime Upgrade (ZDU), https://issues.apache.org/jira/browse/HDDS-14496) on Feb 17, 2026
"datanodes have finalized ({} remaining).",
finalizedNodes, totalHealthyNodes, unfinalizedNodes);
try {
Thread.sleep(5000);


We should decrease the sleep interval.

if (!allDatanodesFinalized) {
LOG.info("Waiting for datanodes to finalize. Status: {}/{} healthy " +
"datanodes have finalized ({} remaining).",
finalizedNodes, totalHealthyNodes, unfinalizedNodes);


It would be nice to log the IDs of the unhealthy nodes and the unfinalized nodes. It would also be nice to include how long we have been waiting.
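
A minimal sketch of how both suggestions above could look: a shorter poll interval, plus the IDs of unfinalized nodes and the elapsed wait time in the log message. The names used here (FinalizationWaitSketch, DatanodeStatus, POLL_INTERVAL_MS) are hypothetical, not the PR's actual code.

import java.util.Set;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;
import org.slf4j.Logger;

// Illustrative sketch only: poll more often than the PR's 5000 ms, and log
// which healthy datanodes are still unfinalized and how long we have waited.
final class FinalizationWaitSketch {
  private static final long POLL_INTERVAL_MS = 1000;

  interface DatanodeStatus {
    int totalHealthy();
    Set<String> unfinalizedIds(); // healthy datanodes that have not finalized yet
  }

  static void waitForDatanodes(Supplier<DatanodeStatus> pollStatus, Logger log)
      throws InterruptedException {
    long start = System.nanoTime();
    DatanodeStatus status = pollStatus.get();
    while (!status.unfinalizedIds().isEmpty()) {
      long waitedSec = TimeUnit.NANOSECONDS.toSeconds(System.nanoTime() - start);
      log.info("Waiting for datanodes to finalize for {}s. Status: {}/{} healthy "
          + "datanodes have finalized. Unfinalized: {}",
          waitedSec,
          status.totalHealthy() - status.unfinalizedIds().size(),
          status.totalHealthy(),
          status.unfinalizedIds());
      Thread.sleep(POLL_INTERVAL_MS);
      status = pollStatus.get();
    }
  }
}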

@sodonnel
Contributor Author

sodonnel commented Feb 18, 2026

The integration tests seem to fail with:

2026-02-18 10:03:04,129 [main] ERROR volume.MutableVolumeSet (MutableVolumeSet.java:initializeVolumeSet(193)) - Failed to parse the storage location: file:///home/runner/work/ozone/ozone/hadoop-hdds/container-service/target/tmp/dfs/data
org.apache.hadoop.ozone.common.InconsistentStorageStateException: Mismatched DatanodeUUIDs. Version File : /home/runner/work/ozone/ozone/hadoop-hdds/container-service/target/tmp/dfs/data/hdds/VERSION has datanodeUuid: 37fe7285-51fc-49f0-8ced-b4814363d903 and Datanode has datanodeUuid: f2bcb13b-daa7-4306-9fad-7488b660dff6
	at org.apache.hadoop.ozone.container.common.utils.StorageVolumeUtil.getDatanodeUUID(StorageVolumeUtil.java:122)
	at org.apache.hadoop.ozone.container.common.volume.StorageVolume.readVersionFile(StorageVolume.java:381)
	at org.apache.hadoop.ozone.container.common.volume.StorageVolume.initializeImpl(StorageVolume.java:239)
	at org.apache.hadoop.ozone.container.common.volume.StorageVolume.initialize(StorageVolume.java:214)
	at org.apache.hadoop.ozone.container.common.volume.HddsVolume.<init>(HddsVolume.java:155)
	at org.apache.hadoop.ozone.container.common.volume.HddsVolume.<init>(HddsVolume.java:82)

This results in no storage locations being available on the DN. This was supposed to be fixed by HDDS-14632, which is on the branch, but it's still failing. Passes locally.

EDIT: the failure was from an old run. The latest doesn't fail with this error!

@adoroszlai
Contributor

This was supposed to be fixed by HDDS-14632, which is on the branch, but it's still failing.

It failed for this commit:

HEAD is now at 246e9a4 Merge 9acf575a86b50ff3c59360f1d2d539ec266de1f5 into af22a5c3a435f771c17307d325d7e80eae7361ec

where af22a5c was the target branch state, from two weeks ago.

It passes after feature branch HDDS-14496-zdu was updated to current master.

@sodonnel requested a review from errose28 on February 18, 2026 at 19:21
Contributor

@errose28 left a comment


Thanks for working on this @sodonnel. I assume we are planning to remove the healthy readonly state and finalization checkpoints in follow up changes. Can you file those Jiras to clarify the scope of this PR?

logAndEmit(msg);
throw new UpgradeException(msg, PREFINALIZE_VALIDATION_FAILED);
}
public void preFinalizeUpgrade(DatanodeStateMachine dsm) {
Contributor


We can remove this override; it is the same as the parent.

// outdated layout information.
// This operation is not idempotent.
if (checkpoint == FinalizationCheckpoint.MLV_EQUALS_SLV) {
upgradeContext.getNodeManager().forceNodesToHealthyReadOnly();
Contributor


We can actually delete forceNodesToHealthyReadOnly since this was the only caller outside of tests.


void onLeaderReady();

static boolean shouldCreateNewPipelines(FinalizationCheckpoint checkpoint) {
Contributor


crossedCheckpoint can also be removed from this interface since it is now unused.

* @throws SCMException if waiting is interrupted or SCM loses leadership
* @throws NotLeaderException if SCM is no longer the leader
*/
private void waitForDatanodesToFinalize(SCMUpgradeFinalizationContext context)
Contributor


I don't think we need to block on the server side until all datanodes finalize. Instead, I think this logic should be async in the heartbeat handling. When a pre-finalized datanode heartbeats to a finalized SCM, it should be instructed to finalize in the heartbeat response. This way we don't need any dedicated threads with interrupts/resumes on leader changes.

When the client instructs SCM to finalize, it can get a response when SCM has finished finalizing even if the DNs have not. For now we can use the existing finalization complete status in the response, but in follow-up changes we will change how the status API works to indicate each component's finalization status individually.
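
A rough sketch of the heartbeat-driven approach suggested here, assuming the blocking wait is replaced entirely. The types and method below are illustrative stand-ins, not Ozone's real heartbeat API.

// Illustrative sketch only. When SCM has finalized but a heartbeating
// datanode still reports MLV < SLV, the handler attaches a finalize command
// to the heartbeat response instead of blocking a server thread.
final class HeartbeatFinalizationSketch {

  static final class LayoutReport {
    final int metadataLayoutVersion; // MLV reported by the datanode
    final int softwareLayoutVersion; // SLV reported by the datanode
    LayoutReport(int mlv, int slv) {
      this.metadataLayoutVersion = mlv;
      this.softwareLayoutVersion = slv;
    }
  }

  enum UpgradeCommand { NONE, FINALIZE }

  // Decide, per heartbeat, whether to tell this datanode to finalize.
  static UpgradeCommand commandFor(boolean scmFinalized, LayoutReport report) {
    boolean datanodePreFinalized =
        report.metadataLayoutVersion < report.softwareLayoutVersion;
    return (scmFinalized && datanodePreFinalized)
        ? UpgradeCommand.FINALIZE : UpgradeCommand.NONE;
  }
}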

Contributor Author


I have created HDDS-14669 to make this change in a follow-up, as I think changing it will impact tests etc., so it is better to do it as its own change.

LOG.info("testPostUpgradeConditionsSCM: container state is {}",
ciState.name());
assertTrue((ciState == HddsProtos.LifeCycleState.CLOSED) ||
// Containers can now be in any state since we no longer close pipelines
Contributor


Can we remove this test on container states completely? The new upgrade flow should be independent of any container states. There are a few other places in this test where I think we can remove similar checks.

@sodonnel
Contributor Author

Thanks for working on this @sodonnel. I assume we are planning to remove the healthy readonly state and finalization checkpoints in follow up changes. Can you file those Jiras to clarify the scope of this PR?

I have filed the following to continue this work:

HDDS-14669 Waiting in finalization should not block a server thread
HDDS-14670 SCM should not finalize unless it is out of
HDDS-14671 Remove healthy_readonly state from SCM
HDDS-14672 Remove finalization checkpoints
