Docs: add disaster recovery guides for partial and complete failure by moizpgedge · Pull Request #268 · pgEdge/control-plane

moizpgedge · 2026-02-17T15:46:35Z

Summary

Adds disaster recovery documentation: an index plus two guides—Partial Failure Recovery (Quorum Intact) and Complete Failure Recovery (No Quorum)—so operators can recover the Control Plane when hosts or quorum are lost.

Changes

docs/disaster-recovery/index.md — New index that routes to the right guide and warns about backing up the Control Plane data volume.
docs/disaster-recovery/partial-recovery.md — Guide for when etcd and Swarm quorum are intact but one or more hosts are lost: force-remove host via API, optional DB spec updates, Swarm cleanup, verification, then restore (Path A: host still accessible; Path B: provision new host), rejoin cluster, and restore database capacity.
docs/disaster-recovery/full-recovery.md — Guide for when etcd and/or Swarm quorum is lost: Phase 1 (1A total quorum loss, 1B majority quorum loss, 1C etcd + Swarm loss) with snapshot restore, Swarm re-init, registry rebuild, and ForceNewCluster where needed; Phase 2 remove dead hosts and clean DBs; Phases 3–5 rejoin/provision hosts, join cluster, restore DB capacity, and final verification. Includes prerequisites, variable setup, and common issues/troubleshooting.

Testing

Documentation only; no code or test changes.
Manual review: follow steps in a test environment (e.g. Lima fixture) for both partial and full-recovery flows if desired.

Checklist

Tests added or updated (unit and/or e2e, as needed)
Documentation updated (if needed)
Issue is linked (branch name or URL in PR description)
Changelog entry added for user-facing behavior changes (changes/unreleased/Added-disaster-recovery-docs.yaml)
Breaking changes (if any) are clearly called out in the PR description

Notes for Reviewers

Branch: docs/PLAT-313-314-etcd-swarm-disaster-recovery — PLAT-313 and PLAT-314
Partial guide assumes API is up; full guide assumes API is down until Phase 1 is done. Cross-links between them (e.g. partial-recovery → full-recovery for quorum loss) are already in place.
Full-recovery Phase 1C (etcd + Swarm quorum loss) is the most involved: Swarm re-init, dead node removal, registry recreate, image build/push, and PGEDGE_ETCD_SERVER__FORCE_NEW_CLUSTER=true; worth a careful read.

ChangeLog

Entry added in changes/unreleased/Added-disaster-recovery-docs.yaml for the disaster recovery documentation.

coderabbitai · 2026-02-17T15:46:52Z

📝 Walkthrough

Walkthrough

Adds comprehensive disaster recovery documentation for the Control Plane: an index plus two guides (partial recovery with quorum intact and complete recovery with no quorum). Documents phased workflows, procedures for etcd and Docker Swarm recovery, prerequisites, verification, and troubleshooting. No code or public API changes.

Changes

Cohort / File(s)	Summary
Disaster Recovery Guides `docs/disaster-recovery/index.md`, `docs/disaster-recovery/partial-recovery.md`, `docs/disaster-recovery/full-recovery.md`	Introduces three new docs: an index listing recovery scenarios; a partial-recovery guide for quorum-intact host loss with phase-based procedures, CLI/API examples, verification, and troubleshooting; and a full-recovery guide for no-quorum scenarios covering etcd snapshot restores, Swarm reinitialization, multi-branch phase workflows, and detailed commands.
Changelog Entry `changes/unreleased/Added-disaster-recovery-docs.yaml`	Adds a release-note YAML recording the addition of the disaster recovery documentation with timestamp and description.

Poem

🐇 I nudge the burrow, maps unfurled,
Phases five to mend the world,
Etcd snaps and Swarm reborn,
Hosts return at break of morn,
Hop — follow steps, the cluster's healed.

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'Docs: add disaster recovery guides for partial and complete failure' clearly summarizes the main change—addition of comprehensive disaster recovery documentation.
Description check	✅ Passed	The PR description is comprehensive and follows the template structure with all key sections: Summary, Changes, Testing, Checklist, and Notes for Reviewers. All critical items are addressed.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch docs/PLAT-313-314-etcd-swarm-disaster-recovery

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (4)

docs/disaster-recovery/full-recovery.md (2)
288-295: Add a language specifier to the fenced code block.

Same as in partial-recovery.md — the docker node ls example output block should use ```text or ```console.
Proposed fix
-```
+```text
 ID                            HOSTNAME      STATUS    AVAILABILITY   MANAGER STATUS
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/disaster-recovery/full-recovery.md` around lines 288 - 295, Update the
fenced code block showing the docker node ls example output to include a
language specifier (e.g., use ```text or ```console instead of plain ```) so
syntax-aware renderers treat it as plain text; locate the fenced block
containing the sample output for "docker node ls" (the block starting with the
ID/HOSTNAME/STATUS header) and replace the opening fence with ```text (or
```console) accordingly.
204-210: Phase 1B backup step only preserves etcd directory, unlike Phase 1A.

Phase 1A Step 1A.3 backs up etcd data, certificates, and generated.config.json. Phase 1B Step 1B.1 only backs up the etcd directory. Since the surviving host in 1B still has valid certificates and config, this is likely intentional — but a brief note explaining the difference would help operators avoid second-guessing during a high-stress recovery.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/disaster-recovery/full-recovery.md` around lines 204 - 210, Phase 1B
Step 1B.1 only backs up the etcd directory whereas Phase 1A Step 1A.3 also backs
up certificates and generated.config.json; add a brief explanatory note under
"Step 1B.1: Backup Existing etcd Data" clarifying that in Phase 1B the surviving
host is expected to already have valid certificates and configuration so only
the etcd data is preserved, and reference Step 1A.3 for the full backup behavior
to avoid operator confusion.
docs/disaster-recovery/partial-recovery.md (1)
134-140: Add a language specifier to the fenced code block.

The example docker node ls output block on Line 135 lacks a language identifier. Use ```text or ```console for plain output blocks to satisfy the MD040 lint rule and improve rendering consistency.
Proposed fix
-```
+```text
 ID                            HOSTNAME      STATUS    AVAILABILITY   MANAGER STATUS
 4aoqjp3q8jcny4kec5nadcn6x *   lima-host-1   Ready     Active         Leader
 959g9937i62judknmr40kcw9r     lima-host-2   Ready     Active         Reachable
 l0l51d890edg3f0ccd0xppw06     lima-host-3   Down      Active         Unreachable
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>
Verify each finding against the current code and only fix it if needed.

In @docs/disaster-recovery/partial-recovery.md around lines 134 - 140, The
fenced code block that shows the docker node ls output (the block starting with
"ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS" in the partial-recovery
example) is missing a language specifier and triggers MD040; update that block
by adding a plain-output language tag such as text or console immediately
after the opening backticks so the snippet becomes a labelled code fence (e.g.,
change totext) to satisfy linting and rendering.
</details>

</blockquote></details>
<details>
<summary>mkdocs.yml (1)</summary><blockquote>

`67-69`: **Consider adding `disaster-recovery/index.md` as the section index page.**

Other nav sections (e.g., "Prerequisites", "Installing Control Plane") include their `index.md` as the first entry, making the section header clickable. Currently, "Disaster Recovery" is just a label and `index.md` isn't reachable from the nav, even though it exists and contains useful context (the backup warning).

<details>
<summary>Proposed fix</summary>

```diff
   - Disaster Recovery:
+      - disaster-recovery/index.md
       - Partial Failure Recovery (Quorum Intact): disaster-recovery/partial-recovery.md
       - Complete Failure Recovery (No Quorum): disaster-recovery/full-recovery.md
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@mkdocs.yml` around lines 67 - 69, The "Disaster Recovery" nav entry should
include its section index so the header becomes clickable; update the mkdocs.yml
nav under the "Disaster Recovery" section to insert "disaster-recovery/index.md"
as the first entry before "disaster-recovery/partial-recovery.md" and
"disaster-recovery/full-recovery.md" (so the section reads index.md,
partial-recovery.md, full-recovery.md).
```

</details>

</blockquote></details>

</blockquote></details>

<details>
<summary>🤖 Prompt for all review comments with AI agents</summary>
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @docs/disaster-recovery/full-recovery.md:

Around line 385-397: The docs show a port mismatch: the registry is published
on port 5000 but the control plane service sets
PGEDGE_DOCKER_SWARM__IMAGE_REPOSITORY_HOST to 127.0.0.1:5001/pgedge; update the
environment variable in the control-plane service creation
(PGEDGE_DOCKER_SWARM__IMAGE_REPOSITORY_HOST in the
control-plane_${RECOVERY_HOST_ID} service) to 127.0.0.1:5000/pgedge to match the
registry, or explicitly document why 5001 is used (e.g., mirror/proxy) if that
is intentional.

In @docs/disaster-recovery/partial-recovery.md:

Around line 76-82: The doc currently shows monitoring the host-removal task
via /v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID>, but since the host was
force-removed this endpoint may be unavailable; update the instructions to use a
cluster-level or recovery-host-scoped task endpoint (e.g., /v1/tasks/<TASK_ID>
or /v1/hosts/<RECOVERY_HOST_ID>/tasks/<TASK_ID>) or explicitly state that
/v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID> remains accessible after
deletion—replace the existing reference to
/v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID> accordingly and mention the alternative
endpoints.

Nitpick comments:
In @docs/disaster-recovery/full-recovery.md:

Around line 288-295: Update the fenced code block showing the docker node ls
example output to include a language specifier (e.g., use text or console
instead of plain ) so syntax-aware renderers treat it as plain text; locate the fenced block containing the sample output for "docker node ls" (the block starting with the ID/HOSTNAME/STATUS header) and replace the opening fence with text (or ```console) accordingly.

Around line 204-210: Phase 1B Step 1B.1 only backs up the etcd directory
whereas Phase 1A Step 1A.3 also backs up certificates and generated.config.json;
add a brief explanatory note under "Step 1B.1: Backup Existing etcd Data"
clarifying that in Phase 1B the surviving host is expected to already have valid
certificates and configuration so only the etcd data is preserved, and reference
Step 1A.3 for the full backup behavior to avoid operator confusion.

In @docs/disaster-recovery/partial-recovery.md:

Around line 134-140: The fenced code block that shows the docker node ls
output (the block starting with "ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS"
in the partial-recovery example) is missing a language specifier and triggers
MD040; update that block by adding a plain-output language tag such as text or console immediately after the opening backticks so the snippet becomes a
labelled code fence (e.g., change totext) to satisfy linting and
rendering.

In @mkdocs.yml:

Around line 67-69: The "Disaster Recovery" nav entry should include its
section index so the header becomes clickable; update the mkdocs.yml nav under
the "Disaster Recovery" section to insert "disaster-recovery/index.md" as the
first entry before "disaster-recovery/partial-recovery.md" and
"disaster-recovery/full-recovery.md" (so the section reads index.md,
partial-recovery.md, full-recovery.md).
</details>

docs/disaster-recovery/full-recovery.md

coderabbitai · 2026-02-17T15:50:41Z

docs/disaster-recovery/partial-recovery.md

⚠️ Potential issue | 🟡 Minor

Task monitoring endpoint references the removed host.

Line 78 monitors the host removal task via /v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID>, but the host was just force-removed in Step 1.1. This endpoint may not be available after deletion. Consider using a cluster-level or recovery-host-scoped task endpoint instead, or clarify the expected behavior (e.g., whether the task endpoint remains accessible for a removed host).

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/disaster-recovery/partial-recovery.md` around lines 76 - 82, The doc currently shows monitoring the host-removal task via /v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID>, but since the host was force-removed this endpoint may be unavailable; update the instructions to use a cluster-level or recovery-host-scoped task endpoint (e.g., /v1/tasks/<TASK_ID> or /v1/hosts/<RECOVERY_HOST_ID>/tasks/<TASK_ID>) or explicitly state that /v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID> remains accessible after deletion—replace the existing reference to /v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID> accordingly and mention the alternative endpoints.

jason-lynch

This is getting closer! There are still a lot of repeated steps, and a few places where we're doing the same things in different ways. I think we can do more to deduplicate and simplify these documents. Could this just be one document?:

Disaster recovery
- Restoring Docker Swarm
  - Reinitializing the Swarm
  - Joining hosts to the new Swarm
  - Removing old Swarm nodes
- Restoring the Control Plane
  - Reinitializing the Control Plane Etcd cluster
  - Updating databases to remove old hosts
  - Removing old hosts
  - Re-adding hosts
  - Updating databases to re-add hosts

jason-lynch · 2026-02-17T16:57:20Z

mkdocs.yml

Could you please remove this for now? I think @mmols had suggested that we go through QA and documentation review before we publish this to the site.

jason-lynch · 2026-02-17T17:01:03Z

docs/disaster-recovery/full-recovery.md

"Registry recreation" and "image rebuild" are exclusive to our development environment, so we should not document them in our user-facing documentation.

jason-lynch · 2026-02-17T17:08:11Z

docs/disaster-recovery/full-recovery.md

What are the conditions where it becomes necessary to do this?

jason-lynch · 2026-02-17T17:20:21Z

docs/disaster-recovery/full-recovery.md

We can't write specific instructions for this step because it's very dependent on the user's environment. You can simulate it locally using cp or Lima's snapshot tool, but in a real environment it will look quite different. Here are some examples of different procedures that users might follow:

https://docs.aws.amazon.com/prescriptive-guidance/latest/backup-recovery/restore.html

https://techdocs.broadcom.com/us/en/vmware-cis/vsphere/container-storage-plugin/3-0/getting-started-with-vmware-vsphere-container-storage-plug-in-3-0/using-vsphere-container-storage-plug-in/volume-snapshot-and-restore.html

https://learn.microsoft.com/en-us/azure/backup/backup-azure-arm-restore-vms

https://docs.cloud.google.com/compute/docs/disks/restore-snapshot

We can't document every possible way of restoring a volume or an instance from a backup, so I wouldn't get more specific than this first sentence.

jason-lynch · 2026-02-17T17:22:51Z

docs/disaster-recovery/full-recovery.md

This will look better split onto two lines:

Suggested change

curl -L https://github.com/etcd-io/etcd/releases/download/v3.6.5/etcd-v3.6.5-linux-${ARCH}.tar.gz | tar --strip-components 1 -xz -C /tmp etcd-v3.6.5-linux-${ARCH}/etcdutl

curl -L https://github.com/etcd-io/etcd/releases/download/v3.6.5/etcd-v3.6.5-linux-${ARCH}.tar.gz \

| tar --strip-components 1 -xz -C /tmp etcd-v3.6.5-linux-${ARCH}/etcdutl

jason-lynch · 2026-02-17T17:59:51Z

docs/disaster-recovery/full-recovery.md

This is not important. The order should not matter.

jason-lynch · 2026-02-17T18:00:54Z

docs/disaster-recovery/full-recovery.md

Healthy client-mode hosts do not need to be removed at all, so users should never be in this position.

jason-lynch · 2026-02-17T18:05:53Z

docs/disaster-recovery/full-recovery.md

Please remove these Lima-specific instructions. This is something that's specific to our development environment, so these instructions are not meaningful to users,

jason-lynch · 2026-02-17T18:09:09Z

docs/disaster-recovery/full-recovery.md

+If Swarm no longer has the service definition:
+
+```bash
+docker stack deploy -c <path-to-stack-yaml> control-plane


This is the first time we mention the stack definition in this document. The stack definition should be listed as a pre-requisite so that users know to prepare it ahead of time.

Also - keep in mind that doing this deploy will recreate and scale up all services in the stack. I would just do this one time after I've gotten all of my hosts ready to deploy again.

jason-lynch · 2026-02-17T18:12:14Z

docs/disaster-recovery/partial-recovery.md

+
+---
+
+## Phase 1: Remove the Failed Host


Let's be consistent with the full recovery document and say that users should update the database to remove the lost host first, and then remove the host without the force parameter. As I mentioned in a comment on the other doc, I'm leaning towards removing the force parameter altogether from that remove host endpoint because it can lead to problems.

coderabbitai

Actionable comments posted: 6

🧹 Nitpick comments (2)

docs/disaster-recovery/full-recovery.md (2)

272-277: Fenced code block at line 272 is missing a language specifier (MD040).

♻️ Proposed fix

-```
+```text
 ID                            HOSTNAME      STATUS    AVAILABILITY   MANAGER STATUS
 ...

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@docs/disaster-recovery/full-recovery.md` around lines 272 - 277, The fenced
code block starting with triple-backticks that contains the table header "ID    
HOSTNAME      STATUS    AVAILABILITY   MANAGER STATUS" is missing a language
specifier; update the opening fence to include a language (e.g., change ``` to
```text or ```console) so the block complies with MD040 and renders/linters
correctly, leaving the block contents unchanged.

77-80: Indented code blocks inside admonitions should be fenced (MD046).

markdownlint flags five locations (lines 79, 151, 184, 254, 551) where 4-space indented blocks are used inside !!! admonitions. These may not render as code blocks in all Markdown renderers; fenced blocks are the reliable and consistent approach.

Example fix for the pattern (apply similarly at all five locations):

♻️ Proposed fix pattern

 !!! important "Reset Cluster Membership for Multi-Node Clusters"
 
-    If your cluster previously had more than one node, you **must** use `etcdutl snapshot restore` to reset cluster membership. Simply copying the etcd directory will not work.
+    If your cluster previously had more than one node, you **must** use `etcdutl snapshot restore`
+    to reset cluster membership. Simply copying the etcd directory will not work.

For any nested code snippets inside the admonition, use a fenced block indented with 4 spaces:

 !!! note
 
-    If Docker Swarm is still functional (only etcd lost quorum), use [Phase 1B](`#phase-1b-majority-quorum-loss`) instead.
+    If Docker Swarm is still functional (only etcd lost quorum), use [Phase 1B](`#phase-1b-majority-quorum-loss`) instead.

Also applies to: 149-152, 182-185, 252-255, 549-552

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@docs/disaster-recovery/full-recovery.md` around lines 77 - 80, Replace the
4-space indented code blocks inside the "!!! important" admonitions with fenced
code blocks so they render consistently; locate the admonition starting with
'!!! important "Reset Cluster Membership for Multi-Node Clusters"' and the other
admonitions at the flagged locations (around lines referenced in the review: 79,
151, 184, 254, 551) and change any nested indented blocks to triple-backtick
fenced blocks (```), preserving language hints if present and maintaining the
same indentation level for the admonition content.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/disaster-recovery/full-recovery.md`:
- Around line 200-213: Phase 1B currently contradicts itself: "Option A"
instructs running etcdctl snapshot save against a live endpoint (the etcdctl
snapshot save command) even though Phase 1B assumes majority quorum loss and
etcd is inaccessible; either delete the entire "Option A: Create Snapshot from
Existing Data" subsection or rewrite it to clearly define a distinct
sub-scenario (e.g., "Partial Quorum: single surviving node with etcd running")
that is explicitly separate from the majority-quorum-lost path, and in that
rewrite explain preconditions, when to run the etcdctl snapshot save command,
and add a cross-reference to "Option B" (pre-existing snapshot) for the true
quorum-loss recovery path so readers are not misled.
- Line 51: The document contains literal draft comments using C-style "// To Do"
text (e.g. "// TO DO link the installation.md file section here", "// To Do A
snapshot file is optional...", "// To Do The title of this section, "Majority
Quorum Loss"...", "// To Do a snapshot is optional...", "// To Do use etcdutl
snapshot restore", "// To Do we will keep one section in final doc") which
render verbatim and trigger MD046; remove or resolve each by either deleting
them, converting them to proper HTML comments (<!-- TODO: ... -->) or replacing
them with real content or a Markdown TODO/task list or an actual link (e.g.
install section link) so the draft notes no longer appear in the published
output and the lint warning is eliminated.
- Around line 315-346: Phase 1C is incomplete: add concrete Step 1C.4 and 1C.5
and replace the Step 1C.6 TODO with the etcd ForceNewCluster restore steps, then
update the dangling cross-reference and summary table. Specifically, create Step
1C.4 "Recreate container registry on surviving host" that documents rebuilding
or restoring the registry (recreate registry container, load required
images/tags, push images back into registry) and reference the surviving
host/node names (e.g., host-2/n2); add Step 1C.5 "Prepare etcd snapshot and
prerequisites" describing how to fetch/transfer the snapshot, validate it, stop
etcd on recovery nodes and prepare data dirs; replace Step 1C.6 with a full
"ForceNewCluster etcdutl snapshot restore" sequence (use etcdutl snapshot
restore with --force-new-cluster and correct --name/--initial-cluster settings,
restore to expected data-dir and systemd restart steps) and ensure Step 1C.7
remains the verification curl. Finally, update the Common Issues cross-reference
that mentions "Phase 1C, Step 1C.4" and the Summary table range/actions to
include the rebuilt registry and ForceNewCluster restore so documentation is
consistent.
- Around line 367-369: Update the Step 2.1 curl example so the URL uses the same
port variable as the rest of the document: replace the hardcoded ":3000" in the
curl command URL (the line starting with curl -X POST
"http://${RECOVERY_HOST_IP}:3000/...) with the template variable ":${API_PORT}"
to match other API calls and maintain consistency.
- Around line 37-42: The Set Variables block is missing RECOVERY_HOST_ID and
API_PORT which are referenced throughout the guide; update the initial variables
list to include RECOVERY_HOST_ID (the identifier/name of the recovery host used
in commands and SSH/etcd membership steps) and API_PORT (the control-plane API
port used by the recovery host), and ensure the examples show sensible
placeholders like RECOVERY_HOST_ID="<recovery-host-id>" and API_PORT=<api-port>
so all subsequent snippets (references to RECOVERY_HOST_ID and API_PORT) work
without reader guesswork; verify mentions in functions/sections that reference
RECOVERY_HOST_ID and API_PORT still match the variable names exactly.
- Around line 130-133: Remove the trailing space after the backslash in the curl
line so the shell line continuation works (ensure the backslash is the last
character before the newline in the "curl -L
https://github.com/etcd-io/etcd/releases/download/v3.6.5/etcd-v3.6.5-linux-${ARCH}.tar.gz
\ | tar --strip-components 1 -xz -C /tmp etcd-v3.6.5-linux-${ARCH}/etcdutl"
snippet) or collapse into a single line; also update the hardcoded release tag
from v3.6.5 to v3.6.8 or replace it with a VERSION variable so the etcd download
URL and the etcd-v... names stay consistent and current, and verify the
subsequent sudo mv/ chmod lines still reference the same etcdutl filename.

---

Nitpick comments:
In `@docs/disaster-recovery/full-recovery.md`:
- Around line 272-277: The fenced code block starting with triple-backticks that
contains the table header "ID                            HOSTNAME      STATUS   
AVAILABILITY   MANAGER STATUS" is missing a language specifier; update the
opening fence to include a language (e.g., change ``` to ```text or ```console)
so the block complies with MD040 and renders/linters correctly, leaving the
block contents unchanged.
- Around line 77-80: Replace the 4-space indented code blocks inside the "!!!
important" admonitions with fenced code blocks so they render consistently;
locate the admonition starting with '!!! important "Reset Cluster Membership for
Multi-Node Clusters"' and the other admonitions at the flagged locations (around
lines referenced in the review: 79, 151, 184, 254, 551) and change any nested
indented blocks to triple-backtick fenced blocks (```), preserving language
hints if present and maintaining the same indentation level for the admonition
content.

docs/disaster-recovery/full-recovery.md

coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (2)

docs/disaster-recovery/full-recovery.md (2)
330-332: Step 1C.7 lacks concrete commands for registry recreation — the hardest step in Phase 1C.

This is the only recovery path for operators who have lost both etcd and Swarm quorum. The step instructs them to "recreate the registry and build/push the Control Plane image" but provides no commands, and the ARCHIVE_VERSION variable defined in Set Variables (line 50) is clearly intended for exactly these commands. An operator in a production incident cannot act on prose alone.

Consider adding a code block with the docker registry/image commands (e.g., docker service create for the registry, docker pull/docker tag/docker push using ${ARCHIVE_VERSION}), or at minimum link to a dedicated registry setup reference.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/disaster-recovery/full-recovery.md` around lines 330 - 332, Step 1C.7 is
missing the concrete commands to recreate a container registry and push the
Control Plane image using the ARCHIVE_VERSION variable; update the prose to
include an explicit sequence showing how to create the registry service and then
pull/tag/push the Control Plane image using ${ARCHIVE_VERSION} (or, if you
prefer not to inline commands, add a clear link to a registry setup reference),
and ensure you reference the ARCHIVE_VERSION variable defined earlier so
operators can copy-paste the exact commands for "recreate registry" and
"build/push Control Plane image" before starting the Control Plane service in
Step 1C.7.
461-469: Step 3A.2 doesn't distinguish server-mode from client-mode hosts, unlike the partial-recovery guide.

partial-recovery.md Step 3A.2 has separate blocks for server-mode (clear etcd + certificates + generated.config.json) and client-mode (clear only generated.config.json). This guide applies the server-mode commands to all hosts unconditionally. While rm -rf on absent directories is harmless, aligning with the partial-recovery guide's pattern avoids operator confusion when working across both docs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/disaster-recovery/full-recovery.md` around lines 461 - 469, Step 3A.2
currently applies the server-mode cleanup (removing ${PGEDGE_DATA_DIR}/etcd,
/certificates, and generated.config.json) to all hosts; update this section to
mirror partial-recovery.md by distinguishing server-mode vs client-mode hosts:
for server-mode instruct removal of etcd, certificates, and
generated.config.json, and for client-mode instruct removal of only
generated.config.json (use the same labels "server-mode" and "client-mode" and
reference the PGEDGE_DATA_DIR variable and filenames to make the distinction
clear).

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/disaster-recovery/full-recovery.md`:
- Around line 48-51: The variable ARCHIVE_VERSION is declared but unused; either
remove the ARCHIVE_VERSION="<control-plane-version>" line from the env block or
add the registry-rebuild commands that consume it (e.g., the
registry-rebuild/registry restore commands referenced in Step 1C.7) so the guide
actually uses ${ARCHIVE_VERSION}; update the examples and any subsequent
commands to reference ARCHIVE_VERSION in place of a hard-coded version to make
the variable meaningful.
- Around line 256-261: The fenced code block showing the `docker node ls` output
is missing a language specifier which triggers MD040; update the triple-backtick
fence for that block in full-recovery.md to include a language tag such as text
(e.g., change ``` to ```text) so the block is explicitly marked and linting
passes; locate the block containing the `ID HOSTNAME STATUS AVAILABILITY MANAGER
STATUS` table and add the language specifier to the opening fence.

In `@docs/disaster-recovery/partial-recovery.md`:
- Around line 105-110: Add a language specifier to the fenced code block
containing the `docker node ls` example (the block showing ID, HOSTNAME, STATUS,
AVAILABILITY, MANAGER STATUS) to silence MD040 and improve rendering; update the
opening fence from ``` to ```text (or ```console) so the block becomes a labeled
code block.
- Line 403: The step label is incorrect: remove the "(force remove)" note from
Step 1.1 and add it to the Step 1.2 text so the host deletion action is
documented under Step 1.2 (leave Step 1.1 describing only database updates);
update the sentence for Step 1.2 to read something like "Step 1.2 — force remove
(host record deletion) and wait for the task to finish before rejoining" so the
force-remove action is clearly associated with Step 1.2.

---

Duplicate comments:
In `@docs/disaster-recovery/partial-recovery.md`:
- Around line 88-92: The doc uses the per-host task endpoint
/v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID> to poll a removal task on a host that
was just deleted; since that host may become unavailable, update the example to
query a stable API host instead—either call the cluster-level endpoint
/v1/tasks/<TASK_ID> (preferred) or replace <LOST_HOST_ID> with <HEALTHY_HOST> so
the curl targets a healthy node (/v1/hosts/<HEALTHY_HOST>/tasks/<TASK_ID>), and
update the curl example accordingly.

---

Nitpick comments:
In `@docs/disaster-recovery/full-recovery.md`:
- Around line 330-332: Step 1C.7 is missing the concrete commands to recreate a
container registry and push the Control Plane image using the ARCHIVE_VERSION
variable; update the prose to include an explicit sequence showing how to create
the registry service and then pull/tag/push the Control Plane image using
${ARCHIVE_VERSION} (or, if you prefer not to inline commands, add a clear link
to a registry setup reference), and ensure you reference the ARCHIVE_VERSION
variable defined earlier so operators can copy-paste the exact commands for
"recreate registry" and "build/push Control Plane image" before starting the
Control Plane service in Step 1C.7.
- Around line 461-469: Step 3A.2 currently applies the server-mode cleanup
(removing ${PGEDGE_DATA_DIR}/etcd, /certificates, and generated.config.json) to
all hosts; update this section to mirror partial-recovery.md by distinguishing
server-mode vs client-mode hosts: for server-mode instruct removal of etcd,
certificates, and generated.config.json, and for client-mode instruct removal of
only generated.config.json (use the same labels "server-mode" and "client-mode"
and reference the PGEDGE_DATA_DIR variable and filenames to make the distinction
clear).

coderabbitai · 2026-02-18T12:59:52Z

docs/disaster-recovery/full-recovery.md

+```bash
+RECOVERY_HOST_EXTERNAL_IP="<recovery-host-external-ip>"  # e.g., 192.168.105.4
+ARCHIVE_VERSION="<control-plane-version>"                  # e.g., 0.6.2
+```


⚠️ Potential issue | 🟡 Minor

ARCHIVE_VERSION is defined but never referenced in the document.

No code block anywhere in the guide uses ${ARCHIVE_VERSION}. Either add the concrete registry-rebuild commands that consume it (see also Step 1C.7 below), or remove it from the variable block to avoid confusing operators.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/disaster-recovery/full-recovery.md` around lines 48 - 51, The variable ARCHIVE_VERSION is declared but unused; either remove the ARCHIVE_VERSION="<control-plane-version>" line from the env block or add the registry-rebuild commands that consume it (e.g., the registry-rebuild/registry restore commands referenced in Step 1C.7) so the guide actually uses ${ARCHIVE_VERSION}; update the examples and any subsequent commands to reference ARCHIVE_VERSION in place of a hard-coded version to make the variable meaningful.

coderabbitai · 2026-02-18T12:59:52Z

docs/disaster-recovery/full-recovery.md

+```
+ID                            HOSTNAME      STATUS    AVAILABILITY   MANAGER STATUS
+4aoqjp3q8jcny4kec5nadcn6x     lima-host-1   Down      Active         Unreachable
+959g9937i62judknmr40kcw9r *   lima-host-2   Ready     Active         Leader
+l0l51d890edg3f0ccd0xppw06     lima-host-3   Down      Active         Unreachable
+```


⚠️ Potential issue | 🟡 Minor

Add a language specifier to the fenced code block.

Same MD040 issue as in partial-recovery.md — the docker node ls output block lacks a language tag.

✏️ Proposed fix

-``` +```text ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS 4aoqjp3q8jcny4kec5nadcn6x lima-host-1 Down Active Unreachable 959g9937i62judknmr40kcw9r * lima-host-2 Ready Active Leader l0l51d890edg3f0ccd0xppw06 lima-host-3 Down Active Unreachable

🧰 Tools

🪛 markdownlint-cli2 (0.21.0)

[warning] 256-256: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/disaster-recovery/full-recovery.md` around lines 256 - 261, The fenced code block showing the `docker node ls` output is missing a language specifier which triggers MD040; update the triple-backtick fence for that block in full-recovery.md to include a language tag such as text (e.g., change ``` to ```text) so the block is explicitly marked and linting passes; locate the block containing the `ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS` table and add the language specifier to the opening fence.

coderabbitai · 2026-02-18T12:59:52Z

docs/disaster-recovery/partial-recovery.md

+```
+ID                            HOSTNAME      STATUS    AVAILABILITY   MANAGER STATUS
+4aoqjp3q8jcny4kec5nadcn6x *   lima-host-1   Ready     Active         Leader
+959g9937i62judknmr40kcw9r     lima-host-2   Ready     Active         Reachable
+l0l51d890edg3f0ccd0xppw06     lima-host-3   Down      Active         Unreachable
+```


⚠️ Potential issue | 🟡 Minor

Add a language specifier to the fenced code block.

The docker node ls example output uses a plain fence, which triggers MD040. Use text or console to silence the warning and improve rendering.

✏️ Proposed fix

-``` +```text ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS 4aoqjp3q8jcny4kec5nadcn6x * lima-host-1 Ready Active Leader 959g9937i62judknmr40kcw9r lima-host-2 Ready Active Reachable l0l51d890edg3f0ccd0xppw06 lima-host-3 Down Active Unreachable

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

```

ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS

4aoqjp3q8jcny4kec5nadcn6x * lima-host-1 Ready Active Leader

959g9937i62judknmr40kcw9r lima-host-2 Ready Active Reachable

l0l51d890edg3f0ccd0xppw06 lima-host-3 Down Active Unreachable

```

🧰 Tools

🪛 markdownlint-cli2 (0.21.0)

[warning] 105-105: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/disaster-recovery/partial-recovery.md` around lines 105 - 110, Add a language specifier to the fenced code block containing the `docker node ls` example (the block showing ID, HOSTNAME, STATUS, AVAILABILITY, MANAGER STATUS) to silence MD040 and improve rendering; update the opening fence from ``` to ```text (or ```console) so the block becomes a labeled code block.

coderabbitai · 2026-02-18T12:59:52Z

docs/disaster-recovery/partial-recovery.md

⚠️ Potential issue | 🟡 Minor

Wrong step reference — "(force remove)" belongs to Step 1.2, not Step 1.1.

Step 1.1 updates databases; the host record deletion (force remove) happens in Step 1.2.

✏️ Proposed fix

-**Solution:** Complete Step 1.1 (force remove) and wait for the task to finish before rejoining. +**Solution:** Complete Steps 1.1 and 1.2 (remove databases and force-remove the host) and wait for the task to finish before rejoining.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

**Solution:** Complete Step 1.1 (force remove) and wait for the task to finish before rejoining.

**Solution:** Complete Steps 1.1 and 1.2 (remove databases and force-remove the host) and wait for the task to finish before rejoining.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@docs/disaster-recovery/partial-recovery.md` at line 403, The step label is incorrect: remove the "(force remove)" note from Step 1.1 and add it to the Step 1.2 text so the host deletion action is documented under Step 1.2 (leave Step 1.1 describing only database updates); update the sentence for Step 1.2 to read something like "Step 1.2 — force remove (host record deletion) and wait for the task to finish before rejoining" so the force-remove action is clearly associated with Step 1.2.

tsivaprasad and others added 7 commits February 13, 2026 19:16

docs: etcd and swarm failure scenarios and recovery procedures

1a01b65

Partial and full recovery etcd/swarm docs

7813e44

Partial and full recovery etcd/swarm docs

cdd02f3

full recovery etcd/swarm docs duplication removed

b5bb1ff

Add partial recovery documentation for etcd and swarm (quorum intact)

a77f23c

Add full recovery documentation for etcd and swarm (no quorum)

62ba9b7

Minor issue resolved

92c3c1a

moizpgedge requested a review from jason-lynch February 17, 2026 15:46

moizpgedge requested a review from tsivaprasad February 17, 2026 15:46

coderabbitai bot reviewed Feb 17, 2026

View reviewed changes

changelog updated

ec2be84

jason-lynch reviewed Feb 17, 2026

View reviewed changes

partially addressed review comments and there are To Do's to finish

b8fcb8a

coderabbitai bot reviewed Feb 18, 2026

View reviewed changes

Suggested comments addressed

3f6b4ca

coderabbitai bot reviewed Feb 18, 2026

View reviewed changes

Suggested comments addressed

45bc71a

	curl -L https://github.com/etcd-io/etcd/releases/download/v3.6.5/etcd-v3.6.5-linux-${ARCH}.tar.gz \| tar --strip-components 1 -xz -C /tmp etcd-v3.6.5-linux-${ARCH}/etcdutl
	curl -L https://github.com/etcd-io/etcd/releases/download/v3.6.5/etcd-v3.6.5-linux-${ARCH}.tar.gz \
	\| tar --strip-components 1 -xz -C /tmp etcd-v3.6.5-linux-${ARCH}/etcdutl

	Solution: Complete Step 1.1 (force remove) and wait for the task to finish before rejoining.
	Solution: Complete Steps 1.1 and 1.2 (remove databases and force-remove the host) and wait for the task to finish before rejoining.

Conversation

moizpgedge commented Feb 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Testing

Checklist

Notes for Reviewers

ChangeLog

Uh oh!

coderabbitai bot commented Feb 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Poem

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai bot Feb 17, 2026

Choose a reason for hiding this comment

Uh oh!

jason-lynch left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Feb 18, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Comments

moizpgedge commented Feb 17, 2026 •

edited

Loading

coderabbitai bot commented Feb 17, 2026 •

edited

Loading