Skip to content

Docs: add disaster recovery guides for partial and complete failure#268

Open
moizpgedge wants to merge 11 commits intomainfrom
docs/PLAT-313-314-etcd-swarm-disaster-recovery
Open

Docs: add disaster recovery guides for partial and complete failure#268
moizpgedge wants to merge 11 commits intomainfrom
docs/PLAT-313-314-etcd-swarm-disaster-recovery

Conversation

@moizpgedge
Copy link
Contributor

@moizpgedge moizpgedge commented Feb 17, 2026

Summary

Adds disaster recovery documentation: an index plus two guides—Partial Failure Recovery (Quorum Intact) and Complete Failure Recovery (No Quorum)—so operators can recover the Control Plane when hosts or quorum are lost.

Changes

  • docs/disaster-recovery/index.md — New index that routes to the right guide and warns about backing up the Control Plane data volume.
  • docs/disaster-recovery/partial-recovery.md — Guide for when etcd and Swarm quorum are intact but one or more hosts are lost: force-remove host via API, optional DB spec updates, Swarm cleanup, verification, then restore (Path A: host still accessible; Path B: provision new host), rejoin cluster, and restore database capacity.
  • docs/disaster-recovery/full-recovery.md — Guide for when etcd and/or Swarm quorum is lost: Phase 1 (1A total quorum loss, 1B majority quorum loss, 1C etcd + Swarm loss) with snapshot restore, Swarm re-init, registry rebuild, and ForceNewCluster where needed; Phase 2 remove dead hosts and clean DBs; Phases 3–5 rejoin/provision hosts, join cluster, restore DB capacity, and final verification. Includes prerequisites, variable setup, and common issues/troubleshooting.

Testing

  • Documentation only; no code or test changes.
  • Manual review: follow steps in a test environment (e.g. Lima fixture) for both partial and full-recovery flows if desired.

Checklist

  • Tests added or updated (unit and/or e2e, as needed)
  • Documentation updated (if needed)
  • Issue is linked (branch name or URL in PR description)
  • Changelog entry added for user-facing behavior changes (changes/unreleased/Added-disaster-recovery-docs.yaml)
  • Breaking changes (if any) are clearly called out in the PR description

Notes for Reviewers

  • Branch: docs/PLAT-313-314-etcd-swarm-disaster-recoveryPLAT-313 and PLAT-314
  • Partial guide assumes API is up; full guide assumes API is down until Phase 1 is done. Cross-links between them (e.g. partial-recovery → full-recovery for quorum loss) are already in place.
  • Full-recovery Phase 1C (etcd + Swarm quorum loss) is the most involved: Swarm re-init, dead node removal, registry recreate, image build/push, and PGEDGE_ETCD_SERVER__FORCE_NEW_CLUSTER=true; worth a careful read.

ChangeLog

Entry added in changes/unreleased/Added-disaster-recovery-docs.yaml for the disaster recovery documentation.

@coderabbitai
Copy link

coderabbitai bot commented Feb 17, 2026

📝 Walkthrough

Walkthrough

Adds comprehensive disaster recovery documentation for the Control Plane: an index plus two guides (partial recovery with quorum intact and complete recovery with no quorum). Documents phased workflows, procedures for etcd and Docker Swarm recovery, prerequisites, verification, and troubleshooting. No code or public API changes.

Changes

Cohort / File(s) Summary
Disaster Recovery Guides
docs/disaster-recovery/index.md, docs/disaster-recovery/partial-recovery.md, docs/disaster-recovery/full-recovery.md
Introduces three new docs: an index listing recovery scenarios; a partial-recovery guide for quorum-intact host loss with phase-based procedures, CLI/API examples, verification, and troubleshooting; and a full-recovery guide for no-quorum scenarios covering etcd snapshot restores, Swarm reinitialization, multi-branch phase workflows, and detailed commands.
Changelog Entry
changes/unreleased/Added-disaster-recovery-docs.yaml
Adds a release-note YAML recording the addition of the disaster recovery documentation with timestamp and description.

Poem

🐇 I nudge the burrow, maps unfurled,
Phases five to mend the world,
Etcd snaps and Swarm reborn,
Hosts return at break of morn,
Hop — follow steps, the cluster's healed.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Docs: add disaster recovery guides for partial and complete failure' clearly summarizes the main change—addition of comprehensive disaster recovery documentation.
Description check ✅ Passed The PR description is comprehensive and follows the template structure with all key sections: Summary, Changes, Testing, Checklist, and Notes for Reviewers. All critical items are addressed.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch docs/PLAT-313-314-etcd-swarm-disaster-recovery

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (4)
docs/disaster-recovery/full-recovery.md (2)

288-295: Add a language specifier to the fenced code block.

Same as in partial-recovery.md — the docker node ls example output block should use ```text or ```console.

Proposed fix
-```
+```text
 ID                            HOSTNAME      STATUS    AVAILABILITY   MANAGER STATUS
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/disaster-recovery/full-recovery.md` around lines 288 - 295, Update the
fenced code block showing the docker node ls example output to include a
language specifier (e.g., use ```text or ```console instead of plain ```) so
syntax-aware renderers treat it as plain text; locate the fenced block
containing the sample output for "docker node ls" (the block starting with the
ID/HOSTNAME/STATUS header) and replace the opening fence with ```text (or
```console) accordingly.

204-210: Phase 1B backup step only preserves etcd directory, unlike Phase 1A.

Phase 1A Step 1A.3 backs up etcd data, certificates, and generated.config.json. Phase 1B Step 1B.1 only backs up the etcd directory. Since the surviving host in 1B still has valid certificates and config, this is likely intentional — but a brief note explaining the difference would help operators avoid second-guessing during a high-stress recovery.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/disaster-recovery/full-recovery.md` around lines 204 - 210, Phase 1B
Step 1B.1 only backs up the etcd directory whereas Phase 1A Step 1A.3 also backs
up certificates and generated.config.json; add a brief explanatory note under
"Step 1B.1: Backup Existing etcd Data" clarifying that in Phase 1B the surviving
host is expected to already have valid certificates and configuration so only
the etcd data is preserved, and reference Step 1A.3 for the full backup behavior
to avoid operator confusion.
docs/disaster-recovery/partial-recovery.md (1)

134-140: Add a language specifier to the fenced code block.

The example docker node ls output block on Line 135 lacks a language identifier. Use ```text or ```console for plain output blocks to satisfy the MD040 lint rule and improve rendering consistency.

Proposed fix
-```
+```text
 ID                            HOSTNAME      STATUS    AVAILABILITY   MANAGER STATUS
 4aoqjp3q8jcny4kec5nadcn6x *   lima-host-1   Ready     Active         Leader
 959g9937i62judknmr40kcw9r     lima-host-2   Ready     Active         Reachable
 l0l51d890edg3f0ccd0xppw06     lima-host-3   Down      Active         Unreachable
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against the current code and only fix it if needed.

In @docs/disaster-recovery/partial-recovery.md around lines 134 - 140, The
fenced code block that shows the docker node ls output (the block starting with
"ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS" in the partial-recovery
example) is missing a language specifier and triggers MD040; update that block
by adding a plain-output language tag such as text or console immediately
after the opening backticks so the snippet becomes a labelled code fence (e.g.,
change totext) to satisfy linting and rendering.


</details>

</blockquote></details>
<details>
<summary>mkdocs.yml (1)</summary><blockquote>

`67-69`: **Consider adding `disaster-recovery/index.md` as the section index page.**

Other nav sections (e.g., "Prerequisites", "Installing Control Plane") include their `index.md` as the first entry, making the section header clickable. Currently, "Disaster Recovery" is just a label and `index.md` isn't reachable from the nav, even though it exists and contains useful context (the backup warning).

<details>
<summary>Proposed fix</summary>

```diff
   - Disaster Recovery:
+      - disaster-recovery/index.md
       - Partial Failure Recovery (Quorum Intact): disaster-recovery/partial-recovery.md
       - Complete Failure Recovery (No Quorum): disaster-recovery/full-recovery.md
```
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>

```
Verify each finding against the current code and only fix it if needed.

In `@mkdocs.yml` around lines 67 - 69, The "Disaster Recovery" nav entry should
include its section index so the header becomes clickable; update the mkdocs.yml
nav under the "Disaster Recovery" section to insert "disaster-recovery/index.md"
as the first entry before "disaster-recovery/partial-recovery.md" and
"disaster-recovery/full-recovery.md" (so the section reads index.md,
partial-recovery.md, full-recovery.md).
```

</details>

</blockquote></details>

</blockquote></details>

<details>
<summary>🤖 Prompt for all review comments with AI agents</summary>

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @docs/disaster-recovery/full-recovery.md:

  • Around line 385-397: The docs show a port mismatch: the registry is published
    on port 5000 but the control plane service sets
    PGEDGE_DOCKER_SWARM__IMAGE_REPOSITORY_HOST to 127.0.0.1:5001/pgedge; update the
    environment variable in the control-plane service creation
    (PGEDGE_DOCKER_SWARM__IMAGE_REPOSITORY_HOST in the
    control-plane_${RECOVERY_HOST_ID} service) to 127.0.0.1:5000/pgedge to match the
    registry, or explicitly document why 5001 is used (e.g., mirror/proxy) if that
    is intentional.

In @docs/disaster-recovery/partial-recovery.md:

  • Around line 76-82: The doc currently shows monitoring the host-removal task
    via /v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID>, but since the host was
    force-removed this endpoint may be unavailable; update the instructions to use a
    cluster-level or recovery-host-scoped task endpoint (e.g., /v1/tasks/<TASK_ID>
    or /v1/hosts/<RECOVERY_HOST_ID>/tasks/<TASK_ID>) or explicitly state that
    /v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID> remains accessible after
    deletion—replace the existing reference to
    /v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID> accordingly and mention the alternative
    endpoints.

Nitpick comments:
In @docs/disaster-recovery/full-recovery.md:

  • Around line 288-295: Update the fenced code block showing the docker node ls
    example output to include a language specifier (e.g., use text or console
    instead of plain ) so syntax-aware renderers treat it as plain text; locate the fenced block containing the sample output for "docker node ls" (the block starting with the ID/HOSTNAME/STATUS header) and replace the opening fence with text (or ```console) accordingly.
  • Around line 204-210: Phase 1B Step 1B.1 only backs up the etcd directory
    whereas Phase 1A Step 1A.3 also backs up certificates and generated.config.json;
    add a brief explanatory note under "Step 1B.1: Backup Existing etcd Data"
    clarifying that in Phase 1B the surviving host is expected to already have valid
    certificates and configuration so only the etcd data is preserved, and reference
    Step 1A.3 for the full backup behavior to avoid operator confusion.

In @docs/disaster-recovery/partial-recovery.md:

  • Around line 134-140: The fenced code block that shows the docker node ls
    output (the block starting with "ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS"
    in the partial-recovery example) is missing a language specifier and triggers
    MD040; update that block by adding a plain-output language tag such as text or console immediately after the opening backticks so the snippet becomes a
    labelled code fence (e.g., change totext) to satisfy linting and
    rendering.

In @mkdocs.yml:

  • Around line 67-69: The "Disaster Recovery" nav entry should include its
    section index so the header becomes clickable; update the mkdocs.yml nav under
    the "Disaster Recovery" section to insert "disaster-recovery/index.md" as the
    first entry before "disaster-recovery/partial-recovery.md" and
    "disaster-recovery/full-recovery.md" (so the section reads index.md,
    partial-recovery.md, full-recovery.md).

</details>

<!-- This is an auto-generated comment by CodeRabbit for review status -->

Comment on lines 76 to 82
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Task monitoring endpoint references the removed host.

Line 78 monitors the host removal task via /v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID>, but the host was just force-removed in Step 1.1. This endpoint may not be available after deletion. Consider using a cluster-level or recovery-host-scoped task endpoint instead, or clarify the expected behavior (e.g., whether the task endpoint remains accessible for a removed host).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/disaster-recovery/partial-recovery.md` around lines 76 - 82, The doc
currently shows monitoring the host-removal task via
/v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID>, but since the host was force-removed
this endpoint may be unavailable; update the instructions to use a cluster-level
or recovery-host-scoped task endpoint (e.g., /v1/tasks/<TASK_ID> or
/v1/hosts/<RECOVERY_HOST_ID>/tasks/<TASK_ID>) or explicitly state that
/v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID> remains accessible after
deletion—replace the existing reference to
/v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID> accordingly and mention the alternative
endpoints.

Copy link
Member

@jason-lynch jason-lynch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is getting closer! There are still a lot of repeated steps, and a few places where we're doing the same things in different ways. I think we can do more to deduplicate and simplify these documents. Could this just be one document?:

  • Disaster recovery
    • Restoring Docker Swarm
      • Reinitializing the Swarm
      • Joining hosts to the new Swarm
      • Removing old Swarm nodes
    • Restoring the Control Plane
      • Reinitializing the Control Plane Etcd cluster
      • Updating databases to remove old hosts
      • Removing old hosts
      • Re-adding hosts
      • Updating databases to re-add hosts

mkdocs.yml Outdated
Comment on lines 67 to 69
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please remove this for now? I think @mmols had suggested that we go through QA and documentation review before we publish this to the site.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Registry recreation" and "image rebuild" are exclusive to our development environment, so we should not document them in our user-facing documentation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the conditions where it becomes necessary to do this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't write specific instructions for this step because it's very dependent on the user's environment. You can simulate it locally using cp or Lima's snapshot tool, but in a real environment it will look quite different. Here are some examples of different procedures that users might follow:

We can't document every possible way of restoring a volume or an instance from a backup, so I wouldn't get more specific than this first sentence.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will look better split onto two lines:

Suggested change
curl -L https://github.com/etcd-io/etcd/releases/download/v3.6.5/etcd-v3.6.5-linux-${ARCH}.tar.gz | tar --strip-components 1 -xz -C /tmp etcd-v3.6.5-linux-${ARCH}/etcdutl
curl -L https://github.com/etcd-io/etcd/releases/download/v3.6.5/etcd-v3.6.5-linux-${ARCH}.tar.gz \
| tar --strip-components 1 -xz -C /tmp etcd-v3.6.5-linux-${ARCH}/etcdutl

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not important. The order should not matter.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Healthy client-mode hosts do not need to be removed at all, so users should never be in this position.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove these Lima-specific instructions. This is something that's specific to our development environment, so these instructions are not meaningful to users,

If Swarm no longer has the service definition:

```bash
docker stack deploy -c <path-to-stack-yaml> control-plane
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the first time we mention the stack definition in this document. The stack definition should be listed as a pre-requisite so that users know to prepare it ahead of time.

Also - keep in mind that doing this deploy will recreate and scale up all services in the stack. I would just do this one time after I've gotten all of my hosts ready to deploy again.


---

## Phase 1: Remove the Failed Host
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's be consistent with the full recovery document and say that users should update the database to remove the lost host first, and then remove the host without the force parameter. As I mentioned in a comment on the other doc, I'm leaning towards removing the force parameter altogether from that remove host endpoint because it can lead to problems.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (2)
docs/disaster-recovery/full-recovery.md (2)

272-277: Fenced code block at line 272 is missing a language specifier (MD040).

♻️ Proposed fix
-```
+```text
 ID                            HOSTNAME      STATUS    AVAILABILITY   MANAGER STATUS
 ...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/disaster-recovery/full-recovery.md` around lines 272 - 277, The fenced
code block starting with triple-backticks that contains the table header "ID    
HOSTNAME      STATUS    AVAILABILITY   MANAGER STATUS" is missing a language
specifier; update the opening fence to include a language (e.g., change ``` to
```text or ```console) so the block complies with MD040 and renders/linters
correctly, leaving the block contents unchanged.

77-80: Indented code blocks inside admonitions should be fenced (MD046).

markdownlint flags five locations (lines 79, 151, 184, 254, 551) where 4-space indented blocks are used inside !!! admonitions. These may not render as code blocks in all Markdown renderers; fenced blocks are the reliable and consistent approach.

Example fix for the pattern (apply similarly at all five locations):

♻️ Proposed fix pattern
 !!! important "Reset Cluster Membership for Multi-Node Clusters"
 
-    If your cluster previously had more than one node, you **must** use `etcdutl snapshot restore` to reset cluster membership. Simply copying the etcd directory will not work.
+    If your cluster previously had more than one node, you **must** use `etcdutl snapshot restore`
+    to reset cluster membership. Simply copying the etcd directory will not work.

For any nested code snippets inside the admonition, use a fenced block indented with 4 spaces:

 !!! note
 
-    If Docker Swarm is still functional (only etcd lost quorum), use [Phase 1B](`#phase-1b-majority-quorum-loss`) instead.
+    If Docker Swarm is still functional (only etcd lost quorum), use [Phase 1B](`#phase-1b-majority-quorum-loss`) instead.

Also applies to: 149-152, 182-185, 252-255, 549-552

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/disaster-recovery/full-recovery.md` around lines 77 - 80, Replace the
4-space indented code blocks inside the "!!! important" admonitions with fenced
code blocks so they render consistently; locate the admonition starting with
'!!! important "Reset Cluster Membership for Multi-Node Clusters"' and the other
admonitions at the flagged locations (around lines referenced in the review: 79,
151, 184, 254, 551) and change any nested indented blocks to triple-backtick
fenced blocks (```), preserving language hints if present and maintaining the
same indentation level for the admonition content.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/disaster-recovery/full-recovery.md`:
- Around line 200-213: Phase 1B currently contradicts itself: "Option A"
instructs running etcdctl snapshot save against a live endpoint (the etcdctl
snapshot save command) even though Phase 1B assumes majority quorum loss and
etcd is inaccessible; either delete the entire "Option A: Create Snapshot from
Existing Data" subsection or rewrite it to clearly define a distinct
sub-scenario (e.g., "Partial Quorum: single surviving node with etcd running")
that is explicitly separate from the majority-quorum-lost path, and in that
rewrite explain preconditions, when to run the etcdctl snapshot save command,
and add a cross-reference to "Option B" (pre-existing snapshot) for the true
quorum-loss recovery path so readers are not misled.
- Line 51: The document contains literal draft comments using C-style "// To Do"
text (e.g. "// TO DO link the installation.md file section here", "// To Do A
snapshot file is optional...", "// To Do The title of this section, "Majority
Quorum Loss"...", "// To Do a snapshot is optional...", "// To Do use etcdutl
snapshot restore", "// To Do we will keep one section in final doc") which
render verbatim and trigger MD046; remove or resolve each by either deleting
them, converting them to proper HTML comments (<!-- TODO: ... -->) or replacing
them with real content or a Markdown TODO/task list or an actual link (e.g.
install section link) so the draft notes no longer appear in the published
output and the lint warning is eliminated.
- Around line 315-346: Phase 1C is incomplete: add concrete Step 1C.4 and 1C.5
and replace the Step 1C.6 TODO with the etcd ForceNewCluster restore steps, then
update the dangling cross-reference and summary table. Specifically, create Step
1C.4 "Recreate container registry on surviving host" that documents rebuilding
or restoring the registry (recreate registry container, load required
images/tags, push images back into registry) and reference the surviving
host/node names (e.g., host-2/n2); add Step 1C.5 "Prepare etcd snapshot and
prerequisites" describing how to fetch/transfer the snapshot, validate it, stop
etcd on recovery nodes and prepare data dirs; replace Step 1C.6 with a full
"ForceNewCluster etcdutl snapshot restore" sequence (use etcdutl snapshot
restore with --force-new-cluster and correct --name/--initial-cluster settings,
restore to expected data-dir and systemd restart steps) and ensure Step 1C.7
remains the verification curl. Finally, update the Common Issues cross-reference
that mentions "Phase 1C, Step 1C.4" and the Summary table range/actions to
include the rebuilt registry and ForceNewCluster restore so documentation is
consistent.
- Around line 367-369: Update the Step 2.1 curl example so the URL uses the same
port variable as the rest of the document: replace the hardcoded ":3000" in the
curl command URL (the line starting with curl -X POST
"http://${RECOVERY_HOST_IP}:3000/...) with the template variable ":${API_PORT}"
to match other API calls and maintain consistency.
- Around line 37-42: The Set Variables block is missing RECOVERY_HOST_ID and
API_PORT which are referenced throughout the guide; update the initial variables
list to include RECOVERY_HOST_ID (the identifier/name of the recovery host used
in commands and SSH/etcd membership steps) and API_PORT (the control-plane API
port used by the recovery host), and ensure the examples show sensible
placeholders like RECOVERY_HOST_ID="<recovery-host-id>" and API_PORT=<api-port>
so all subsequent snippets (references to RECOVERY_HOST_ID and API_PORT) work
without reader guesswork; verify mentions in functions/sections that reference
RECOVERY_HOST_ID and API_PORT still match the variable names exactly.
- Around line 130-133: Remove the trailing space after the backslash in the curl
line so the shell line continuation works (ensure the backslash is the last
character before the newline in the "curl -L
https://github.com/etcd-io/etcd/releases/download/v3.6.5/etcd-v3.6.5-linux-${ARCH}.tar.gz
\ | tar --strip-components 1 -xz -C /tmp etcd-v3.6.5-linux-${ARCH}/etcdutl"
snippet) or collapse into a single line; also update the hardcoded release tag
from v3.6.5 to v3.6.8 or replace it with a VERSION variable so the etcd download
URL and the etcd-v... names stay consistent and current, and verify the
subsequent sudo mv/ chmod lines still reference the same etcdutl filename.

---

Nitpick comments:
In `@docs/disaster-recovery/full-recovery.md`:
- Around line 272-277: The fenced code block starting with triple-backticks that
contains the table header "ID                            HOSTNAME      STATUS   
AVAILABILITY   MANAGER STATUS" is missing a language specifier; update the
opening fence to include a language (e.g., change ``` to ```text or ```console)
so the block complies with MD040 and renders/linters correctly, leaving the
block contents unchanged.
- Around line 77-80: Replace the 4-space indented code blocks inside the "!!!
important" admonitions with fenced code blocks so they render consistently;
locate the admonition starting with '!!! important "Reset Cluster Membership for
Multi-Node Clusters"' and the other admonitions at the flagged locations (around
lines referenced in the review: 79, 151, 184, 254, 551) and change any nested
indented blocks to triple-backtick fenced blocks (```), preserving language
hints if present and maintaining the same indentation level for the admonition
content.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (2)
docs/disaster-recovery/full-recovery.md (2)

330-332: Step 1C.7 lacks concrete commands for registry recreation — the hardest step in Phase 1C.

This is the only recovery path for operators who have lost both etcd and Swarm quorum. The step instructs them to "recreate the registry and build/push the Control Plane image" but provides no commands, and the ARCHIVE_VERSION variable defined in Set Variables (line 50) is clearly intended for exactly these commands. An operator in a production incident cannot act on prose alone.

Consider adding a code block with the docker registry/image commands (e.g., docker service create for the registry, docker pull/docker tag/docker push using ${ARCHIVE_VERSION}), or at minimum link to a dedicated registry setup reference.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/disaster-recovery/full-recovery.md` around lines 330 - 332, Step 1C.7 is
missing the concrete commands to recreate a container registry and push the
Control Plane image using the ARCHIVE_VERSION variable; update the prose to
include an explicit sequence showing how to create the registry service and then
pull/tag/push the Control Plane image using ${ARCHIVE_VERSION} (or, if you
prefer not to inline commands, add a clear link to a registry setup reference),
and ensure you reference the ARCHIVE_VERSION variable defined earlier so
operators can copy-paste the exact commands for "recreate registry" and
"build/push Control Plane image" before starting the Control Plane service in
Step 1C.7.

461-469: Step 3A.2 doesn't distinguish server-mode from client-mode hosts, unlike the partial-recovery guide.

partial-recovery.md Step 3A.2 has separate blocks for server-mode (clear etcd + certificates + generated.config.json) and client-mode (clear only generated.config.json). This guide applies the server-mode commands to all hosts unconditionally. While rm -rf on absent directories is harmless, aligning with the partial-recovery guide's pattern avoids operator confusion when working across both docs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/disaster-recovery/full-recovery.md` around lines 461 - 469, Step 3A.2
currently applies the server-mode cleanup (removing ${PGEDGE_DATA_DIR}/etcd,
/certificates, and generated.config.json) to all hosts; update this section to
mirror partial-recovery.md by distinguishing server-mode vs client-mode hosts:
for server-mode instruct removal of etcd, certificates, and
generated.config.json, and for client-mode instruct removal of only
generated.config.json (use the same labels "server-mode" and "client-mode" and
reference the PGEDGE_DATA_DIR variable and filenames to make the distinction
clear).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/disaster-recovery/full-recovery.md`:
- Around line 48-51: The variable ARCHIVE_VERSION is declared but unused; either
remove the ARCHIVE_VERSION="<control-plane-version>" line from the env block or
add the registry-rebuild commands that consume it (e.g., the
registry-rebuild/registry restore commands referenced in Step 1C.7) so the guide
actually uses ${ARCHIVE_VERSION}; update the examples and any subsequent
commands to reference ARCHIVE_VERSION in place of a hard-coded version to make
the variable meaningful.
- Around line 256-261: The fenced code block showing the `docker node ls` output
is missing a language specifier which triggers MD040; update the triple-backtick
fence for that block in full-recovery.md to include a language tag such as text
(e.g., change ``` to ```text) so the block is explicitly marked and linting
passes; locate the block containing the `ID HOSTNAME STATUS AVAILABILITY MANAGER
STATUS` table and add the language specifier to the opening fence.

In `@docs/disaster-recovery/partial-recovery.md`:
- Around line 105-110: Add a language specifier to the fenced code block
containing the `docker node ls` example (the block showing ID, HOSTNAME, STATUS,
AVAILABILITY, MANAGER STATUS) to silence MD040 and improve rendering; update the
opening fence from ``` to ```text (or ```console) so the block becomes a labeled
code block.
- Line 403: The step label is incorrect: remove the "(force remove)" note from
Step 1.1 and add it to the Step 1.2 text so the host deletion action is
documented under Step 1.2 (leave Step 1.1 describing only database updates);
update the sentence for Step 1.2 to read something like "Step 1.2 — force remove
(host record deletion) and wait for the task to finish before rejoining" so the
force-remove action is clearly associated with Step 1.2.

---

Duplicate comments:
In `@docs/disaster-recovery/partial-recovery.md`:
- Around line 88-92: The doc uses the per-host task endpoint
/v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID> to poll a removal task on a host that
was just deleted; since that host may become unavailable, update the example to
query a stable API host instead—either call the cluster-level endpoint
/v1/tasks/<TASK_ID> (preferred) or replace <LOST_HOST_ID> with <HEALTHY_HOST> so
the curl targets a healthy node (/v1/hosts/<HEALTHY_HOST>/tasks/<TASK_ID>), and
update the curl example accordingly.

---

Nitpick comments:
In `@docs/disaster-recovery/full-recovery.md`:
- Around line 330-332: Step 1C.7 is missing the concrete commands to recreate a
container registry and push the Control Plane image using the ARCHIVE_VERSION
variable; update the prose to include an explicit sequence showing how to create
the registry service and then pull/tag/push the Control Plane image using
${ARCHIVE_VERSION} (or, if you prefer not to inline commands, add a clear link
to a registry setup reference), and ensure you reference the ARCHIVE_VERSION
variable defined earlier so operators can copy-paste the exact commands for
"recreate registry" and "build/push Control Plane image" before starting the
Control Plane service in Step 1C.7.
- Around line 461-469: Step 3A.2 currently applies the server-mode cleanup
(removing ${PGEDGE_DATA_DIR}/etcd, /certificates, and generated.config.json) to
all hosts; update this section to mirror partial-recovery.md by distinguishing
server-mode vs client-mode hosts: for server-mode instruct removal of etcd,
certificates, and generated.config.json, and for client-mode instruct removal of
only generated.config.json (use the same labels "server-mode" and "client-mode"
and reference the PGEDGE_DATA_DIR variable and filenames to make the distinction
clear).

Comment on lines +48 to +51
```bash
RECOVERY_HOST_EXTERNAL_IP="<recovery-host-external-ip>" # e.g., 192.168.105.4
ARCHIVE_VERSION="<control-plane-version>" # e.g., 0.6.2
```
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

ARCHIVE_VERSION is defined but never referenced in the document.

No code block anywhere in the guide uses ${ARCHIVE_VERSION}. Either add the concrete registry-rebuild commands that consume it (see also Step 1C.7 below), or remove it from the variable block to avoid confusing operators.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/disaster-recovery/full-recovery.md` around lines 48 - 51, The variable
ARCHIVE_VERSION is declared but unused; either remove the
ARCHIVE_VERSION="<control-plane-version>" line from the env block or add the
registry-rebuild commands that consume it (e.g., the registry-rebuild/registry
restore commands referenced in Step 1C.7) so the guide actually uses
${ARCHIVE_VERSION}; update the examples and any subsequent commands to reference
ARCHIVE_VERSION in place of a hard-coded version to make the variable
meaningful.

Comment on lines +256 to +261
```
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
4aoqjp3q8jcny4kec5nadcn6x lima-host-1 Down Active Unreachable
959g9937i62judknmr40kcw9r * lima-host-2 Ready Active Leader
l0l51d890edg3f0ccd0xppw06 lima-host-3 Down Active Unreachable
```
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Add a language specifier to the fenced code block.

Same MD040 issue as in partial-recovery.md — the docker node ls output block lacks a language tag.

✏️ Proposed fix
-```
+```text
 ID                            HOSTNAME      STATUS    AVAILABILITY   MANAGER STATUS
 4aoqjp3q8jcny4kec5nadcn6x     lima-host-1   Down      Active         Unreachable
 959g9937i62judknmr40kcw9r *   lima-host-2   Ready     Active         Leader
 l0l51d890edg3f0ccd0xppw06     lima-host-3   Down      Active         Unreachable
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 256-256: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/disaster-recovery/full-recovery.md` around lines 256 - 261, The fenced
code block showing the `docker node ls` output is missing a language specifier
which triggers MD040; update the triple-backtick fence for that block in
full-recovery.md to include a language tag such as text (e.g., change ``` to
```text) so the block is explicitly marked and linting passes; locate the block
containing the `ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS` table and add
the language specifier to the opening fence.

Comment on lines +105 to +110
```
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
4aoqjp3q8jcny4kec5nadcn6x * lima-host-1 Ready Active Leader
959g9937i62judknmr40kcw9r lima-host-2 Ready Active Reachable
l0l51d890edg3f0ccd0xppw06 lima-host-3 Down Active Unreachable
```
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Add a language specifier to the fenced code block.

The docker node ls example output uses a plain fence, which triggers MD040. Use text or console to silence the warning and improve rendering.

✏️ Proposed fix
-```
+```text
 ID                            HOSTNAME      STATUS    AVAILABILITY   MANAGER STATUS
 4aoqjp3q8jcny4kec5nadcn6x *   lima-host-1   Ready     Active         Leader
 959g9937i62judknmr40kcw9r     lima-host-2   Ready     Active         Reachable
 l0l51d890edg3f0ccd0xppw06     lima-host-3   Down      Active         Unreachable
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
4aoqjp3q8jcny4kec5nadcn6x * lima-host-1 Ready Active Leader
959g9937i62judknmr40kcw9r lima-host-2 Ready Active Reachable
l0l51d890edg3f0ccd0xppw06 lima-host-3 Down Active Unreachable
```
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 105-105: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/disaster-recovery/partial-recovery.md` around lines 105 - 110, Add a
language specifier to the fenced code block containing the `docker node ls`
example (the block showing ID, HOSTNAME, STATUS, AVAILABILITY, MANAGER STATUS)
to silence MD040 and improve rendering; update the opening fence from ``` to
```text (or ```console) so the block becomes a labeled code block.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Wrong step reference — "(force remove)" belongs to Step 1.2, not Step 1.1.

Step 1.1 updates databases; the host record deletion (force remove) happens in Step 1.2.

✏️ Proposed fix
-**Solution:** Complete Step 1.1 (force remove) and wait for the task to finish before rejoining.
+**Solution:** Complete Steps 1.1 and 1.2 (remove databases and force-remove the host) and wait for the task to finish before rejoining.
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
**Solution:** Complete Step 1.1 (force remove) and wait for the task to finish before rejoining.
**Solution:** Complete Steps 1.1 and 1.2 (remove databases and force-remove the host) and wait for the task to finish before rejoining.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/disaster-recovery/partial-recovery.md` at line 403, The step label is
incorrect: remove the "(force remove)" note from Step 1.1 and add it to the Step
1.2 text so the host deletion action is documented under Step 1.2 (leave Step
1.1 describing only database updates); update the sentence for Step 1.2 to read
something like "Step 1.2 — force remove (host record deletion) and wait for the
task to finish before rejoining" so the force-remove action is clearly
associated with Step 1.2.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants

Comments