Docs: add disaster recovery guides for partial and complete failure#268
Docs: add disaster recovery guides for partial and complete failure#268moizpgedge wants to merge 11 commits intomainfrom
Conversation
📝 WalkthroughWalkthroughAdds comprehensive disaster recovery documentation for the Control Plane: an index plus two guides (partial recovery with quorum intact and complete recovery with no quorum). Documents phased workflows, procedures for etcd and Docker Swarm recovery, prerequisites, verification, and troubleshooting. No code or public API changes. Changes
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (4)
docs/disaster-recovery/full-recovery.md (2)
288-295: Add a language specifier to the fenced code block.Same as in partial-recovery.md — the
docker node lsexample output block should use```textor```console.Proposed fix
-``` +```text ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/disaster-recovery/full-recovery.md` around lines 288 - 295, Update the fenced code block showing the docker node ls example output to include a language specifier (e.g., use ```text or ```console instead of plain ```) so syntax-aware renderers treat it as plain text; locate the fenced block containing the sample output for "docker node ls" (the block starting with the ID/HOSTNAME/STATUS header) and replace the opening fence with ```text (or ```console) accordingly.
204-210: Phase 1B backup step only preserves etcd directory, unlike Phase 1A.Phase 1A Step 1A.3 backs up etcd data, certificates, and
generated.config.json. Phase 1B Step 1B.1 only backs up the etcd directory. Since the surviving host in 1B still has valid certificates and config, this is likely intentional — but a brief note explaining the difference would help operators avoid second-guessing during a high-stress recovery.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/disaster-recovery/full-recovery.md` around lines 204 - 210, Phase 1B Step 1B.1 only backs up the etcd directory whereas Phase 1A Step 1A.3 also backs up certificates and generated.config.json; add a brief explanatory note under "Step 1B.1: Backup Existing etcd Data" clarifying that in Phase 1B the surviving host is expected to already have valid certificates and configuration so only the etcd data is preserved, and reference Step 1A.3 for the full backup behavior to avoid operator confusion.docs/disaster-recovery/partial-recovery.md (1)
134-140: Add a language specifier to the fenced code block.The example
docker node lsoutput block on Line 135 lacks a language identifier. Use```textor```consolefor plain output blocks to satisfy the MD040 lint rule and improve rendering consistency.Proposed fix
-``` +```text ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS 4aoqjp3q8jcny4kec5nadcn6x * lima-host-1 Ready Active Leader 959g9937i62judknmr40kcw9r lima-host-2 Ready Active Reachable l0l51d890edg3f0ccd0xppw06 lima-host-3 Down Active Unreachable</details> <details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against the current code and only fix it if needed.
In
@docs/disaster-recovery/partial-recovery.mdaround lines 134 - 140, The
fenced code block that shows the docker node ls output (the block starting with
"ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS" in the partial-recovery
example) is missing a language specifier and triggers MD040; update that block
by adding a plain-output language tag such astext orconsole immediately
after the opening backticks so the snippet becomes a labelled code fence (e.g.,
changetotext) to satisfy linting and rendering.</details> </blockquote></details> <details> <summary>mkdocs.yml (1)</summary><blockquote> `67-69`: **Consider adding `disaster-recovery/index.md` as the section index page.** Other nav sections (e.g., "Prerequisites", "Installing Control Plane") include their `index.md` as the first entry, making the section header clickable. Currently, "Disaster Recovery" is just a label and `index.md` isn't reachable from the nav, even though it exists and contains useful context (the backup warning). <details> <summary>Proposed fix</summary> ```diff - Disaster Recovery: + - disaster-recovery/index.md - Partial Failure Recovery (Quorum Intact): disaster-recovery/partial-recovery.md - Complete Failure Recovery (No Quorum): disaster-recovery/full-recovery.md ``` </details> <details> <summary>🤖 Prompt for AI Agents</summary> ``` Verify each finding against the current code and only fix it if needed. In `@mkdocs.yml` around lines 67 - 69, The "Disaster Recovery" nav entry should include its section index so the header becomes clickable; update the mkdocs.yml nav under the "Disaster Recovery" section to insert "disaster-recovery/index.md" as the first entry before "disaster-recovery/partial-recovery.md" and "disaster-recovery/full-recovery.md" (so the section reads index.md, partial-recovery.md, full-recovery.md). ``` </details> </blockquote></details> </blockquote></details> <details> <summary>🤖 Prompt for all review comments with AI agents</summary>Verify each finding against the current code and only fix it if needed.
Inline comments:
In@docs/disaster-recovery/full-recovery.md:
- Around line 385-397: The docs show a port mismatch: the registry is published
on port 5000 but the control plane service sets
PGEDGE_DOCKER_SWARM__IMAGE_REPOSITORY_HOST to 127.0.0.1:5001/pgedge; update the
environment variable in the control-plane service creation
(PGEDGE_DOCKER_SWARM__IMAGE_REPOSITORY_HOST in the
control-plane_${RECOVERY_HOST_ID} service) to 127.0.0.1:5000/pgedge to match the
registry, or explicitly document why 5001 is used (e.g., mirror/proxy) if that
is intentional.In
@docs/disaster-recovery/partial-recovery.md:
- Around line 76-82: The doc currently shows monitoring the host-removal task
via /v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID>, but since the host was
force-removed this endpoint may be unavailable; update the instructions to use a
cluster-level or recovery-host-scoped task endpoint (e.g., /v1/tasks/<TASK_ID>
or /v1/hosts/<RECOVERY_HOST_ID>/tasks/<TASK_ID>) or explicitly state that
/v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID> remains accessible after
deletion—replace the existing reference to
/v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID> accordingly and mention the alternative
endpoints.
Nitpick comments:
In@docs/disaster-recovery/full-recovery.md:
- Around line 288-295: Update the fenced code block showing the docker node ls
example output to include a language specifier (e.g., usetext orconsole
instead of plain) so syntax-aware renderers treat it as plain text; locate the fenced block containing the sample output for "docker node ls" (the block starting with the ID/HOSTNAME/STATUS header) and replace the opening fence withtext (or ```console) accordingly.- Around line 204-210: Phase 1B Step 1B.1 only backs up the etcd directory
whereas Phase 1A Step 1A.3 also backs up certificates and generated.config.json;
add a brief explanatory note under "Step 1B.1: Backup Existing etcd Data"
clarifying that in Phase 1B the surviving host is expected to already have valid
certificates and configuration so only the etcd data is preserved, and reference
Step 1A.3 for the full backup behavior to avoid operator confusion.In
@docs/disaster-recovery/partial-recovery.md:
- Around line 134-140: The fenced code block that shows the docker node ls
output (the block starting with "ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS"
in the partial-recovery example) is missing a language specifier and triggers
MD040; update that block by adding a plain-output language tag such astext orconsole immediately after the opening backticks so the snippet becomes a
labelled code fence (e.g., changetotext) to satisfy linting and
rendering.In
@mkdocs.yml:
- Around line 67-69: The "Disaster Recovery" nav entry should include its
section index so the header becomes clickable; update the mkdocs.yml nav under
the "Disaster Recovery" section to insert "disaster-recovery/index.md" as the
first entry before "disaster-recovery/partial-recovery.md" and
"disaster-recovery/full-recovery.md" (so the section reads index.md,
partial-recovery.md, full-recovery.md).</details> <!-- This is an auto-generated comment by CodeRabbit for review status -->
There was a problem hiding this comment.
Task monitoring endpoint references the removed host.
Line 78 monitors the host removal task via /v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID>, but the host was just force-removed in Step 1.1. This endpoint may not be available after deletion. Consider using a cluster-level or recovery-host-scoped task endpoint instead, or clarify the expected behavior (e.g., whether the task endpoint remains accessible for a removed host).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/disaster-recovery/partial-recovery.md` around lines 76 - 82, The doc
currently shows monitoring the host-removal task via
/v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID>, but since the host was force-removed
this endpoint may be unavailable; update the instructions to use a cluster-level
or recovery-host-scoped task endpoint (e.g., /v1/tasks/<TASK_ID> or
/v1/hosts/<RECOVERY_HOST_ID>/tasks/<TASK_ID>) or explicitly state that
/v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID> remains accessible after
deletion—replace the existing reference to
/v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID> accordingly and mention the alternative
endpoints.
jason-lynch
left a comment
There was a problem hiding this comment.
This is getting closer! There are still a lot of repeated steps, and a few places where we're doing the same things in different ways. I think we can do more to deduplicate and simplify these documents. Could this just be one document?:
- Disaster recovery
- Restoring Docker Swarm
- Reinitializing the Swarm
- Joining hosts to the new Swarm
- Removing old Swarm nodes
- Restoring the Control Plane
- Reinitializing the Control Plane Etcd cluster
- Updating databases to remove old hosts
- Removing old hosts
- Re-adding hosts
- Updating databases to re-add hosts
- Restoring Docker Swarm
mkdocs.yml
Outdated
There was a problem hiding this comment.
Could you please remove this for now? I think @mmols had suggested that we go through QA and documentation review before we publish this to the site.
There was a problem hiding this comment.
"Registry recreation" and "image rebuild" are exclusive to our development environment, so we should not document them in our user-facing documentation.
There was a problem hiding this comment.
What are the conditions where it becomes necessary to do this?
There was a problem hiding this comment.
We can't write specific instructions for this step because it's very dependent on the user's environment. You can simulate it locally using cp or Lima's snapshot tool, but in a real environment it will look quite different. Here are some examples of different procedures that users might follow:
- https://docs.aws.amazon.com/prescriptive-guidance/latest/backup-recovery/restore.html
- https://techdocs.broadcom.com/us/en/vmware-cis/vsphere/container-storage-plugin/3-0/getting-started-with-vmware-vsphere-container-storage-plug-in-3-0/using-vsphere-container-storage-plug-in/volume-snapshot-and-restore.html
- https://learn.microsoft.com/en-us/azure/backup/backup-azure-arm-restore-vms
- https://docs.cloud.google.com/compute/docs/disks/restore-snapshot
We can't document every possible way of restoring a volume or an instance from a backup, so I wouldn't get more specific than this first sentence.
There was a problem hiding this comment.
This will look better split onto two lines:
| curl -L https://github.com/etcd-io/etcd/releases/download/v3.6.5/etcd-v3.6.5-linux-${ARCH}.tar.gz | tar --strip-components 1 -xz -C /tmp etcd-v3.6.5-linux-${ARCH}/etcdutl | |
| curl -L https://github.com/etcd-io/etcd/releases/download/v3.6.5/etcd-v3.6.5-linux-${ARCH}.tar.gz \ | |
| | tar --strip-components 1 -xz -C /tmp etcd-v3.6.5-linux-${ARCH}/etcdutl |
There was a problem hiding this comment.
This is not important. The order should not matter.
There was a problem hiding this comment.
Healthy client-mode hosts do not need to be removed at all, so users should never be in this position.
There was a problem hiding this comment.
Please remove these Lima-specific instructions. This is something that's specific to our development environment, so these instructions are not meaningful to users,
| If Swarm no longer has the service definition: | ||
|
|
||
| ```bash | ||
| docker stack deploy -c <path-to-stack-yaml> control-plane |
There was a problem hiding this comment.
This is the first time we mention the stack definition in this document. The stack definition should be listed as a pre-requisite so that users know to prepare it ahead of time.
Also - keep in mind that doing this deploy will recreate and scale up all services in the stack. I would just do this one time after I've gotten all of my hosts ready to deploy again.
|
|
||
| --- | ||
|
|
||
| ## Phase 1: Remove the Failed Host |
There was a problem hiding this comment.
Let's be consistent with the full recovery document and say that users should update the database to remove the lost host first, and then remove the host without the force parameter. As I mentioned in a comment on the other doc, I'm leaning towards removing the force parameter altogether from that remove host endpoint because it can lead to problems.
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (2)
docs/disaster-recovery/full-recovery.md (2)
272-277: Fenced code block at line 272 is missing a language specifier (MD040).♻️ Proposed fix
-``` +```text ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ...🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/disaster-recovery/full-recovery.md` around lines 272 - 277, The fenced code block starting with triple-backticks that contains the table header "ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS" is missing a language specifier; update the opening fence to include a language (e.g., change ``` to ```text or ```console) so the block complies with MD040 and renders/linters correctly, leaving the block contents unchanged.
77-80: Indented code blocks inside admonitions should be fenced (MD046).markdownlint flags five locations (lines 79, 151, 184, 254, 551) where 4-space indented blocks are used inside
!!!admonitions. These may not render as code blocks in all Markdown renderers; fenced blocks are the reliable and consistent approach.Example fix for the pattern (apply similarly at all five locations):
♻️ Proposed fix pattern
!!! important "Reset Cluster Membership for Multi-Node Clusters" - If your cluster previously had more than one node, you **must** use `etcdutl snapshot restore` to reset cluster membership. Simply copying the etcd directory will not work. + If your cluster previously had more than one node, you **must** use `etcdutl snapshot restore` + to reset cluster membership. Simply copying the etcd directory will not work.For any nested code snippets inside the admonition, use a fenced block indented with 4 spaces:
!!! note - If Docker Swarm is still functional (only etcd lost quorum), use [Phase 1B](`#phase-1b-majority-quorum-loss`) instead. + If Docker Swarm is still functional (only etcd lost quorum), use [Phase 1B](`#phase-1b-majority-quorum-loss`) instead.Also applies to: 149-152, 182-185, 252-255, 549-552
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/disaster-recovery/full-recovery.md` around lines 77 - 80, Replace the 4-space indented code blocks inside the "!!! important" admonitions with fenced code blocks so they render consistently; locate the admonition starting with '!!! important "Reset Cluster Membership for Multi-Node Clusters"' and the other admonitions at the flagged locations (around lines referenced in the review: 79, 151, 184, 254, 551) and change any nested indented blocks to triple-backtick fenced blocks (```), preserving language hints if present and maintaining the same indentation level for the admonition content.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/disaster-recovery/full-recovery.md`:
- Around line 200-213: Phase 1B currently contradicts itself: "Option A"
instructs running etcdctl snapshot save against a live endpoint (the etcdctl
snapshot save command) even though Phase 1B assumes majority quorum loss and
etcd is inaccessible; either delete the entire "Option A: Create Snapshot from
Existing Data" subsection or rewrite it to clearly define a distinct
sub-scenario (e.g., "Partial Quorum: single surviving node with etcd running")
that is explicitly separate from the majority-quorum-lost path, and in that
rewrite explain preconditions, when to run the etcdctl snapshot save command,
and add a cross-reference to "Option B" (pre-existing snapshot) for the true
quorum-loss recovery path so readers are not misled.
- Line 51: The document contains literal draft comments using C-style "// To Do"
text (e.g. "// TO DO link the installation.md file section here", "// To Do A
snapshot file is optional...", "// To Do The title of this section, "Majority
Quorum Loss"...", "// To Do a snapshot is optional...", "// To Do use etcdutl
snapshot restore", "// To Do we will keep one section in final doc") which
render verbatim and trigger MD046; remove or resolve each by either deleting
them, converting them to proper HTML comments (<!-- TODO: ... -->) or replacing
them with real content or a Markdown TODO/task list or an actual link (e.g.
install section link) so the draft notes no longer appear in the published
output and the lint warning is eliminated.
- Around line 315-346: Phase 1C is incomplete: add concrete Step 1C.4 and 1C.5
and replace the Step 1C.6 TODO with the etcd ForceNewCluster restore steps, then
update the dangling cross-reference and summary table. Specifically, create Step
1C.4 "Recreate container registry on surviving host" that documents rebuilding
or restoring the registry (recreate registry container, load required
images/tags, push images back into registry) and reference the surviving
host/node names (e.g., host-2/n2); add Step 1C.5 "Prepare etcd snapshot and
prerequisites" describing how to fetch/transfer the snapshot, validate it, stop
etcd on recovery nodes and prepare data dirs; replace Step 1C.6 with a full
"ForceNewCluster etcdutl snapshot restore" sequence (use etcdutl snapshot
restore with --force-new-cluster and correct --name/--initial-cluster settings,
restore to expected data-dir and systemd restart steps) and ensure Step 1C.7
remains the verification curl. Finally, update the Common Issues cross-reference
that mentions "Phase 1C, Step 1C.4" and the Summary table range/actions to
include the rebuilt registry and ForceNewCluster restore so documentation is
consistent.
- Around line 367-369: Update the Step 2.1 curl example so the URL uses the same
port variable as the rest of the document: replace the hardcoded ":3000" in the
curl command URL (the line starting with curl -X POST
"http://${RECOVERY_HOST_IP}:3000/...) with the template variable ":${API_PORT}"
to match other API calls and maintain consistency.
- Around line 37-42: The Set Variables block is missing RECOVERY_HOST_ID and
API_PORT which are referenced throughout the guide; update the initial variables
list to include RECOVERY_HOST_ID (the identifier/name of the recovery host used
in commands and SSH/etcd membership steps) and API_PORT (the control-plane API
port used by the recovery host), and ensure the examples show sensible
placeholders like RECOVERY_HOST_ID="<recovery-host-id>" and API_PORT=<api-port>
so all subsequent snippets (references to RECOVERY_HOST_ID and API_PORT) work
without reader guesswork; verify mentions in functions/sections that reference
RECOVERY_HOST_ID and API_PORT still match the variable names exactly.
- Around line 130-133: Remove the trailing space after the backslash in the curl
line so the shell line continuation works (ensure the backslash is the last
character before the newline in the "curl -L
https://github.com/etcd-io/etcd/releases/download/v3.6.5/etcd-v3.6.5-linux-${ARCH}.tar.gz
\ | tar --strip-components 1 -xz -C /tmp etcd-v3.6.5-linux-${ARCH}/etcdutl"
snippet) or collapse into a single line; also update the hardcoded release tag
from v3.6.5 to v3.6.8 or replace it with a VERSION variable so the etcd download
URL and the etcd-v... names stay consistent and current, and verify the
subsequent sudo mv/ chmod lines still reference the same etcdutl filename.
---
Nitpick comments:
In `@docs/disaster-recovery/full-recovery.md`:
- Around line 272-277: The fenced code block starting with triple-backticks that
contains the table header "ID HOSTNAME STATUS
AVAILABILITY MANAGER STATUS" is missing a language specifier; update the
opening fence to include a language (e.g., change ``` to ```text or ```console)
so the block complies with MD040 and renders/linters correctly, leaving the
block contents unchanged.
- Around line 77-80: Replace the 4-space indented code blocks inside the "!!!
important" admonitions with fenced code blocks so they render consistently;
locate the admonition starting with '!!! important "Reset Cluster Membership for
Multi-Node Clusters"' and the other admonitions at the flagged locations (around
lines referenced in the review: 79, 151, 184, 254, 551) and change any nested
indented blocks to triple-backtick fenced blocks (```), preserving language
hints if present and maintaining the same indentation level for the admonition
content.
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (2)
docs/disaster-recovery/full-recovery.md (2)
330-332: Step 1C.7 lacks concrete commands for registry recreation — the hardest step in Phase 1C.This is the only recovery path for operators who have lost both etcd and Swarm quorum. The step instructs them to "recreate the registry and build/push the Control Plane image" but provides no commands, and the
ARCHIVE_VERSIONvariable defined in Set Variables (line 50) is clearly intended for exactly these commands. An operator in a production incident cannot act on prose alone.Consider adding a code block with the docker registry/image commands (e.g.,
docker service createfor the registry,docker pull/docker tag/docker pushusing${ARCHIVE_VERSION}), or at minimum link to a dedicated registry setup reference.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/disaster-recovery/full-recovery.md` around lines 330 - 332, Step 1C.7 is missing the concrete commands to recreate a container registry and push the Control Plane image using the ARCHIVE_VERSION variable; update the prose to include an explicit sequence showing how to create the registry service and then pull/tag/push the Control Plane image using ${ARCHIVE_VERSION} (or, if you prefer not to inline commands, add a clear link to a registry setup reference), and ensure you reference the ARCHIVE_VERSION variable defined earlier so operators can copy-paste the exact commands for "recreate registry" and "build/push Control Plane image" before starting the Control Plane service in Step 1C.7.
461-469: Step 3A.2 doesn't distinguish server-mode from client-mode hosts, unlike the partial-recovery guide.
partial-recovery.mdStep 3A.2 has separate blocks for server-mode (clearetcd+certificates+generated.config.json) and client-mode (clear onlygenerated.config.json). This guide applies the server-mode commands to all hosts unconditionally. Whilerm -rfon absent directories is harmless, aligning with the partial-recovery guide's pattern avoids operator confusion when working across both docs.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@docs/disaster-recovery/full-recovery.md` around lines 461 - 469, Step 3A.2 currently applies the server-mode cleanup (removing ${PGEDGE_DATA_DIR}/etcd, /certificates, and generated.config.json) to all hosts; update this section to mirror partial-recovery.md by distinguishing server-mode vs client-mode hosts: for server-mode instruct removal of etcd, certificates, and generated.config.json, and for client-mode instruct removal of only generated.config.json (use the same labels "server-mode" and "client-mode" and reference the PGEDGE_DATA_DIR variable and filenames to make the distinction clear).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/disaster-recovery/full-recovery.md`:
- Around line 48-51: The variable ARCHIVE_VERSION is declared but unused; either
remove the ARCHIVE_VERSION="<control-plane-version>" line from the env block or
add the registry-rebuild commands that consume it (e.g., the
registry-rebuild/registry restore commands referenced in Step 1C.7) so the guide
actually uses ${ARCHIVE_VERSION}; update the examples and any subsequent
commands to reference ARCHIVE_VERSION in place of a hard-coded version to make
the variable meaningful.
- Around line 256-261: The fenced code block showing the `docker node ls` output
is missing a language specifier which triggers MD040; update the triple-backtick
fence for that block in full-recovery.md to include a language tag such as text
(e.g., change ``` to ```text) so the block is explicitly marked and linting
passes; locate the block containing the `ID HOSTNAME STATUS AVAILABILITY MANAGER
STATUS` table and add the language specifier to the opening fence.
In `@docs/disaster-recovery/partial-recovery.md`:
- Around line 105-110: Add a language specifier to the fenced code block
containing the `docker node ls` example (the block showing ID, HOSTNAME, STATUS,
AVAILABILITY, MANAGER STATUS) to silence MD040 and improve rendering; update the
opening fence from ``` to ```text (or ```console) so the block becomes a labeled
code block.
- Line 403: The step label is incorrect: remove the "(force remove)" note from
Step 1.1 and add it to the Step 1.2 text so the host deletion action is
documented under Step 1.2 (leave Step 1.1 describing only database updates);
update the sentence for Step 1.2 to read something like "Step 1.2 — force remove
(host record deletion) and wait for the task to finish before rejoining" so the
force-remove action is clearly associated with Step 1.2.
---
Duplicate comments:
In `@docs/disaster-recovery/partial-recovery.md`:
- Around line 88-92: The doc uses the per-host task endpoint
/v1/hosts/<LOST_HOST_ID>/tasks/<TASK_ID> to poll a removal task on a host that
was just deleted; since that host may become unavailable, update the example to
query a stable API host instead—either call the cluster-level endpoint
/v1/tasks/<TASK_ID> (preferred) or replace <LOST_HOST_ID> with <HEALTHY_HOST> so
the curl targets a healthy node (/v1/hosts/<HEALTHY_HOST>/tasks/<TASK_ID>), and
update the curl example accordingly.
---
Nitpick comments:
In `@docs/disaster-recovery/full-recovery.md`:
- Around line 330-332: Step 1C.7 is missing the concrete commands to recreate a
container registry and push the Control Plane image using the ARCHIVE_VERSION
variable; update the prose to include an explicit sequence showing how to create
the registry service and then pull/tag/push the Control Plane image using
${ARCHIVE_VERSION} (or, if you prefer not to inline commands, add a clear link
to a registry setup reference), and ensure you reference the ARCHIVE_VERSION
variable defined earlier so operators can copy-paste the exact commands for
"recreate registry" and "build/push Control Plane image" before starting the
Control Plane service in Step 1C.7.
- Around line 461-469: Step 3A.2 currently applies the server-mode cleanup
(removing ${PGEDGE_DATA_DIR}/etcd, /certificates, and generated.config.json) to
all hosts; update this section to mirror partial-recovery.md by distinguishing
server-mode vs client-mode hosts: for server-mode instruct removal of etcd,
certificates, and generated.config.json, and for client-mode instruct removal of
only generated.config.json (use the same labels "server-mode" and "client-mode"
and reference the PGEDGE_DATA_DIR variable and filenames to make the distinction
clear).
| ```bash | ||
| RECOVERY_HOST_EXTERNAL_IP="<recovery-host-external-ip>" # e.g., 192.168.105.4 | ||
| ARCHIVE_VERSION="<control-plane-version>" # e.g., 0.6.2 | ||
| ``` |
There was a problem hiding this comment.
ARCHIVE_VERSION is defined but never referenced in the document.
No code block anywhere in the guide uses ${ARCHIVE_VERSION}. Either add the concrete registry-rebuild commands that consume it (see also Step 1C.7 below), or remove it from the variable block to avoid confusing operators.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/disaster-recovery/full-recovery.md` around lines 48 - 51, The variable
ARCHIVE_VERSION is declared but unused; either remove the
ARCHIVE_VERSION="<control-plane-version>" line from the env block or add the
registry-rebuild commands that consume it (e.g., the registry-rebuild/registry
restore commands referenced in Step 1C.7) so the guide actually uses
${ARCHIVE_VERSION}; update the examples and any subsequent commands to reference
ARCHIVE_VERSION in place of a hard-coded version to make the variable
meaningful.
| ``` | ||
| ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS | ||
| 4aoqjp3q8jcny4kec5nadcn6x lima-host-1 Down Active Unreachable | ||
| 959g9937i62judknmr40kcw9r * lima-host-2 Ready Active Leader | ||
| l0l51d890edg3f0ccd0xppw06 lima-host-3 Down Active Unreachable | ||
| ``` |
There was a problem hiding this comment.
Add a language specifier to the fenced code block.
Same MD040 issue as in partial-recovery.md — the docker node ls output block lacks a language tag.
✏️ Proposed fix
-```
+```text
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
4aoqjp3q8jcny4kec5nadcn6x lima-host-1 Down Active Unreachable
959g9937i62judknmr40kcw9r * lima-host-2 Ready Active Leader
l0l51d890edg3f0ccd0xppw06 lima-host-3 Down Active Unreachable🧰 Tools
🪛 markdownlint-cli2 (0.21.0)
[warning] 256-256: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/disaster-recovery/full-recovery.md` around lines 256 - 261, The fenced
code block showing the `docker node ls` output is missing a language specifier
which triggers MD040; update the triple-backtick fence for that block in
full-recovery.md to include a language tag such as text (e.g., change ``` to
```text) so the block is explicitly marked and linting passes; locate the block
containing the `ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS` table and add
the language specifier to the opening fence.
| ``` | ||
| ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS | ||
| 4aoqjp3q8jcny4kec5nadcn6x * lima-host-1 Ready Active Leader | ||
| 959g9937i62judknmr40kcw9r lima-host-2 Ready Active Reachable | ||
| l0l51d890edg3f0ccd0xppw06 lima-host-3 Down Active Unreachable | ||
| ``` |
There was a problem hiding this comment.
Add a language specifier to the fenced code block.
The docker node ls example output uses a plain fence, which triggers MD040. Use text or console to silence the warning and improve rendering.
✏️ Proposed fix
-```
+```text
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS
4aoqjp3q8jcny4kec5nadcn6x * lima-host-1 Ready Active Leader
959g9937i62judknmr40kcw9r lima-host-2 Ready Active Reachable
l0l51d890edg3f0ccd0xppw06 lima-host-3 Down Active Unreachable📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| ``` | |
| ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS | |
| 4aoqjp3q8jcny4kec5nadcn6x * lima-host-1 Ready Active Leader | |
| 959g9937i62judknmr40kcw9r lima-host-2 Ready Active Reachable | |
| l0l51d890edg3f0ccd0xppw06 lima-host-3 Down Active Unreachable | |
| ``` |
🧰 Tools
🪛 markdownlint-cli2 (0.21.0)
[warning] 105-105: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/disaster-recovery/partial-recovery.md` around lines 105 - 110, Add a
language specifier to the fenced code block containing the `docker node ls`
example (the block showing ID, HOSTNAME, STATUS, AVAILABILITY, MANAGER STATUS)
to silence MD040 and improve rendering; update the opening fence from ``` to
```text (or ```console) so the block becomes a labeled code block.
There was a problem hiding this comment.
Wrong step reference — "(force remove)" belongs to Step 1.2, not Step 1.1.
Step 1.1 updates databases; the host record deletion (force remove) happens in Step 1.2.
✏️ Proposed fix
-**Solution:** Complete Step 1.1 (force remove) and wait for the task to finish before rejoining.
+**Solution:** Complete Steps 1.1 and 1.2 (remove databases and force-remove the host) and wait for the task to finish before rejoining.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| **Solution:** Complete Step 1.1 (force remove) and wait for the task to finish before rejoining. | |
| **Solution:** Complete Steps 1.1 and 1.2 (remove databases and force-remove the host) and wait for the task to finish before rejoining. |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/disaster-recovery/partial-recovery.md` at line 403, The step label is
incorrect: remove the "(force remove)" note from Step 1.1 and add it to the Step
1.2 text so the host deletion action is documented under Step 1.2 (leave Step
1.1 describing only database updates); update the sentence for Step 1.2 to read
something like "Step 1.2 — force remove (host record deletion) and wait for the
task to finish before rejoining" so the force-remove action is clearly
associated with Step 1.2.
Summary
Adds disaster recovery documentation: an index plus two guides—Partial Failure Recovery (Quorum Intact) and Complete Failure Recovery (No Quorum)—so operators can recover the Control Plane when hosts or quorum are lost.
Changes
Testing
Checklist
changes/unreleased/Added-disaster-recovery-docs.yaml)Notes for Reviewers
docs/PLAT-313-314-etcd-swarm-disaster-recovery— PLAT-313 and PLAT-314PGEDGE_ETCD_SERVER__FORCE_NEW_CLUSTER=true; worth a careful read.ChangeLog
Entry added in changes/unreleased/Added-disaster-recovery-docs.yaml for the disaster recovery documentation.