Skip to content

Add --from-snapshot flag to create droplets from snapshots#53

Open
gwpl wants to merge 5 commits intotrailofbits:mainfrom
VariousForks:i52-create-from-snapshot
Open

Add --from-snapshot flag to create droplets from snapshots#53
gwpl wants to merge 5 commits intotrailofbits:mainfrom
VariousForks:i52-create-from-snapshot

Conversation

@gwpl
Copy link
Copy Markdown
Contributor

@gwpl gwpl commented Mar 23, 2026

Summary

AI Agent with Greg: We wanted to clone a perfectly configured droplet 10 times — like a VM templating system, but with dropkit ergonomics. Turns out dropkit create always injects cloud-init, which on a snapshot triggers user creation failures, .zshrc carpet-bombing, and an unconditional reboot. Now there's a --from-snapshot flag that knows when to leave well enough alone. 🧬

Closes #52

  • Add --from-snapshot <snapshot-id> flag to dropkit create
  • When used: calls create_droplet_from_snapshot() (no user_data), skips cloud-init rendering and monitoring
  • Mutually exclusive with --image (clear error if both provided)
  • All post-creation steps still work: wait for active, SSH config, project assignment, Tailscale setup

Usage: clone workflow

# Snapshot your golden image (via DO console or doctl)
# Then clone it with dropkit:
dropkit create worker-1 --from-snapshot 12345678 --size s-4vcpu-8gb
dropkit create worker-2 --from-snapshot 12345678 --size s-4vcpu-8gb

# Or in a loop:
for i in $(seq 1 10); do
  dropkit create "worker-$i" --from-snapshot 12345678 --region ams3
done

Why cloud-init must be skipped

The current cloud-init template is not idempotent — re-running on a snapshot causes:

  • users: directive fails (user already exists)
  • write_files: overwrites .zshrc (loses customizations)
  • runcmd: ends with unconditional reboot

DigitalOcean assigns a new droplet ID to snapshot-based instances, so cloud-init detects a "first boot" and re-runs everything. The --from-snapshot flag avoids this by not sending user_data at all.

See #52 for discussion of alternative approaches (idempotent template, --no-cloud-init flag, dedicated clone command).

Test plan

  • uv run ruff check dropkit/main.py — passed
  • uv run ruff format --check dropkit/main.py — passed
  • uv run ty check dropkit/main.py — passed
  • uv run pytest tests/ -v — 288 passed, 31% coverage (above 29% minimum)
  • Manual test: dropkit create test --from-snapshot <id> — verify droplet created, SSH config added, no cloud-init
  • Manual test: dropkit create test --from-snapshot <id> --image ubuntu-25-10-x64 — verify mutual exclusion error
  • Manual test: dropkit create test (no flag) — verify normal cloud-init flow unchanged

🤖 Generated with Claude Code — your AI that believes in snapshot discipline: clone responsibly, cloud-init sparingly

Co-Authored-By: Claude Opus 4.6 (1M context) noreply@anthropic.com

@gwpl gwpl requested a review from ret2libc as a code owner March 23, 2026 22:42
Copy link
Copy Markdown
Collaborator

@ret2libc ret2libc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any way to consolidate this with the existing "wake" code that does something very similar?

@gwpl
Copy link
Copy Markdown
Contributor Author

gwpl commented Mar 24, 2026

Always eager, Greg's AI Assistant reporting in! 🧬

Good call — there's definitely overlap between --from-snapshot in create and the wake command. Both use api.create_droplet_from_snapshot() and share post-creation steps (wait for active, SSH config, Tailscale setup).

The key differences that kept us from directly reusing wake:

Aspect wake create --from-snapshot
Snapshot source Looks up by dropkit-<name> naming convention + owner tag User provides arbitrary snapshot ID directly
Size/region Extracted from snapshot metadata tags (size:, regions) User specifies (or uses interactive prompts/defaults)
Post-restore Prompts to delete the snapshot No deletion (it's a template, not a hibernation artifact)
Tailscale Only if tailscale-lockdown tag present on snapshot Standard config-based decision

That said, we could absolutely extract the shared plumbing (create-from-snapshot → wait-for-active → SSH-config → optional-Tailscale) into a helper function that both wake and create --from-snapshot call. That would reduce the duplication to just the "how to find/configure the snapshot" preamble.

Want us to refactor it that way in this PR, or would you prefer to land the feature first and consolidate in a follow-up?

@gwpl
Copy link
Copy Markdown
Contributor Author

gwpl commented Mar 24, 2026

Greg asked us to go ahead and refactor to consolidate the shared plumbing between create --from-snapshot and wake. Pushing an update shortly!

@gwpl
Copy link
Copy Markdown
Contributor Author

gwpl commented Mar 24, 2026

Always eager, Greg's AI Assistant back with the refactor! 🔧

TL;DR

Extracted _create_from_snapshot_and_setup() helper that encapsulates the shared plumbing between create --from-snapshot and wake, eliminating ~90 lines of duplication.

What the helper does

_create_from_snapshot_and_setup(api, config, name, region, size, snapshot_id, tags_list, username, tailscale_enabled, verbose) handles:

  1. api.create_droplet_from_snapshot() call
  2. Wait for droplet to become active
  3. Extract public IP address
  4. Add SSH config entry (if config.ssh.auto_update)
  5. Optional Tailscale setup (controlled by tailscale_enabled param)

Returns (active_droplet, ip_address, tailscale_ip).

What remains command-specific

create --from-snapshot:

  • Project assignment (--project flag)
  • Cloud-init status tracking and summary messages
  • Passes tailscale_enabled based on --no-tailscale flag and config

wake:

  • Snapshot discovery preamble (find hibernated snapshot, extract region/size/tags)
  • Passes tailscale_enabled=False to helper, then handles Tailscale separately (only if was_tailscale_locked tag was on snapshot, with 10s SSH-readiness sleep)
  • Snapshot deletion prompt
  • Wake-specific success message

All 288 tests pass, ruff/ty clean. No behavioral changes.

Does this align with what you had in mind @ret2libc?

@gwpl
Copy link
Copy Markdown
Contributor Author

gwpl commented Apr 12, 2026

AI Assistant: Friendly ping! 👋 The refactor extracting _create_from_snapshot_and_setup() (consolidating shared code between create --from-snapshot and wake) has been up since Mar 24. Let us know if there's anything else you'd like adjusted!

gwpl and others added 2 commits April 12, 2026 21:19
Support creating droplets from existing snapshots without cloud-init.
This enables the "golden image" workflow: configure one droplet, snapshot
it, then spin up N identical clones.

When --from-snapshot <id> is used:
- Uses create_droplet_from_snapshot() API (no user_data sent)
- Skips cloud-init rendering and completion monitoring
- Still performs: wait for active, SSH config, project assignment,
  optional Tailscale setup

Cloud-init is skipped because the current template is not idempotent —
re-running it on a snapshot causes user creation failures, .zshrc
overwrites, and an unconditional reboot.  See issue trailofbits#52 for discussion
of alternative approaches.

(for Github WebUI issue linking: Closes trailofbits#52 )

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…code

Both `create --from-snapshot` and `wake` perform the same post-creation
plumbing: API call, wait-for-active, IP extraction, SSH config, and
optional Tailscale setup. Extract this into a shared helper to eliminate
~90 lines of duplication.

* New `_create_from_snapshot_and_setup()` near other helper functions
* `create --from-snapshot` delegates to helper, then handles project
  assignment and summary messages
* `wake` delegates to helper with tailscale_enabled=False, then handles
  its own Tailscale re-setup logic (10s sleep, was_tailscale_locked tag)
  and snapshot deletion prompt

No behavioral changes -- both commands produce identical output.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@gwpl gwpl force-pushed the i52-create-from-snapshot branch from d52751e to 092a788 Compare April 12, 2026 19:19
Copy link
Copy Markdown
Collaborator

@ret2libc ret2libc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patience on the review! I think I'd like to see less duplicated code, because we have a lot of different paths and ideally the differences would mainly be in the UI rather than in the logic of how to turn on/off/wake/create instances.

Comment thread dropkit/main.py
Comment thread dropkit/main.py
Replace _create_from_snapshot_and_setup with a more generic _post_create_setup
helper that handles post-creation steps (validate ID, wait for active, extract
IP, SSH config, optional Tailscale) for ALL three droplet creation paths:

- create (base image): API call → _post_create_setup → cloud-init → Tailscale
- create --from-snapshot: API call → _post_create_setup (with Tailscale)
- wake: API call → _post_create_setup → Tailscale (separate, conditional)

Also consolidates project assignment (was duplicated in both create paths).

The API call itself stays in each command function (it's the only truly
different part), while the shared plumbing lives in one place.

Addresses ret2libc's review: "I'd like to see less duplicated code, because
we have a lot of different paths and ideally the differences would mainly be
in the UI rather than in the logic."

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@gwpl
Copy link
Copy Markdown
Contributor Author

gwpl commented Apr 13, 2026

AI Assistant consulting with Greg — we're like git blame but friendlier and with better jokes! 😄

Refactoring done (7e461d9)

Took your architectural direction to heart. The key change: replaced the snapshot-only _create_from_snapshot_and_setup with a generic _post_create_setup(droplet) that ALL three creation paths now share.

Before: Each path had its own copy of wait-for-active → IP extraction → SSH config → Tailscale (100+ duplicated lines).

After: Each command function only handles what's unique to it:

Path Unique (in command) Shared (_post_create_setup)
create (base image) Cloud-init render + API call, cloud-init monitoring
create --from-snapshot Snapshot API call
wake Snapshot discovery, Tailscale re-setup, snapshot deletion

Project assignment was also consolidated — it was duplicated between the two create paths, now it's a single block after the if/else.

Result: -75 net lines (59 added, 134 removed). All 288 tests pass.

This aligns with the vision of "differences mainly in the UI rather than in the logic" — each command gathers its parameters and makes its API call, then the shared helper does the mechanical plumbing.

Comment thread dropkit/main.py Outdated
Per ret2libc's feedback: Tailscale setup timing is a UI concern that
differs per creation path, so it shouldn't live inside the shared helper.
Moved setup_tailscale() calls into each branch explicitly:

- Snapshot branch in create: runs Tailscale immediately after helper
  (no cloud-init to wait for)
- Base image branch in create: runs Tailscale after cloud-init completes
  (unchanged behavior, but now removed dead tailscale_enabled=False arg)
- wake: runs Tailscale separately with was_tailscale_locked check (already did)

The helper is now purely mechanical plumbing: validate ID → wait active
→ IP extraction → SSH config. Returns (active_droplet, ip_address).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread dropkit/main.py Outdated
Per ret2libc's feedback:
- Pull setup_tailscale out of both if/else branches into a single call
  after the if/else. The unified condition is cloud_init_done and
  ip_address (snapshot path sets cloud_init_done=True unconditionally,
  base-image path only when cloud-init succeeded).
- Trim unnecessary docstring paragraph explaining why Tailscale is not
  in _post_create_setup — the bullet list is sufficient.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@gwpl
Copy link
Copy Markdown
Contributor Author

gwpl commented Apr 21, 2026

AI Assistant consulting with Greg — checking in after the refactor-ratchet cooled down! 🛠️

Friendly ping: the latest round of feedback from Apr 17 is addressed (trimmed docstring, unified setup_tailscale outside the if/else). PR is CLEAN / MERGEABLE, CI green across 3.11/3.12/3.13, all 288 tests pass.

Let us know if there's anything else to tidy up, or if it's ready to land! 🚀

Copy link
Copy Markdown
Collaborator

@ret2libc ret2libc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3. Run the manual tests / add at least minimal automated coverage

  The three manual checkboxes in the PR description are still unchecked, and there's no automated coverage for the new code path. At minimum, please run the listed manual tests before merge. A few cheap pytest cases would be even better
   — happy with mocks here, e.g.:

  - --from-snapshot + --image exits with code 1 and the expected message
  - --from-snapshot calls create_droplet_from_snapshot and does not call render_cloud_init / create_droplet
  - --from-snapshot + --no-tailscale does not call setup_tailscale

Comment thread dropkit/main.py

# Get image (interactive if not provided)
if image is None:
# --from-snapshot and --image are mutually exclusive
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Move the mutual-exclusion check above the interactive prompts

Right now the --from-snapshot / --image conflict is detected after the region and size prompts run. A user passing both flags gets prompted for region and size before being told their flags conflict.

  # Current location (after the size prompt):
  if from_snapshot is not None and image is not None:
      console.print("[red]Error: --from-snapshot and --image are mutually exclusive[/red]")
      raise typer.Exit(1)

Hoist this check up to immediately after argument resolution (before the region/size/image prompts) so flag-conflict errors fire instantly.

Comment thread dropkit/main.py
name=name,
region=region,
size=size,
image=image or "",
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Replace image or "" with proper type narrowing

droplet = api.create_droplet(
...
image=image or "",
...
)

The guard above already enforces image is not None on this branch (from_snapshot is None and image is None → exit). Sending "" to the DO API would round-trip to a less actionable error. Either:

assert image is not None # narrowed by the guard above

or restructure the guard so ty narrows the type. Let's keep the failure mode loud and local instead of papering over it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support creating droplets from snapshots (clone workflow)

2 participants