
Add standby compression start delay #184

Open
sjmiller609 wants to merge 15 commits into main from codex/standby-compression-delay


@sjmiller609
Collaborator

@sjmiller609 sjmiller609 commented Apr 4, 2026

Summary

  • add a standby-only compression delay override on POST /instances/{id}/standby and a per-instance default in snapshot_policy
  • keep delayed standby compression jobs cancelable before start and distinguish pending-delay skips from active compression cancellation
  • add metrics, logs, traces, OpenAPI updates, and tests for the new standby compression delay behavior

Testing

  • go test ./lib/instances
  • go test ./cmd/api/api -run 'Test(CreateInstance_MapsStandbyCompressionDelayInSnapshotPolicy|CreateInstance_InvalidStandbyCompressionDelayInSnapshotPolicy|InstanceToOAPI_EmitsStandbyCompressionDelayInSnapshotPolicy|StandbyInstance_MapsCompressionDelay|StandbyInstance_InvalidCompressionDelay|StandbyInstance_InvalidRequest)$'

Notes

  • go test ./cmd/api/api still hits unrelated environment-dependent volume tests on this machine because mkfs.ext4 is not available in $PATH.
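
One common way to keep environment-dependent tests like these from failing on machines without `mkfs.ext4` is to probe `$PATH` first. The sketch below uses the standard library's `exec.LookPath`; the helper name `toolAvailable` is hypothetical and nothing here is claimed to be in the repository.

```go
package main

import (
	"fmt"
	"os/exec"
)

// toolAvailable reports whether an executable with the given name can be
// found in $PATH. Hypothetical helper, for illustration only.
func toolAvailable(name string) bool {
	_, err := exec.LookPath(name)
	return err == nil
}

func main() {
	fmt.Println("mkfs.ext4 available:", toolAvailable("mkfs.ext4"))
}
```

In a test, a guard such as `if !toolAvailable("mkfs.ext4") { t.Skip("mkfs.ext4 not in $PATH") }` would turn the failure into an explicit skip.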

Note

Medium Risk
Changes standby snapshot compression scheduling and cancellation semantics (including restart recovery) and updates networking iptables rule management to wait on the xtables lock; these touch core instance lifecycle and host networking paths but are gated behind new optional fields and covered by tests.

Overview
Adds a standby-only snapshot compression delay that can be set per request via POST /instances/{id}/standby (compression_delay) and as a per-instance default via snapshot_policy.standby_compression_delay (OpenAPI + API/domain mappings + validation).

Implements delayed compression jobs in the instance manager. A job is either pending its delay or running, and a job canceled before it starts is recorded as skipped. A PendingStandbyCompression plan is persisted in instance metadata for restart recovery and cleared on restore/snapshot operations. Preemption metrics are now recorded only when active compression is interrupted.

Extends observability with new snapshot compression metrics (wait duration, active vs pending gauges, skipped result) and additional logs/traces. Also hardens targeted unit/integration tests (guest exec retries and a readiness probe) and makes iptables operations and tests more reliable by passing iptables -w 5, which waits up to 5 seconds for the xtables lock instead of failing immediately.

Reviewed by Cursor Bugbot for commit 17cb581. Bugbot is set up for automated code reviews on this repo.

@github-actions

github-actions bot commented Apr 4, 2026

✱ Stainless preview builds

This PR will update the hypeman SDKs with the following commit message.

feat: Add standby compression start delay

Edit this comment to update it. It will appear in the SDK's changelogs.

hypeman-typescript studio · code · diff

Your SDK build had at least one "note" diagnostic, but this did not represent a regression.
generate ✅ build ✅ lint ✅ test ✅

npm install https://pkg.stainless.com/s/hypeman-typescript/6680c47c46cd679b3f325af59349cc22edc8b398/dist.tar.gz

hypeman-openapi studio · code · diff

Your SDK build had at least one "note" diagnostic, but this did not represent a regression.
generate ✅

hypeman-go studio · code · diff

Your SDK build had at least one "note" diagnostic, but this did not represent a regression.
generate ✅ build ✅ lint ✅ test ✅

go get github.com/stainless-sdks/hypeman-go@4774f3fe7e08331e7e3d9ef3a77c748aabaa96e9

This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
If you push custom code to the preview branch, re-run this workflow to update the comment.
Last updated: 2026-04-15 15:39:17 UTC

sjmiller609

This comment was marked as resolved.

@sjmiller609 sjmiller609 marked this pull request as ready for review April 8, 2026 13:12
Comment thread lib/instances/fork.go Outdated

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 080050b.

Comment thread lib/instances/snapshot_compression.go
Comment thread lib/instances/network_test.go
Comment thread lib/network/bridge_linux.go
Comment thread skills/test-agent/agents/test-agent/NOTES.md
@sjmiller609 sjmiller609 requested a review from hiroTamada April 8, 2026 14:56
@sjmiller609
Collaborator Author

Waiting until data or a use case justifies this change.

@sjmiller609 sjmiller609 marked this pull request as draft April 9, 2026 19:09
@sjmiller609 sjmiller609 marked this pull request as ready for review April 15, 2026 15:38
@firetiger-agent

I'll monitor this standby compression delay feature for Hypeman. The change adds timing parameters for delaying compression after standby, with persistence across server restarts.

Key things I'll watch:

  • Hypeman invocation spawn success rate — baseline is ~99% for prod-jfk-hypeman-0/1. Any significant drop indicates the new compression state management may be affecting instance operations.
  • Compression job leaks — the new pending state and recovery logic could fail to clean up jobs properly. I'll watch for growing pending counts without matching completions.
  • Restore latency — changes to ensureSnapshotMemoryReady could slow instance restores if the new preemption logic has issues.

The new metrics (hypeman_snapshot_compression_wait_duration_seconds, hypeman_snapshot_compression_pending_total) will provide direct visibility once they appear in telemetry. I'll post updates as the deployment progresses.
