Fix QEMU startup cleanup for failed launches #197
Firetiger deploy monitoring skipped. This PR didn't match the auto-monitor filter configured on your GitHub connection.
Reason: PR modifies QEMU hypervisor cleanup logic, not kernel API endpoints or Temporal workflows as specified in the filter. To monitor this PR anyway, reply with `@firetiger monitor this`.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit e49b01c.
I'm working on a monitoring plan for this PR. You can follow the progress here. Tag me in a comment at any point to steer the plan. When this PR merges, I'll watch for deployments and use that as the signal to start monitoring.
I'll monitor this QEMU startup cleanup refactor. The changes introduce proper process lifecycle management with faster failure detection when QEMU exits early. What I'm watching:
The change is low-to-medium risk since Hypeman handles ~7.5% of serverless spawn volume. The unit tests look solid, covering both cleanup and early exit detection. I'll post updates as the deployment progresses.

Summary
`qemu.sock` on any failed attempt or retry
Testing
Context
This addresses the startup-path bug behind instances getting stuck in `Unknown` after QEMU launch retries leave behind a stale monitor socket and stale PID metadata.
Note
Medium Risk
Touches QEMU process lifecycle management (start/kill/wait) and socket readiness logic; mistakes here could cause hangs or kill the wrong process, but the change is scoped and covered by focused tests.
Overview
Improves QEMU startup failure handling by tracking the launched process and ensuring that it is killed and reaped (avoiding zombies) and that `qemu.sock` is removed on failed attempts.

Startup now fails fast if QEMU exits before the QMP socket becomes reachable (`waitForSocketOrExit`) and also checks for early exit while retrying QMP client creation. Adds unit tests to verify that cleanup reaps exited processes and removes stale sockets, and that socket-wait returns quickly when the process dies.