Skip to content

Fix race condition in ProcessesSync causing SSH key mismatch#5033

Merged
jochenehret merged 1 commit intomainfrom
fix-ssh-key-mismatch-race-condition
Apr 17, 2026
Merged

Fix race condition in ProcessesSync causing SSH key mismatch#5033
jochenehret merged 1 commit intomainfrom
fix-ssh-key-mismatch-race-condition

Conversation

@jochenehret
Copy link
Copy Markdown
Contributor

@jochenehret jochenehret commented Apr 15, 2026

A short explanation of the proposed change:

The ProcessSync job could update an LRP with stale SSH route data from a previous version when the process version changed between the initial scan and the batch execution. As a result, the LRP has a different RSA key pair for the diego-sshd daemon and the diego-ssh route. "cf ssh" is broken until the app is manually restarted.

Fix: Before executing desire/update, verify the process version still matches what was recorded during the initial scan. Skip if mismatched - the next sync cycle will handle it correctly.

An explanation of the use cases your change solves

Detailed diagram of the race condition:

sequenceDiagram
participant CC as Cloud Controller<br/>(ccdb)
participant PS as ProcessesSync<br/>cloud_controller_clock
participant BBS as Diego BBS

Note over CC,BBS: Initial state: process p1-v1, keypair k1, timestamp t1

PS->>BBS: fetch_desired_lrps → [p1-v1, annotation=t1]
PS->>CC: fetch_processes → [p1-v1, updated_at=t1]
Note over PS: annotations match → nothing to do

Note over CC,BBS: User restarts app (1st restart)
CC->>CC: process: guid=p1, version=v1→v2, updated_at=t2a
CC->>BBS: delete desired LRP p1-v1
CC->>BBS: desire_lrp p1-v2 (keypair k2, annotation=t2b)

Note over PS: ProcessesSync starts again
PS->>BBS: fetch_desired_lrps → [p1-v2, annotation=t2b]
PS->>CC: fetch_processes → [p1-v2, updated_at=t2a]
Note over PS: t2a ≠ t2b → add p1 to update list
Note over PS: The timestamp mismatch is already fixed with ccng PR 4935
Note over PS: 💤 sleep 15s (simulated delay)

Note over CC,BBS: User restarts app (2nd restart) during sleep
CC->>CC: process: guid=p1, version=v2→v3, updated_at=t3a
CC->>BBS: delete desired LRP p1-v2
CC->>BBS: desire_lrp p1-v3 (keypair k3, annotation=t3b)
Note over BBS: BBS now has p1-v3 with k3 in route AND action ✓

Note over PS: sleep ends → resume update_lrps
PS->>CC: reload process by id p1 → gets version v3 (updated_at=t3a)
PS->>BBS: update_desired_lrp p1-v3<br/>routes: k2 (copied from stale p1-v2 snapshot!)<br/>action: not updatable → stays k3

Note over BBS: ❌ BBS now has p1-v3:<br/>diego-ssh route → k2<br/>diego-sshd action → k3<br/>MISMATCH!

Note over CC,BBS: cf ssh fails until app is restarted again
Loading

The key mechanics that make this work:

  • Line 88: update_lrps reloads the process from ccdb by id (gets v3) — so it uses the current process version as the target LRP guid
  • Line 92: but it calls build_app_lrp_update(existing_lrp) where existing_lrp is the stale snapshot fetched at the start of the sync cycle (p1-v2 with k2)
  • build_app_lrp_update copies the SSH route from that stale existing_lrp → k2 goes into the update
  • BBS UpdateDesiredLRP cannot touch run_info (action) → k3 stays
  • Result: route has k2, sshd has k3

Links to any other associated PRs

#4935

  • I have reviewed the contributing guide

  • I have viewed, signed, and submitted the Contributor License Agreement

  • I have made this pull request to the main branch

  • I have run all the unit tests using bundle exec rake

  • I have run CF Acceptance Tests -> ran CATs against cf-deployment v54.13.0

The sync could update an LRP with stale SSH route data from a previous
version when the process version changed between the initial scan and
the batch execution. As a result, the LRP has a different RSA key pair
for the diego-sshd daemon and the diego-ssh route. "cf ssh" is broken
until the app is manually restarted.

Fix: Before executing desire/update, verify the process version still
matches what was recorded during the initial scan. Skip if mismatched -
the next sync cycle will handle it correctly.

Co-authored-by: Jochen Ehret <jochen.ehret@sap.com>
@jochenehret jochenehret marked this pull request as ready for review April 15, 2026 14:50
@jochenehret jochenehret requested a review from a team April 15, 2026 14:50
@jochenehret jochenehret merged commit 07c8edf into main Apr 17, 2026
11 checks passed
ari-wg-gitbot added a commit to cloudfoundry/capi-release that referenced this pull request Apr 17, 2026
Changes in cloud_controller_ng:

- Fix race condition in ProcessesSync causing SSH key mismatch
    PR: cloudfoundry/cloud_controller_ng#5033
    Author: Jochen Ehret <jochen.ehret@sap.com>
    Author: Philipp Thun <philipp.thun@sap.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants