feat: Add pinned memory optimizer offload for Megatron policy worker#2248

Open
snivertynv wants to merge 1 commit into main from sniverty/optimizer_offload_perf
Conversation

@snivertynv

This significantly improves performance for the optimizer_offload_before_refit pass, which is quite expensive in co-located/sync RL cases.

Enabled/disabled via the use_pinned_optimizer_offload setting (default: disabled). It has been set to false in a couple of grpo_math* YAML configs as an example. Test cases for this feature were added in test_megatron_worker.py.

What does this PR do ?

Optimizer D2H/H2D transfers used per-tensor pageable allocations, causing expensive cudaHostAlloc calls and synchronous memcpy on every step. This adds an opt-in mode (use_pinned_optimizer_offload) that coalesces all optimizer state into a single cached pinned buffer, eliminating cudaHostAlloc from the hot path and enabling non-blocking DMA transfers.
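The approach described above can be sketched roughly as follows. This is a hedged illustration, not the PR's actual code: the `PinnedOffloadBuffer` class and `offload_states` helper are hypothetical names, and the real implementation lives in the Megatron policy worker. The key ideas it demonstrates are (1) caching one coalesced pinned host buffer so `cudaHostAlloc` leaves the steady-state hot path, and (2) copying each optimizer tensor into a slice of that buffer with `non_blocking=True` so the D2H transfer is an async DMA.

```python
import torch

class PinnedOffloadBuffer:
    """Hypothetical sketch: cache one coalesced pinned host buffer."""

    def __init__(self):
        self._buf = None

    def get(self, total_bytes: int) -> torch.Tensor:
        # Allocate (or grow) only on first use or resize, so cudaHostAlloc
        # is removed from the per-step hot path.
        if self._buf is None or self._buf.numel() < total_bytes:
            pin = torch.cuda.is_available()  # pinning requires a CUDA runtime
            self._buf = torch.empty(total_bytes, dtype=torch.uint8, pin_memory=pin)
        return self._buf

def offload_states(states, cache: PinnedOffloadBuffer):
    """Copy a list of device tensors into slices of one pinned host buffer.

    Assumes element sizes keep slice offsets aligned (true when all states
    share a dtype, e.g. fp32 optimizer state).
    """
    total = sum(t.numel() * t.element_size() for t in states)
    host = cache.get(total)
    offset, views = 0, []
    for t in states:
        nbytes = t.numel() * t.element_size()
        view = host[offset:offset + nbytes].view(t.dtype).view(t.shape)
        view.copy_(t, non_blocking=True)  # async DMA when source is on GPU
        views.append(view)
        offset += nbytes
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # copies must land before device state is freed
    return views
```

The H2D reload path is symmetric: copy each pinned slice back into the device tensors with `non_blocking=True`, reusing the same cached buffer.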

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this
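A plausible way to enable the feature in a YAML config is shown below. The nesting under `policy.megatron_cfg` is an assumption for illustration; the only identifier taken from this PR is the `use_pinned_optimizer_offload` key itself.

```yaml
# Hypothetical placement; the exact config section is assumed.
policy:
  megatron_cfg:
    use_pinned_optimizer_offload: true
```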

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

…olled using the use_pinned_optimizer_offload setting - set to false in a couple of grpo_math* yaml configs as an example. Added test cases for this feature in test_megatron_worker.py

Signed-off-by: Sriharsha Niverty <sniverty@nvidia.com>
@snivertynv snivertynv self-assigned this Apr 10, 2026
@snivertynv snivertynv requested review from a team and terrykong as code owners April 10, 2026 14:21
@copy-pr-bot

copy-pr-bot bot commented Apr 10, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@snivertynv snivertynv added the CI:L2 Run doctests, unit tests, functional tests, and convergence tests label Apr 10, 2026
@snivertynv
Author

/ok to test 13e7f01

Collaborator

@terrykong terrykong left a comment


@guyueh1 could you help review?

@terrykong terrykong requested a review from guyueh1 April 13, 2026 06:43
else:
optimizer_state = self.optimizer._get_state()

use_pinned = self.cfg.get("use_pinned_optimizer_offload", False)
Contributor


This is not ideal; we should use self.cfg['use_pinned_optimizer_offload'] instead, and add this field to all necessary base configs under examples/configs so the read doesn't fail.
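The reviewer's point is about fail-fast config access. A minimal sketch of the contrast (the dict-style config here is a stand-in for the worker's actual config object):

```python
# Assumed: the key is declared in every base config under examples/configs.
base_cfg = {"use_pinned_optimizer_offload": False}
stale_cfg = {}  # a config that was never updated with the new field

# .get() silently falls back to a default, hiding a stale or typo'd config.
silent = stale_cfg.get("use_pinned_optimizer_offload", False)

# Indexing fails loudly, surfacing the missing field immediately.
def read_flag(cfg: dict) -> bool:
    return cfg["use_pinned_optimizer_offload"]  # KeyError if the field is absent
```

With the field present in every base config, the indexing form costs nothing and catches config drift at startup rather than masking it.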

def _get_or_alloc_pinned_buf(
self, attr_name: str, total_bytes: int
) -> torch.Tensor:
"""Return a cached pinned CPU buffer, allocating only on first use or resize."""
Contributor


I recommend having a def _delete_pinned_buf(self, attr_name) paired with it. In the optimizer offload case we repeatedly reuse the same buffer and don't want to delete it until the job finishes, but if you leave this API here and someone implements some new temporal offloading with it, they will assume the deletion is handled when it actually isn't.

So either we add a deletion API here too and show people how to use them in pairs, or we make this allocation API specific to optimizer offloading (in naming, or by hardcoding it in the _optimizer_to functions), so that folks don't use it for other purposes.
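The paired-API option the reviewer describes could look roughly like this. It is a hedged sketch, not the PR's code: the mixin name and `_delete_pinned_buf` are hypothetical, and only `_get_or_alloc_pinned_buf(attr_name, total_bytes)` comes from the diff above.

```python
import torch

class PinnedBufOwnerMixin:
    """Hypothetical sketch of a paired alloc/delete API for pinned buffers."""

    def _get_or_alloc_pinned_buf(self, attr_name: str, total_bytes: int) -> torch.Tensor:
        """Return a cached pinned CPU buffer, allocating on first use or resize."""
        buf = getattr(self, attr_name, None)
        if buf is None or buf.numel() < total_bytes:
            buf = torch.empty(total_bytes, dtype=torch.uint8,
                              pin_memory=torch.cuda.is_available())
            setattr(self, attr_name, buf)
        return buf

    def _delete_pinned_buf(self, attr_name: str) -> None:
        """Release the cached pinned buffer once the caller is done offloading."""
        if getattr(self, attr_name, None) is not None:
            # Dropping the last reference lets the pinned host memory be freed.
            setattr(self, attr_name, None)
```

Long-lived users (like the optimizer offload path) simply never call `_delete_pinned_buf` until shutdown; temporary users call the two in pairs.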
