feat: Add pinned memory optimizer offload for Megatron policy worker#2248

Open
snivertynv wants to merge 1 commit into main from sniverty/optimizer_offload_perf
Conversation

@snivertynv

This significantly improves performance for the optimizer_offload_before_refit pass, which is quite expensive in co-located/sync RL cases.

Enabled/disabled via the use_pinned_optimizer_offload setting (default: disabled). It has been set to false in a couple of grpo_math* YAML configs as an example. Test cases for this feature were added in test_megatron_worker.py.

What does this PR do ?

Optimizer D2H/H2D transfers used per-tensor pageable allocations, causing expensive cudaHostAlloc calls and synchronous memcpy on every step. This adds an opt-in mode (use_pinned_optimizer_offload) that coalesces all optimizer state into a single cached pinned buffer, eliminating cudaHostAlloc from the hot path and enabling non-blocking DMA transfers.
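The approach described above can be sketched roughly as follows. This is a hedged illustration, not the PR's actual code: the `PinnedOffloadBuffer` class and `offload_states` helper are hypothetical names, and the real implementation lives in the Megatron policy worker. The key ideas it demonstrates are (1) caching one coalesced pinned host buffer so `cudaHostAlloc` leaves the steady-state hot path, and (2) copying each optimizer tensor into a slice of that buffer with `non_blocking=True` so the D2H transfer is an async DMA.

```python
import torch

class PinnedOffloadBuffer:
    """Hypothetical sketch: cache one coalesced pinned host buffer."""

    def __init__(self):
        self._buf = None

    def get(self, total_bytes: int) -> torch.Tensor:
        # Allocate (or grow) only on first use or resize, so cudaHostAlloc
        # is removed from the per-step hot path.
        if self._buf is None or self._buf.numel() < total_bytes:
            pin = torch.cuda.is_available()  # pinning requires a CUDA runtime
            self._buf = torch.empty(total_bytes, dtype=torch.uint8, pin_memory=pin)
        return self._buf

def offload_states(states, cache: PinnedOffloadBuffer):
    """Copy a list of device tensors into slices of one pinned host buffer.

    Assumes element sizes keep slice offsets aligned (true when all states
    share a dtype, e.g. fp32 optimizer state).
    """
    total = sum(t.numel() * t.element_size() for t in states)
    host = cache.get(total)
    offset, views = 0, []
    for t in states:
        nbytes = t.numel() * t.element_size()
        view = host[offset:offset + nbytes].view(t.dtype).view(t.shape)
        view.copy_(t, non_blocking=True)  # async DMA when source is on GPU
        views.append(view)
        offset += nbytes
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # copies must land before device state is freed
    return views
```

The H2D reload path is symmetric: copy each pinned slice back into the device tensors with `non_blocking=True`, reusing the same cached buffer.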

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this
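A plausible way to enable the feature in a YAML config is shown below. The nesting under `policy.megatron_cfg` is an assumption for illustration; the only identifier taken from this PR is the `use_pinned_optimizer_offload` key itself.

```yaml
# Hypothetical placement; the exact config section is assumed.
policy:
  megatron_cfg:
    use_pinned_optimizer_offload: true
```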

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

…olled using the use_pinned_optimizer_offload setting - set to false in a couple of grpo_math* yaml configs as an example. Added test cases for this feature in test_megatron_worker.py

Signed-off-by: Sriharsha Niverty <sniverty@nvidia.com>
@snivertynv snivertynv self-assigned this Apr 10, 2026
@snivertynv snivertynv requested review from a team and terrykong as code owners April 10, 2026 14:21
@copy-pr-bot

copy-pr-bot bot commented Apr 10, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@snivertynv snivertynv added the CI:L2 Run doctests, unit tests, functional tests, and convergence tests label Apr 10, 2026
@snivertynv
Author

/ok to test 13e7f01

Collaborator

@terrykong terrykong left a comment


@guyueh1 could you help review?

@terrykong terrykong requested a review from guyueh1 April 13, 2026 06:43
else:
optimizer_state = self.optimizer._get_state()

use_pinned = self.cfg.get("use_pinned_optimizer_offload", False)
Contributor


This is not ideal; we should use self.cfg['use_pinned_optimizer_offload'] instead, and add this field to all necessary base configs under examples/configs so the read doesn't fail.
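The reviewer's point is about fail-fast config access. A minimal sketch of the contrast (the dict-style config here is a stand-in for the worker's actual config object):

```python
# Assumed: the key is declared in every base config under examples/configs.
base_cfg = {"use_pinned_optimizer_offload": False}
stale_cfg = {}  # a config that was never updated with the new field

# .get() silently falls back to a default, hiding a stale or typo'd config.
silent = stale_cfg.get("use_pinned_optimizer_offload", False)

# Indexing fails loudly, surfacing the missing field immediately.
def read_flag(cfg: dict) -> bool:
    return cfg["use_pinned_optimizer_offload"]  # KeyError if the field is absent
```

With the field present in every base config, the indexing form costs nothing and catches config drift at startup rather than masking it.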

def _get_or_alloc_pinned_buf(
self, attr_name: str, total_bytes: int
) -> torch.Tensor:
"""Return a cached pinned CPU buffer, allocating only on first use or resize."""
Contributor


I recommend having a def _delete_pinned_buf(self, attr_name) paired with it. In the optimizer offload case we repeatedly reuse the same buffer and don't want to delete it until the job finishes, but if you leave this API here and someone implements some new temporal offloading with it, they will assume the deletion is handled when it actually isn't.

So either we add a deletion API here too and show people how to use them in pairs, or we make this allocation API specific to optimizer offloading (in naming, or by hardcoding it in the _optimizer_to functions), so that folks don't use it for other purposes.
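The paired-API option the reviewer describes could look roughly like this. It is a hedged sketch, not the PR's code: the mixin name and `_delete_pinned_buf` are hypothetical, and only `_get_or_alloc_pinned_buf(attr_name, total_bytes)` comes from the diff above.

```python
import torch

class PinnedBufOwnerMixin:
    """Hypothetical sketch of a paired alloc/delete API for pinned buffers."""

    def _get_or_alloc_pinned_buf(self, attr_name: str, total_bytes: int) -> torch.Tensor:
        """Return a cached pinned CPU buffer, allocating on first use or resize."""
        buf = getattr(self, attr_name, None)
        if buf is None or buf.numel() < total_bytes:
            buf = torch.empty(total_bytes, dtype=torch.uint8,
                              pin_memory=torch.cuda.is_available())
            setattr(self, attr_name, buf)
        return buf

    def _delete_pinned_buf(self, attr_name: str) -> None:
        """Release the cached pinned buffer once the caller is done offloading."""
        if getattr(self, attr_name, None) is not None:
            # Dropping the last reference lets the pinned host memory be freed.
            setattr(self, attr_name, None)
```

Long-lived users (like the optimizer offload path) simply never call `_delete_pinned_buf` until shutdown; temporary users call the two in pairs.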
