Change bias initialization from 'embed' to 'heads' #371

Open
csgoogle wants to merge 1 commit into main from fixbiassharding

Conversation

Collaborator

@csgoogle csgoogle commented Apr 6, 2026

  • Fix incorrect logical partitioning axes for attention and feed-forward parameters in Flax/WAN/LTX2 modules.
  • Refactor flash-attention block-size selection into a helper and add unit tests.

doc: https://docs.google.com/document/d/1absFkpQAMM3YaYWxO_FYeqzDpypYeDbPsJRAV86nFQ0/edit?usp=sharing&resourcekey=0-FOzOmM0UdfU1LcDd_7epvw
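For context, here is a minimal sketch of how Flax-style logical axis names resolve to mesh axes. The rule names below are illustrative, not taken from this repo's config: the point is that a bias tagged with the wrong logical axis resolves to a replicated spec instead of a sharded one.

```python
# Illustrative logical-to-mesh axis rules (hypothetical names, not the
# repo's actual sharding config): "heads" maps to the tensor-parallel
# mesh axis, while "embed" is unmapped and therefore replicated.
RULES = {"batch": "data", "heads": "tensor", "embed": None}

def resolve(logical_axes):
    """Map a tuple of logical axis names to mesh axis names."""
    return tuple(RULES.get(axis) for axis in logical_axes)

# A projection bias tagged ("embed",) resolves to (None,): every
# tensor-parallel shard keeps a full replica of the bias.
print(resolve(("embed",)))   # (None,)

# Tagged ("heads",) it resolves to ("tensor",): each shard holds only
# its own slice, which is what this PR changes the initialization to.
print(resolve(("heads",)))   # ('tensor',)
```

With `tp > 1`, the replicated variant also forces an all-gather of the full parameter before use, which is the communication overhead the results below measure.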

Results

| Metric | main | fixbiassharding | Δ |
| --- | --- | --- | --- |
| Compile time | 1913.9 s | 1906.4 s | -7.5 s |
| Inference time | 1656.4 s | 1642.1 s | -14.3 s (-0.9%) |

Notes

  • No difference observed with tp=1 configs; the improvement only surfaces when tensor parallelism is active, since the axis fixes reduce parameter all-gather overhead in the MLP layers
  • The primary motivation for this change is correctness: incorrect sharding axes can cause OOMs or numerical issues under other parallelism configs
  • Larger gains are expected at tp=4 or tp=8, where parameter communication is a larger fraction of step time

Video Quality Comparison

| Branch | Video |
| --- | --- |
| main | main.mp4 |
| fixbiassharding | fixbiassharding.mp4 |

PSNR/SSIM (frame-by-frame, 81 frames):

| Metric | Mean | Min | Max |
| --- | --- | --- | --- |
| PSNR (dB) | 19.37 | 18.83 | 20.17 |
| SSIM | 0.7884 | 0.7654 | 0.8043 |

The low PSNR/SSIM reflects floating-point non-determinism from the different sharding layouts accumulating over 50 denoising steps (bfloat16 plus different collective patterns); the videos are visually identical.
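For reference, a self-contained sketch of the frame-by-frame PSNR computation behind a comparison like the one above (standard definition; the actual metric script used for this PR is not shown here):

```python
import numpy as np

def psnr(a: np.ndarray, b: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio between two frames, in dB."""
    mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
    if mse == 0.0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_val**2 / mse)

def psnr_stats(frames_a, frames_b, max_val: float = 255.0):
    """Mean/min/max PSNR over paired frames, as reported in the table."""
    vals = [psnr(fa, fb, max_val) for fa, fb in zip(frames_a, frames_b)]
    return float(np.mean(vals)), float(np.min(vals)), float(np.max(vals))

# Example: a uniform error of 10 gray levels gives MSE = 100.
a = np.zeros((8, 8), dtype=np.uint8)
b = np.full((8, 8), 10, dtype=np.uint8)
print(round(psnr(a, b), 2))  # 28.13
```

A mean near 19–20 dB, as in the table, corresponds to per-pixel differences far larger than one gray level, which is why the note above attributes it to accumulated bfloat16 non-determinism rather than a quality regression.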

Video and Xprof after fix:

https://console.cloud.google.com/storage/browser/sagarchapara/shardingfixes

@csgoogle csgoogle requested a review from entrpn as a code owner April 6, 2026 10:09
@csgoogle csgoogle force-pushed the fixbiassharding branch 2 times, most recently from d822acb to 15af39f Compare April 13, 2026 10:41
entrpn previously approved these changes Apr 14, 2026
@Perseus14
Collaborator

Could you add more details and results on the new commits? @csgoogle

@csgoogle csgoogle force-pushed the fixbiassharding branch 5 times, most recently from 9780b17 to 7a6ab88 Compare April 15, 2026 14:59
Comment thread src/maxdiffusion/models/attention_flax.py
@csgoogle
Collaborator Author

> Could you add more details and results on the new commits? @csgoogle

done

@github-actions

🤖 Hi @csgoogle, I've received your request, and I'm working on it now! You can track my progress in the logs for more details.

@github-actions

🤖 I'm sorry @csgoogle, but I was unable to process your request. Please see the logs for more details.


@github-actions github-actions bot left a comment


## 📋 Review Summary

This pull request provides critical fixes for logical partitioning axes across several models (Flax, LTX2, Wan) and refactors flash-attention block-size selection. These changes matter for correctness, and for performance under tensor parallelism by reducing parameter communication overhead.

### 🔍 General Feedback

  • Bug Fix: The refactoring of flash-attention block size selection into _select_flash_block_sizes correctly addresses a latent bug where the wrong axis (head dimension instead of sequence length) was used for rank-3 inputs.
  • Performance: Correcting the bias sharding axes for QKV and output projections, as well as MLP layers, will optimize communication patterns in multi-device TPU setups.
  • Testing: The addition of unit tests for the new flash-attention helper functions ensures the robustness of the block size selection logic.
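A hedged sketch of the rank handling described above (the helper name `_select_flash_block_sizes` comes from the review; the candidate sizes and divisibility policy below are assumptions): for both rank-3 `(batch, seq, head_dim)` and rank-4 `(batch, heads, seq, head_dim)` inputs, the sequence length sits at axis `-2`, whereas reading the last axis on a rank-3 input picks up `head_dim` instead, which is the wrong-axis bug the refactor fixes.

```python
# Hypothetical reconstruction of the block-size selection logic; the
# candidate sizes and fallback policy are illustrative assumptions.
def select_flash_block_sizes(shape, candidates=(512, 256, 128)):
    """Pick the largest candidate block size that divides the sequence
    length. Works for rank-3 (batch, seq, head_dim) and rank-4
    (batch, heads, seq, head_dim) shapes: in both, seq is axis -2.
    Reading shape[-1] here would return head_dim on rank-3 inputs,
    i.e. the wrong-axis bug described in the review.
    """
    seq_len = shape[-2]
    for block in candidates:
        if seq_len % block == 0:
            return block
    return min(candidates)  # fallback when nothing divides evenly

print(select_flash_block_sizes((2, 1024, 64)))    # 512 (rank-3)
print(select_flash_block_sizes((2, 8, 384, 64)))  # 128 (rank-4)
```

The unit tests added in this PR would cover exactly this kind of case pair: a rank-3 and a rank-4 shape that must resolve block sizes from the sequence axis rather than the head dimension.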

Comment thread src/maxdiffusion/models/attention_flax.py
Comment thread src/maxdiffusion/models/ltx2/attention_ltx2.py
Comment thread src/maxdiffusion/models/attention_flax.py
Comment thread src/maxdiffusion/models/attention_flax.py
Comment thread src/maxdiffusion/models/wan/transformers/transformer_wan.py
Comment thread src/maxdiffusion/models/attention_flax.py