If model parameters are DTensors, optimizer states should also be DTensors. #2795

cspades wants to merge 12 commits into NVIDIA:main
Conversation
…sor. Signed-off-by: Cory Ye <cye@nvidia.com>
for more information, see https://pre-commit.ci
Signed-off-by: Cory Ye <cye@nvidia.com>
Greptile Summary: This PR fixes a bug where FusedAdam optimizer states for DTensor parameters were stored as plain tensors rather than DTensors.
Confidence Score: 4/5
Important Files Changed
Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A{param is DTensor?} -->|Yes| B[Extract local_param via _local_tensor]
    A -->|No| C[local_param = param]
    B --> D[Dequantize if QuantizedTensor]
    C --> D
    D --> E[Allocate data with target dtype and local shape]
    E --> F{dtype == uint8?}
    F -->|Yes| G[Float8Quantizer.make_empty using data.shape]
    G --> H[quantize_ into Float8Tensor state]
    F -->|No| I[state = data plain tensor]
    H --> J{param is DTensor?}
    I --> J
    J -->|Yes| K[DTensor.from_local with global shape and placements]
    J -->|No| L[Store plain tensor in self.state]
    K --> L
    L --> M[step - unwrap DTensor to local tensor for CUDA kernel]
    L --> N[state_dict - unscale then re-wrap as DTensor for DCP]
    N --> O[load_state_dict - DCP provides resharded DTensor]
    O --> P[Unwrap local tensor, call set_scaled_state to rescale and store]
```
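The wrapping step in the flowchart (`DTensor.from_local` with the parameter's mesh and placements) can be sketched as follows. This is a minimal illustration, not the PR's actual code: `make_exp_avg_like` is a hypothetical helper name, and the DTensor branch assumes a process group and device mesh have already been initialized.

```python
import torch

try:
    from torch.distributed.tensor import DTensor  # public path, torch >= 2.4
except ImportError:
    from torch.distributed._tensor import DTensor  # older private path

def make_exp_avg_like(param: torch.Tensor) -> torch.Tensor:
    """Allocate an Adam moment buffer distributed like `param`.

    Hypothetical helper sketching the pattern in this PR: when the
    parameter is a DTensor, build the state from this rank's local
    shard and re-wrap it with the parameter's mesh and placements, so
    Torch DCP saves a sharded entry instead of a global tensor.
    """
    if isinstance(param, DTensor):
        # Allocate only the local shard held by this rank...
        local = torch.zeros_like(param.to_local())
        # ...and wrap it so it reshards exactly like the parameter.
        return DTensor.from_local(
            local, device_mesh=param.device_mesh, placements=param.placements
        )
    # Non-distributed parameters keep plain-tensor state.
    return torch.zeros_like(param)

# The plain-tensor branch runs without any process group:
p = torch.randn(4, 8)
state = make_exp_avg_like(p)
print(type(state).__name__, tuple(state.shape))  # Tensor (4, 8)
```

The key design point is that the state inherits `device_mesh` and `placements` from the parameter, rather than inventing its own sharding.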
Reviews (7): Last reviewed commit: "Merge branch 'main' into cye/fused-adam-..."
@cspades could you please elaborate on the downstream error/issue caused? As in, what happens if we load the unsharded tensor for the optimizer state as a plain tensor instead of a DTensor?
Here is how I understand it, @shjwudp, correct me if I am wrong about the Megatron-FSDP details, as I still need to reproduce the bug and ensure this PR fixes it. I believe a customer reported this bug?
The fix is to keep the optimizer state in `DTensor` form so that Torch DCP can reshard it the same way as the parameter it belongs to.
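For the optimizer step itself, the flowchart notes that DTensor state is unwrapped via `_local_tensor` before the fused CUDA kernel runs. A minimal sketch of that unwrap, with `unwrap_local` as a hypothetical helper name:

```python
import torch

try:
    from torch.distributed.tensor import DTensor  # public path, torch >= 2.4
except ImportError:
    from torch.distributed._tensor import DTensor  # older private path

def unwrap_local(t: torch.Tensor) -> torch.Tensor:
    """Return the local shard a fused kernel can operate on.

    Hypothetical helper: `_local_tensor` is the field named in the
    flowchart (the public equivalent is `DTensor.to_local()`). The
    local shard aliases the DTensor's storage, so in-place updates by
    the kernel are visible through the wrapped state.
    """
    return t._local_tensor if isinstance(t, DTensor) else t

# Plain tensors pass through unchanged (runs without a process group):
x = torch.ones(3)
y = unwrap_local(x)
print(y is x, y.sum().item())  # True 3.0
```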
Add Greptile bug-fixes. Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Cory Ye <44509866+cspades@users.noreply.github.com>
Signed-off-by: Cory Ye <cye@nvidia.com>
Signed-off-by: Cory Ye <cye@nvidia.com>
… re-sharding test. Signed-off-by: Cory Ye <cye@nvidia.com>
pstjohn
left a comment
TL;DR: it would fail if you train on 4 ranks and load on 2 ranks; this PR adds a test for that (among other issues with mFSDP).
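The resharding failure described above comes down to shard-shape arithmetic: a DTensor carries the mesh and placement metadata DCP needs to recompute per-rank shard sizes when the world size changes, while a plain tensor does not. A toy illustration (the 512-element state size is invented for illustration):

```python
# Without sharding metadata, DCP cannot map shards saved by 4 ranks
# onto the differently sized shards expected by 2 ranks.
total_elems = 512
save_world_size, load_world_size = 4, 2

saved_shard = total_elems // save_world_size      # each saving rank held 128
expected_shard = total_elems // load_world_size   # each loading rank expects 256

print(saved_shard, expected_shard, saved_shard == expected_shard)  # 128 256 False
```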
Signed-off-by: Cory Ye <cye@nvidia.com>
/te-ci L1 pytorch
Description
FusedAdam's optimizer state is converted into a non-distributed Tensor, which Torch DCP then loads as a global / un-sharded state dictionary. This PR wraps the optimizer state as a DTensor matching the distribution characteristics of the original DTensor parameter the state is associated with.

Fixes a bug introduced by the new `DTensor(QuantizedTensor)` (FSDP2-only) use case introduced in #2698 (Megatron-FSDP just uses `DTensor(Float32)` for the distributed optimizer state).

Testing

Added a DCP save/load re-sharding test with `--use-precision-aware-optimizer`.

Type of change
Checklist: