
[PyTorch] [CI] Capture subprocess stderr in distributed tests for better CI error re… #2802

Open
sudhakarsingh27 wants to merge 2 commits into NVIDIA:main from sudhakarsingh27:sudhakars/improve-distributed-test-errors

Conversation

@sudhakarsingh27 (Collaborator)

[PyTorch] [CI] Capture subprocess stderr in distributed tests for better CI error reporting

Distributed tests launch subprocesses via torch.distributed.launch/torchrun. When these fail, pytest only captures the CalledProcessError from the parent process, not the actual worker traceback. This makes CI JUnit XML reports show "exit code 1" with no useful error detail.

Add run_distributed() utility to tests/pytorch/utils.py that captures stderr while letting stdout stream to the terminal. On failure, the worker's stderr (containing the actual Python traceback) is included in the AssertionError, which pytest writes into the JUnit XML report.

Behavior:

  • Interactive use: stdout streams in real time (unchanged), stderr shown on failure
  • CI/JUnit XML: failure reports now include the actual worker traceback

Description

Please include a brief summary of the changes, relevant motivation and context.

Fixes # (issue)

Type of change

  • Documentation change (a change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Commits (2):

[PyTorch] [CI] Capture subprocess stderr in distributed tests for better CI error reporting

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Add --output-junit flag so ctest writes JUnit XML to /logs/,
matching the pattern used by pytest tests. The XML is written
before ctest exits, so it's captured even on test failure.

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
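The ctest change above can be sketched as a shell fragment like the following; LOG_DIR (defaulting to ./logs here so the sketch is runnable), the build directory, and the XML filename are illustrative assumptions rather than the exact contents of qa/L0_cppunittest/test.sh, which writes to /logs/:

```shell
#!/bin/sh
# Sketch: have ctest emit JUnit XML so CI can ingest C++ results like pytest's.
LOG_DIR="${LOG_DIR:-./logs}"
mkdir -p "$LOG_DIR"   # make sure the directory exists before ctest writes to it

# --output-junit makes ctest write the XML before it exits, so the report
# survives even when tests fail.
CTEST_CMD="ctest --test-dir build --output-junit $LOG_DIR/cpp_unittest.xml"
echo "$CTEST_CMD"
```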
@sudhakarsingh27 marked this pull request as ready for review on March 27, 2026 07:51

greptile-apps bot commented Mar 27, 2026

Greptile Summary

This PR introduces a run_distributed() utility in tests/pytorch/utils.py to improve CI error reporting for distributed PyTorch tests. Previously, when a torchrun/torch.distributed.run subprocess failed, pytest would only capture the CalledProcessError exit code from the parent process; the actual worker traceback was lost. The new helper captures stderr via subprocess.PIPE while keeping stdout streaming to the terminal, then surfaces the worker traceback in an AssertionError that pytest includes in JUnit XML reports. Five test files are migrated to the new helper, and qa/L0_cppunittest/test.sh gains JUnit XML output from ctest.

Key changes and notes:

  • run_distributed() truncates stderr to the last 4,000 characters, which keeps failure messages compact while preserving the most relevant tail of the Python traceback.
  • A valid_returncodes parameter cleanly replaces the previous assert result.returncode in (0, 5) pattern used in the FSDP2 tests (pytest exits with 5 when all tests are skipped).
  • test_cast_master_weights_to_fp8 (the non-nvfp4 test function at line 1008) was not migrated; it still uses subprocess.run(..., check=True), leaving its CI failures without traceback detail.
  • import subprocess is now unused in test_attention_with_cp.py, test_fusible_ops_with_userbuffers.py, and test_torch_fsdp2.py and should be removed.

Confidence Score: 4/5

Safe to merge; all issues are style/consistency P2s with no functional regressions introduced.

The core utility is correct and the migration of five test callers is clean. The one missed migration (test_cast_master_weights_to_fp8 at line 1008) is an inconsistency that doesn't break anything — it just means that one test still lacks improved error reporting. Unused subprocess imports are minor cleanup. No logic errors or regressions are introduced.
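The valid_returncodes=(0, 5) pattern mentioned above can be illustrated with a small sketch; check_returncode here is a hypothetical stand-in for the check inside run_distributed, and the subprocess simulates an inner pytest run:

```python
import subprocess
import sys

def check_returncode(result, valid_returncodes=(0,)):
    # pytest exits with 5 when no tests ran (e.g. everything was skipped),
    # so inner pytest invocations pass valid_returncodes=(0, 5).
    if result.returncode not in valid_returncodes:
        raise AssertionError(
            f"exit code {result.returncode}:\n{(result.stderr or '')[-4000:]}"
        )
    return result

# Simulate an inner pytest run where all tests were skipped (exit code 5).
skipped = subprocess.run(
    [sys.executable, "-c", "raise SystemExit(5)"],
    stderr=subprocess.PIPE, text=True,
)
check_returncode(skipped, valid_returncodes=(0, 5))  # accepted as success
```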

tests/pytorch/distributed/test_cast_master_weights_to_fp8.py — has an unmigrated subprocess.run call at line 1008

Important Files Changed

  • tests/pytorch/utils.py: New run_distributed() utility captures stderr for better CI JUnit XML reporting; minor concern with **kwargs passthrough allowing callers to shadow stderr/text.
  • tests/pytorch/distributed/test_cast_master_weights_to_fp8.py: Migrates test_nvfp4_partial_cast_matches_full to run_distributed but leaves test_cast_master_weights_to_fp8 (line 1008) on the old subprocess.run(check=True) pattern.
  • tests/pytorch/attention/test_attention_with_cp.py: Both subprocess.run calls migrated to run_distributed; the subprocess import is now unused.
  • tests/pytorch/distributed/test_fusible_ops_with_userbuffers.py: subprocess.run migrated to run_distributed; the subprocess import is now unused.
  • tests/pytorch/distributed/test_torch_fsdp2.py: Both torchrun calls migrated cleanly; valid_returncodes=(0, 5) correctly preserves pytest exit-code-5 semantics; the subprocess import is now unused.
  • qa/L0_cppunittest/test.sh: Adds JUnit XML output to ctest via --output-junit and ensures the log directory exists; unrelated to the distributed-test stderr capture but a sensible parallel improvement.

Sequence Diagram

sequenceDiagram
    participant PT as pytest (parent)
    participant RD as run_distributed()
    participant SP as subprocess (torchrun/worker)

    PT->>RD: call run_distributed(args, **kwargs)
    RD->>SP: subprocess.run(stderr=PIPE, text=True)
    Note over SP: stdout → terminal (real-time)
    Note over SP: stderr → captured in memory
    SP-->>RD: CompletedProcess(returncode, stderr)
    alt returncode in valid_returncodes
        RD-->>PT: return CompletedProcess
    else returncode not in valid_returncodes
        RD->>RD: truncate stderr to last 4000 chars
        RD-->>PT: raise AssertionError(cmd + stderr tail)
        Note over PT: pytest writes AssertionError into JUnit XML report
    end

Comments Outside Diff (2)

  1. tests/pytorch/distributed/test_cast_master_weights_to_fp8.py, lines 1008-1011

    P2 Incomplete migration to run_distributed

    test_cast_master_weights_to_fp8 still uses the old subprocess.run(..., check=True) pattern. This test launches a distributed worker via torch.distributed.run, exactly the same kind of subprocess the PR is targeting — so CI failures here will still show "exit code 1" with no traceback.

    This is also the only reason import subprocess is still needed in this file; migrating this call would let that import be removed too.

  2. tests/pytorch/attention/test_attention_with_cp.py, line 6

    P2 Unused subprocess import

    Both subprocess.run calls in this file were replaced by run_distributed, leaving import subprocess with no remaining usages. The same applies to tests/pytorch/distributed/test_fusible_ops_with_userbuffers.py (line 13) and tests/pytorch/distributed/test_torch_fsdp2.py (line 7).

    The three files with the stale import are:

    • tests/pytorch/attention/test_attention_with_cp.py:6
    • tests/pytorch/distributed/test_fusible_ops_with_userbuffers.py:13
    • tests/pytorch/distributed/test_torch_fsdp2.py:7

Reviews (1): Last reviewed commit: "Add JUnit XML output to ctest in L0_cppu..."

Code context (from run_distributed in tests/pytorch/utils.py):

        Use (0, 5) for inner pytest runs where 5 means all tests skipped.
        **kwargs: Passed through to subprocess.run (e.g. env, timeout).
    """
    result = subprocess.run(args, stderr=subprocess.PIPE, text=True, **kwargs)

P2 **kwargs can silently conflict with stderr and text

If a caller ever passes stderr= or text= through **kwargs, Python will raise TypeError: subprocess.run() got multiple values for keyword argument 'stderr'. Consider explicitly popping or blocking those keys, or documenting the restriction:

kwargs.pop("stderr", None)  # always captured internally
kwargs.pop("text", None)    # always text mode internally
result = subprocess.run(args, stderr=subprocess.PIPE, text=True, **kwargs)

None of the current call sites pass these, so this is not an immediate bug — just a fragile API surface.
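The suggested guard can be exercised with a short sketch; run_distributed here is a simplified stand-in for the actual helper, showing only the keyword-conflict handling:

```python
import subprocess
import sys

def run_distributed(args, **kwargs):
    # Drop keys the helper always sets itself, so a caller passing stderr=...
    # or text=... cannot trigger
    # "got multiple values for keyword argument 'stderr'".
    kwargs.pop("stderr", None)  # always captured internally
    kwargs.pop("text", None)    # always text mode internally
    return subprocess.run(args, stderr=subprocess.PIPE, text=True, **kwargs)

# Without the pops, stderr=subprocess.DEVNULL here would raise TypeError;
# with them, the helper's own stderr=PIPE wins and the write is captured.
result = run_distributed(
    [sys.executable, "-c", "import sys; sys.stderr.write('captured')"],
    stderr=subprocess.DEVNULL,
)
```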

