[PyTorch] [CI] Capture subprocess stderr in distributed tests for better CI error reporting #2802
Conversation
Distributed tests launch subprocesses via torch.distributed.launch/torchrun. When these fail, pytest only captures the CalledProcessError from the parent process, not the actual worker traceback. This makes CI JUnit XML reports show "exit code 1" with no useful error detail.

Add a run_distributed() utility to tests/pytorch/utils.py that captures stderr while letting stdout stream to the terminal. On failure, the worker's stderr (containing the actual Python traceback) is included in the AssertionError, which pytest writes into the JUnit XML report.

Behavior:
- Interactive use: stdout streams in real time (unchanged), stderr shown on failure
- CI/JUnit XML: failure reports now include the actual worker traceback

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Add a --output-junit flag so ctest writes JUnit XML to /logs/, matching the pattern used by pytest tests. The XML is written before ctest exits, so it is captured even on test failure.

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
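For illustration, a ctest invocation using this flag might look like the following (the build directory and log path are assumptions; `--output-junit` requires CMake 3.21 or newer):

```shell
ctest --test-dir build --output-junit /logs/ctest_results.xml
```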
Greptile Summary

This PR introduces a

Confidence Score: 4/5. Safe to merge; all issues are style/consistency P2s with no functional regressions introduced. The core utility is correct and the migration of five test callers is clean. The one missed migration (test_cast_master_weights_to_fp8 at line 1008) is an inconsistency that doesn't break anything; it just means that one test still lacks improved error reporting. The unused subprocess imports are minor cleanup. No logic errors or regressions are introduced.

Important Files Changed
- tests/pytorch/distributed/test_cast_master_weights_to_fp8.py: has an unmigrated subprocess.run call at line 1008
Sequence Diagram

```mermaid
sequenceDiagram
    participant PT as pytest (parent)
    participant RD as run_distributed()
    participant SP as subprocess (torchrun/worker)
    PT->>RD: call run_distributed(args, **kwargs)
    RD->>SP: subprocess.run(stderr=PIPE, text=True)
    Note over SP: stdout → terminal (real-time)
    Note over SP: stderr → captured in memory
    SP-->>RD: CompletedProcess(returncode, stderr)
    alt returncode in valid_returncodes
        RD-->>PT: return CompletedProcess
    else returncode not in valid_returncodes
        RD->>RD: truncate stderr to last 4000 chars
        RD-->>PT: raise AssertionError(cmd + stderr tail)
        Note over PT: pytest writes AssertionError into JUnit XML report
    end
```
```python
        Use (0, 5) for inner pytest runs where 5 means all tests skipped.
        **kwargs: Passed through to subprocess.run (e.g. env, timeout).
    """
    result = subprocess.run(args, stderr=subprocess.PIPE, text=True, **kwargs)
```
**kwargs can silently conflict with stderr and text

If a caller ever passes stderr= or text= through **kwargs, Python will raise `TypeError: subprocess.run() got multiple values for keyword argument 'stderr'`. Consider explicitly popping or blocking those keys, or documenting the restriction:

```python
kwargs.pop("stderr", None)  # always captured internally
kwargs.pop("text", None)    # always text mode internally
result = subprocess.run(args, stderr=subprocess.PIPE, text=True, **kwargs)
```

None of the current call sites pass these, so this is not an immediate bug, just a fragile API surface.