
fix(snapshots): Chunk image uploads to avoid fd exhaustion and 413 errors#3249

Draft
NicoHinderling wants to merge 2 commits into 03-26-perf_snapshots_parallelize_image_hashing_with_rayon from fix/chunk-snapshot-uploads

Conversation

@NicoHinderling
Contributor

Uploading hundreds of images via build snapshots fails in two ways:

  1. Too many open files (EMFILE): All image file handles are opened upfront and held until the batch send completes. This exceeds OS file descriptor limits (~256 on macOS, ~1024 on Linux).
  2. 413 Payload Too Large: The objectstore client batches up to 1000 operations into a single HTTP request. With hundreds of images the multipart body exceeds the server's size limit.

Both failures were confirmed with a real 753-image upload. 734 out of 753 images failed with 413 errors.

This PR splits upload_images() into two phases:

  • Phase 1 (pre-process): Iterate all images to detect filename collisions, compute SHA-256 hashes, read sidecar metadata, and build manifest entries — without opening any files for upload.
  • Phase 2 (chunked upload): Upload in batches of 100 images. For each batch, open files, build a fresh session.many() builder, and send. This bounds both open file descriptors and HTTP request payload size.
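A minimal sketch of the two-phase shape described above (the names `preprocess`, `ImageEntry`, and `upload_all` are illustrative placeholders, not the PR's actual API):

```rust
// Illustrative sketch of the two-phase upload; names are hypothetical.
const UPLOAD_BATCH_SIZE: usize = 100;

struct ImageEntry {
    path: String,
    sha256: String, // stand-in for the real SHA-256 digest
}

// Phase 1: build manifest entries without holding any file handles open.
fn preprocess(paths: &[String]) -> Vec<ImageEntry> {
    paths
        .iter()
        .map(|p| ImageEntry {
            path: p.clone(),
            sha256: format!("sha256-of-{p}"), // placeholder hash
        })
        .collect()
}

// Phase 2: upload in bounded batches, so at most UPLOAD_BATCH_SIZE files
// are open (and in a single HTTP request) at any point in time.
fn upload_all(entries: &[ImageEntry]) -> usize {
    let mut requests = 0;
    for chunk in entries.chunks(UPLOAD_BATCH_SIZE) {
        // Here the real code opens this chunk's files, builds a fresh
        // session.many() batch, and sends it.
        let _ = chunk;
        requests += 1;
    }
    requests
}
```

With the 753-image reproduction from the description, this shape yields 8 bounded requests instead of one oversized batch.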

Also includes minor code quality improvements: simplified scope extraction, iterator-based collision formatting, and removal of redundant bindings.

@NicoHinderling NicoHinderling marked this pull request as ready for review March 26, 2026 20:28
@NicoHinderling NicoHinderling requested review from a team and szokeasaurusrex as code owners March 26, 2026 20:28

// File extensions treated as snapshot images.
const IMAGE_EXTENSIONS: &[&str] = &["png", "jpg", "jpeg"];
// Reject images larger than 40 megapixels.
const MAX_PIXELS_PER_IMAGE: u64 = 40_000_000;
// Images per upload request; bounds open fds and request payload size.
const UPLOAD_BATCH_SIZE: usize = 100;
Contributor Author

@NicoHinderling NicoHinderling Mar 26, 2026


Open to thoughts, but through some testing this seems to be a good number.

Contributor Author

NicoHinderling commented Mar 26, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite. Learn more about stacking.

NicoHinderling and others added 2 commits March 26, 2026 14:28
…rors

Uploading hundreds of images fails in two ways: all file handles are
opened upfront causing EMFILE (too many open files), and the objectstore
client batches everything into one HTTP request causing 413 Payload Too
Large from the server.

Split upload_images() into two phases: pre-process all images without
opening files (collision detection, hashing, metadata), then upload in
chunks of 100. This bounds both open file descriptors and HTTP request
payload size regardless of total image count.

Co-Authored-By: Claude <noreply@anthropic.com>
@NicoHinderling NicoHinderling changed the base branch from master to graphite-base/3249 March 26, 2026 21:28
@NicoHinderling NicoHinderling force-pushed the fix/chunk-snapshot-uploads branch from 938a884 to 835e271 Compare March 26, 2026 21:28
@NicoHinderling NicoHinderling changed the base branch from graphite-base/3249 to 03-26-perf_snapshots_parallelize_image_hashing_with_rayon March 26, 2026 21:29
@NicoHinderling NicoHinderling marked this pull request as draft March 26, 2026 21:33
@NicoHinderling
Contributor Author

Synced offline with @lcian. Putting this in draft mode: while the fd limit is real, the batching apparently shouldn't be handled on the client side.


@cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.



  let mut error_count = 0;
- for error in errors {
-     let error = anyhow::Error::new(error);
+ for error in &errors {

Upload error messages lose cause chain detail

Low Severity

The old code wrapped each upload error in anyhow::Error::new(error) before formatting with {error:#}, which displays the full error cause chain (e.g., "request failed: 413 Payload Too Large"). The new code removed this wrapping, so {error:#} on the raw error type likely only shows the top-level message since most non-anyhow types ignore the alternate flag. This loses valuable diagnostic context, especially for the 413 errors this PR aims to fix. Computing error_count via .len() before the loop would allow iterating by value and preserving the anyhow wrapping.
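The behavior Bugbot describes can be reproduced with std alone: anyhow's `{error:#}` walks the `source()` chain, while a plain error's `Display` impl typically ignores the alternate flag. A std-only sketch (the `RequestFailed` type and `format_chain` helper are illustrative, not the real code):

```rust
use std::error::Error;
use std::fmt;

// Mimics what anyhow's `{err:#}` formatting does: join the error and
// its `source()` chain with ": ".
fn format_chain(err: &(dyn Error + 'static)) -> String {
    let mut out = err.to_string();
    let mut cur = err.source();
    while let Some(e) = cur {
        out.push_str(": ");
        out.push_str(&e.to_string());
        cur = e.source();
    }
    out
}

// Illustrative upload error with a cause; stands in for the real type.
#[derive(Debug)]
struct RequestFailed {
    source: std::io::Error,
}

impl fmt::Display for RequestFailed {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Ignores the `#` alternate flag, like most non-anyhow error
        // types, so `{error:#}` on the raw error drops the cause.
        write!(f, "request failed")
    }
}

impl Error for RequestFailed {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.source)
    }
}
```

Formatting a `RequestFailed` wrapping a "413 Payload Too Large" cause with plain `{:#}` prints only "request failed", while chain-aware formatting (as anyhow does) prints "request failed: 413 Payload Too Large".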

