Description
Apache Iceberg version
0.11.0 (latest release)
Please describe the bug 🐞
Description:
When passing an options dictionary to Table.scan(options=...), the properties (such as s3.connect-timeout or s3.request-timeout) are accepted by the DataScan object but are never propagated to the underlying FileIO (e.g., PyArrowFileIO) when actual data materialization occurs via methods like to_pandas() or to_arrow().
Because ArrowScan is initialized with the FileIO that was created during catalog instantiation (table.io), any S3-specific configurations provided at the scan level are completely bypassed. This causes operations reading numerous manifest files to fall back to the AWS C++ SDK default timeouts (often 10s-30s), leading to unexpected curlCode: 28 (Timeout was reached) errors even when generous timeouts are explicitly requested in the scan options.
Steps to Reproduce:
```python
# 1. Load catalog with default (or no) S3 timeout properties
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "my_catalog",
    **{
        "uri": "...",
        "s3.endpoint": "...",
    },
)
table = catalog.load_table("my_namespace.my_table")

# 2. Attempt to scan with explicit S3 timeout options
scan_options = {
    "s3.connect-timeout": "600.0",
    "s3.request-timeout": "600.0",
}

# The options are accepted by DataScan...
scan = table.scan(options=scan_options)

# 3. ...but are completely ignored during S3 I/O operations (ArrowScan).
# This may raise a timeout error if RGW/S3 latency spikes, ignoring the 600 s setting above.
df = scan.to_pandas()
```
Expected Behavior:
Properties passed via options in Table.scan() should cascade down and either update or override the table.io.properties for the duration of the scan. Specifically, s3.* configurations should be respected by the underlying FileIO (e.g., PyArrowFileIO) when downloading manifest lists or data files.
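The expected precedence can be illustrated with a plain dictionary merge (the endpoint and timeout values below are hypothetical, not taken from a real deployment):

```python
# Catalog-level FileIO properties, set at load_catalog time (illustrative values).
catalog_props = {
    "s3.endpoint": "http://rgw.local:7480",
    "s3.connect-timeout": "10.0",   # catalog default
}

# Scan-level options passed to Table.scan(options=...).
scan_options = {
    "s3.connect-timeout": "600.0",  # should override the catalog default
    "s3.request-timeout": "600.0",  # should be added for this scan
}

# Expected semantics: scan options win over catalog properties for the
# duration of the scan, without mutating the catalog-level configuration.
effective = {**catalog_props, **scan_options}

assert effective["s3.connect-timeout"] == "600.0"           # override wins
assert effective["s3.request-timeout"] == "600.0"           # new key added
assert effective["s3.endpoint"] == "http://rgw.local:7480"  # base preserved
```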
Actual Behavior:
The options passed to Table.scan() are stored in the DataScan instance but are never passed to the ArrowScan class or the FileIO instance during to_arrow() / to_pandas().
The ArrowScan relies entirely on the unmodified self.io object originally initialized by the catalog:
```python
# In pyiceberg/table/__init__.py -> DataScan.to_arrow()
return ArrowScan(
    self.table_metadata,
    self.io,  # <--- scan-level options are missing here!
    self.projection(),
    self.row_filter,
    self.case_sensitive,
    self.limit,
).to_table(self.plan_files())
```
Environment:
- PyIceberg Version: 0.11.1 (and earlier)
- PyArrow Version: 18.0.0
- Storage: Ceph S3 / Rados Gateway (RGW)
Suggested Fix:
Ideally, DataScan should merge its options with self.io.properties and instantiate a new FileIO, or ArrowScan should be modified to accept the scan-level options and apply them dynamically to the FileSystem instance before reading files.
Willingness to contribute
- I can contribute a fix for this bug independently
- I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- I cannot contribute a fix for this bug at this time