GH-49103: [Python] Add internal type system stubs (_types, error, _stubs_typing) #48622

Open
rok wants to merge 13 commits into apache:main from
rok:pyarrow-stubs-pr2-core-types

Conversation

@rok
Member

@rok rok commented Dec 22, 2025

Rationale for this change

This is the second in a series of PRs adding type annotations to pyarrow and resolving #32609. It builds on top of #48618 and should be merged after it.

What changes are included in this PR?

This adds:

  • _types.pyi - Core type definitions including
  • _stubs_typing.pyi - Internal typing protocols and helpers used across stub files
  • error.pyi - Exception classes (ArrowException, ArrowInvalid, ArrowIOError, etc.)
  • Minimal placeholder stubs - lib.pyi, io.pyi, scalar.pyi - using __getattr__ to allow imports to resolve while deferring to subsequent PRs

Are these changes tested?

Via CI type checks established in #48618.

Are there any user-facing changes?

Users will start seeing some minimal annotated types.

@rok rok force-pushed the pyarrow-stubs-pr2-core-types branch from d3c5740 to 27d1c65 Compare January 26, 2026 12:57
@rok
Member Author

rok commented Jan 26, 2026

I've rebased this on the annotation infra check PR (#48618) to make sure we're on the right track.

@rok rok force-pushed the pyarrow-stubs-pr2-core-types branch 2 times, most recently from 3f9ed3b to 0ac95b0 Compare January 26, 2026 19:11
@rok rok changed the title from "GH-32609: [Python] Add internal type system stubs (_types, error, _stubs_typing)" to "GH-49103: [Python] Add internal type system stubs (_types, error, _stubs_typing)" Jan 31, 2026
@github-actions

⚠️ GitHub issue #49103 has been automatically assigned in GitHub to PR creator.

@rok rok force-pushed the pyarrow-stubs-pr2-core-types branch 3 times, most recently from 7873930 to 43e7cc6 Compare February 9, 2026 19:11
@rok rok marked this pull request as ready for review February 9, 2026 20:09
@rok rok force-pushed the pyarrow-stubs-pr2-core-types branch from 51a7a4e to 7bc0a98 Compare February 13, 2026 15:24
@rok rok force-pushed the pyarrow-stubs-pr2-core-types branch 5 times, most recently from 0d15871 to 8f3796d Compare March 9, 2026 23:51
@rok rok requested a review from dangotbanned March 10, 2026 00:01
@rok rok force-pushed the pyarrow-stubs-pr2-core-types branch from 8f3796d to 72571d2 Compare March 10, 2026 13:39
rok and others added 11 commits March 17, 2026 19:46
diff --git c/python/CMakeLists.txt i/python/CMakeLists.txt
index 6395b3e..f71a495 100644
--- c/python/CMakeLists.txt
+++ i/python/CMakeLists.txt
@@ -1042,9 +1042,9 @@ if(EXISTS "${PYARROW_STUBS_SOURCE_DIR}")
     install(CODE "
       execute_process(
         COMMAND \"${Python3_EXECUTABLE}\"
-                \"${CMAKE_CURRENT_SOURCE_DIR}/scripts/update_stub_docstrings.py\"
+                \"${CMAKE_SOURCE_DIR}/scripts/update_stub_docstrings.py\"
                 \"${CMAKE_INSTALL_PREFIX}\"
-                \"${CMAKE_CURRENT_SOURCE_DIR}\"
+                \"${CMAKE_SOURCE_DIR}\"
         RESULT_VARIABLE _pyarrow_stub_docstrings_result
       )
       if(NOT _pyarrow_stub_docstrings_result EQUAL 0)
Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>
Co-authored-by: Dan Redding <125183946+dangotbanned@users.noreply.github.com>
@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Mar 17, 2026
@rok rok force-pushed the pyarrow-stubs-pr2-core-types branch from 85f625c to 2cddfd2 Compare March 17, 2026 20:45
@github-actions github-actions bot added awaiting change review Awaiting change review awaiting changes Awaiting changes and removed awaiting changes Awaiting changes awaiting change review Awaiting change review labels Mar 17, 2026
@rok
Member Author

rok commented Mar 17, 2026

@dangotbanned I'm about to have limited connectivity for 2 weeks. If you think this is ready we can ask @raulcd to do a last pass and merge, so we get it into the release.

@raulcd I found that I had to do this df7dce5 to get a working editable install. Seems to be due to filename collisions. Thoughts?

@rok rok force-pushed the pyarrow-stubs-pr2-core-types branch from 2cddfd2 to 104369a Compare March 17, 2026 21:34
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Mar 17, 2026
@rok rok requested a review from dangotbanned March 17, 2026 23:07
@raulcd
Member

raulcd commented Mar 18, 2026

@raulcd I found that I had to do this df7dce5 to get a working editable install. Seems to be due to filename collisions. Thoughts?

What was the error? Seems strange that we have to avoid installing a single specific file (lib) for editable builds

@rok
Member Author

rok commented Mar 18, 2026

The error I was getting without the CMake change was:

>>> import pyarrow
Traceback (most recent call last):
  File "<python-input-0>", line 1, in <module>
    import pyarrow
  File "/Users/rok/Documents/repos/arrow/python/pyarrow/__init__.py", line 59, in <module>
    from pyarrow.lib import (BuildInfo, CppBuildInfo, RuntimeInfo, set_timezone_db_path,
    ...<3 lines>...
                             io_thread_count, is_opentelemetry_enabled, set_io_thread_count)
ModuleNotFoundError: No module named 'pyarrow.lib'

With the CMake change it works fine.

In both cases I'm doing:

export ARROW_HOME=$(pwd)/dist
export CMAKE_PREFIX_PATH=$ARROW_HOME:$CMAKE_PREFIX_PATH
export DYLD_LIBRARY_PATH=$ARROW_HOME/lib:$DYLD_LIBRARY_PATH
uv pip install --no-build-isolation --editable .
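
One way to check the filename-collision guess is to list which files in the package directory share the `lib` stem. The helper below is a hypothetical diagnostic sketch, not part of the PR; it assumes the collision, if any, is between files such as a `lib.pyi` stub and the compiled `lib` extension module:

```python
from pathlib import Path


def files_sharing_stem(package_dir, stem):
    """List files in package_dir whose name up to the first dot equals stem."""
    return sorted(
        entry.name
        for entry in Path(package_dir).iterdir()
        if entry.is_file() and entry.name.split(".", 1)[0] == stem
    )


# Assumed usage, run from the python/ directory of an arrow checkout:
#     print(files_sharing_stem("pyarrow", "lib"))
```

If the extension module is missing from the output in the failing setup, that would be consistent with the `ModuleNotFoundError` above, although it would not by itself explain the cause.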

@dangotbanned

@dangotbanned I'm about to have limited connectivity for 2 weeks. If you think this is ready we can ask @raulcd to do a last pass and merge, so we get it into the release.

Sorry for the delay @rok!

Just starting to dig in again now

@dangotbanned dangotbanned left a comment

Thanks @rok, nothing blocking for me.

A lot of these are just filling in (#48622 (comment)) - so don't be alarmed at the size 😂

Comment on lines +51 to +86
IntegerType: TypeAlias = (
lib.Int8Type
| lib.Int16Type
| lib.Int32Type
| lib.Int64Type
| lib.UInt8Type
| lib.UInt16Type
| lib.UInt32Type
| lib.UInt64Type
)

Mask: TypeAlias = (
Sequence[bool | None]
| NDArray[np.bool_]
| lib.Array[lib.Scalar[lib.BoolType]]
| ChunkedArray[Any]
)
Indices: TypeAlias = (
Sequence[int | None]
| NDArray[np.integer[Any]]
| lib.Array[lib.Scalar[IntegerType]]
| ChunkedArray[Any]
)

PyScalar: TypeAlias = (
bool
| int
| float
| Decimal
| str
| bytes
| dt.date
| dt.datetime
| dt.time
| dt.timedelta
)

Not sure how many other cases you might have for this IntoArray.
But the building blocks to get there should be helpful anyway 🙂

Suggested change
-IntegerType: TypeAlias = (
-    lib.Int8Type
-    | lib.Int16Type
-    | lib.Int32Type
-    | lib.Int64Type
-    | lib.UInt8Type
-    | lib.UInt16Type
-    | lib.UInt32Type
-    | lib.UInt64Type
-)
-Mask: TypeAlias = (
-    Sequence[bool | None]
-    | NDArray[np.bool_]
-    | lib.Array[lib.Scalar[lib.BoolType]]
-    | ChunkedArray[Any]
-)
-Indices: TypeAlias = (
-    Sequence[int | None]
-    | NDArray[np.integer[Any]]
-    | lib.Array[lib.Scalar[IntegerType]]
-    | ChunkedArray[Any]
-)
-PyScalar: TypeAlias = (
-    bool
-    | int
-    | float
-    | Decimal
-    | str
-    | bytes
-    | dt.date
-    | dt.datetime
-    | dt.time
-    | dt.timedelta
-)
+IntegerType: TypeAlias = (
+    lib.Int8Type
+    | lib.Int16Type
+    | lib.Int32Type
+    | lib.Int64Type
+    | lib.UInt8Type
+    | lib.UInt16Type
+    | lib.UInt32Type
+    | lib.UInt64Type
+)
+PyScalar: TypeAlias = (
+    bool
+    | int
+    | float
+    | Decimal
+    | str
+    | bytes
+    | dt.date
+    | dt.datetime
+    | dt.time
+    | dt.timedelta
+)
+NumpyScalar: TypeAlias = "np.generic[Any]"
+PyScalarT_co = TypeVar("PyScalarT_co", bound=PyScalar, covariant=True)
+NumpyScalarT_co = TypeVar("NumpyScalarT_co", bound=NumpyScalar, covariant=True)
+DataTypeT_co = TypeVar("DataTypeT_co", bound=lib.DataType, covariant=True)
+IntoArray: TypeAlias = (
+    Sequence[PyScalarT_co | None]
+    | NDArray[NumpyScalarT_co]
+    | lib.Array[lib.Scalar[DataTypeT_co]]
+    | ChunkedArray[Any]
+)
+Mask: TypeAlias = IntoArray[bool, np.bool_, lib.BoolType]
+Indices: TypeAlias = IntoArray[int, np.integer[Any], IntegerType]
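
The idea behind the suggested `IntoArray` alias can be sketched without pyarrow or NumPy: a union that contains type variables is itself a generic alias, and subscripting it fills the type variables in order of first appearance. Everything named `Fake*` below is a stand-in invented for this illustration, and the alias is reduced to two parameters for brevity:

```python
from collections.abc import Sequence
from typing import Generic, Optional, TypeVar, Union

ElemT_co = TypeVar("ElemT_co", covariant=True)
DataTypeT_co = TypeVar("DataTypeT_co", covariant=True)


class FakeArray(Generic[DataTypeT_co]):
    """Stand-in for an Arrow array class; not part of pyarrow."""


class FakeBoolType:
    """Stand-in for a boolean Arrow data type."""


class FakeIntType:
    """Stand-in for an integer Arrow data type."""


# One generic building block: a Python sequence of optional elements,
# or an array parameterized by its data type.
IntoArray = Union[Sequence[Optional[ElemT_co]], FakeArray[DataTypeT_co]]

# Specializations substitute the type variables positionally.
Mask = IntoArray[bool, FakeBoolType]
Indices = IntoArray[int, FakeIntType]
```

Here `Mask` expands to `Union[Sequence[Optional[bool]], FakeArray[FakeBoolType]]`, which mirrors how the suggestion derives `Mask` and `Indices` from a single definition instead of repeating the union.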

    def _import_from_c(cls, in_ptr: int) -> Self: ...
    def __arrow_c_schema__(self) -> Any: ...
    @classmethod
    def _import_from_c_capsule(cls, schema) -> Self: ...

Suggested change
-    def _import_from_c_capsule(cls, schema) -> Self: ...
+    def _import_from_c_capsule(cls, schema: Any) -> Self: ...

UInt64Type,
Int64Type,
)
_BasicValueT = TypeVar("_BasicValueT", bound=_BasicDataType, default=_BasicDataType)

Suggested change
-_BasicValueT = TypeVar("_BasicValueT", bound=_BasicDataType, default=_BasicDataType)
+_BasicValueT = TypeVar(
+    "_BasicValueT", bound=_BasicDataType[Any], default=_BasicDataType[Any]
+)


class StructType(DataType):
    def get_field_index(self, name: str) -> int: ...
    def field(self, i: int | str) -> Field: ...

Suggested change
-    def field(self, i: int | str) -> Field: ...
+    def field(self, i: int | str) -> Field[Any]: ...

    def field(self, i: int | str) -> Field: ...
    def get_all_field_indices(self, name: str) -> list[int]: ...
    def __len__(self) -> int: ...
    def __iter__(self) -> Iterator[Field]: ...

Suggested change
-    def __iter__(self) -> Iterator[Field]: ...
+    def __iter__(self) -> Iterator[Field[Any]]: ...

Comment on lines +545 to +553
def large_list(
value_type: _DataTypeT | Field[_DataTypeT] | None = None,
) -> LargeListType[_DataTypeT]: ...
def list_view(
value_type: _DataTypeT | Field[_DataTypeT] | None = None,
) -> ListViewType[_DataTypeT]: ...
def large_list_view(
value_type: _DataTypeT | Field[_DataTypeT] | None = None,
) -> LargeListViewType[_DataTypeT]: ...

You might need to do another pass on where a default None appears:

Suggested change
-def large_list(
-    value_type: _DataTypeT | Field[_DataTypeT] | None = None,
-) -> LargeListType[_DataTypeT]: ...
-def list_view(
-    value_type: _DataTypeT | Field[_DataTypeT] | None = None,
-) -> ListViewType[_DataTypeT]: ...
-def large_list_view(
-    value_type: _DataTypeT | Field[_DataTypeT] | None = None,
-) -> LargeListViewType[_DataTypeT]: ...
+def large_list(
+    value_type: _DataTypeT | Field[_DataTypeT],
+) -> LargeListType[_DataTypeT]: ...
+def list_view(
+    value_type: _DataTypeT | Field[_DataTypeT],
+) -> ListViewType[_DataTypeT]: ...
+def large_list_view(
+    value_type: _DataTypeT | Field[_DataTypeT],
+) -> LargeListViewType[_DataTypeT]: ...

Using None or no parameters will raise:

>>> pa.list_()
TypeError: list_() takes at least 1 positional argument (0 given)

>>> pa.large_list(None)
TypeError: List requires DataType or Field

def unregister_extension_type(type_name: str) -> None: ...

_StrOrBytes: TypeAlias = str | bytes
_MetadataMapping: TypeAlias = Mapping[_StrOrBytes, _StrOrBytes]

Multi-part suggestion (1/3)

Suggested change
-_MetadataMapping: TypeAlias = Mapping[_StrOrBytes, _StrOrBytes]
+_MetadataMapping: TypeAlias = (
+    Mapping[bytes, bytes] | Mapping[str, str] | Mapping[bytes, str] | Mapping[str, bytes]
+)

Comment on lines +376 to +381
_SchemaMetadataInput: TypeAlias = (
Mapping[bytes, bytes]
| Mapping[str, str]
| Mapping[bytes, str]
| Mapping[str, bytes]
)

Multi-part suggestion (2/3)

Suggested change
-_SchemaMetadataInput: TypeAlias = (
-    Mapping[bytes, bytes]
-    | Mapping[str, str]
-    | Mapping[bytes, str]
-    | Mapping[str, bytes]
-)

| Iterable[tuple[str, _FieldTypeInput]]
| Mapping[Any, _FieldTypeInput]
),
metadata: _SchemaMetadataInput | None = None,

Multi-part suggestion (3/3)

Suggested change
-    metadata: _SchemaMetadataInput | None = None,
+    metadata: _MetadataMapping | None = None,
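
The review does not state why the union of four `Mapping` types is preferred over `Mapping[_StrOrBytes, _StrOrBytes]`. One plausible motivation is that `Mapping` is invariant in its key type, so a plain `dict[str, str]` is typically rejected by type checkers where `Mapping[str | bytes, str | bytes]` is expected. The sketch below uses hypothetical names (`NarrowAlias`, `WideAlias`, `normalize`) and does not reproduce pyarrow's own metadata handling:

```python
from collections.abc import Mapping
from typing import Union

StrOrBytes = Union[str, bytes]

# A single Mapping with union key/value types. Because Mapping is invariant
# in its key type, dict[str, str] is not assignable to this alias.
NarrowAlias = Mapping[StrOrBytes, StrOrBytes]

# A union of concrete Mapping types, which accepts each homogeneous dict.
WideAlias = Union[
    Mapping[bytes, bytes],
    Mapping[str, str],
    Mapping[bytes, str],
    Mapping[str, bytes],
]


def _to_bytes(value: StrOrBytes) -> bytes:
    # Hypothetical normalization: encode str as UTF-8, pass bytes through.
    return value.encode("utf-8") if isinstance(value, str) else value


def normalize(metadata: WideAlias) -> dict[bytes, bytes]:
    """Illustrative helper that converts all keys and values to bytes."""
    return {_to_bytes(k): _to_bytes(v) for k, v in metadata.items()}


str_meta: dict[str, str] = {"origin": "sensor"}
normalized = normalize(str_meta)  # accepted by a type checker under WideAlias
```

Annotating the parameter with `NarrowAlias` instead would leave the runtime behavior unchanged, but a checker would be expected to flag the `normalize(str_meta)` call.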

Comment on lines +118 to +119
class Buffer(Protocol): ...
class SupportPyBuffer(Protocol): ...

Yes, these are to be expanded in subsequent PRs.

Would you suggest a change at this point?

No need, was just curious 😄


Development

Successfully merging this pull request may close these issues.

[Python][Annotations] Add internal type system stubs (_types, error, _stubs_typing)

3 participants