feat: Add Polars eager DataFrame support#82
Conversation
Add support for eagerly evaluated Polars DataFrames via a new PolarsImplementation class, registered under the "polars-eager" native type. This includes runtime detection (sys.modules), IPython display formatting, VegaFusion chart integration, filtering, serialization, and a full test suite mirroring existing pandas/PySpark coverage. Also adds polars[rtcompat] as a dev dependency and exposes a `lazy` property on oc.DataFrame to distinguish eager vs lazy backends. Contributes to BLU-5849
|
Note Reviews pausedUse the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThe PR adds first-class support for Polars eager DataFrames: a new PolarsEagerImplementation module implements filtering, sorting, pagination, serialization, and inspection; type aliases and runtime/native-type handling gain "polars-eager"; the DataFrame wrapper detects/wraps Polars eager frames and exposes a .lazy property; chart runtime and inline-dataset paths treat "polars-eager" like "pandas"; a MIME formatter for Polars is registered; tests and fixtures are extended to exercise Polars behavior. Sequence DiagramsequenceDiagram
participant Client
participant DataFrame
participant PolarsEagerImpl
participant Serializer
Client->>DataFrame: from_native(polars_df)
activate DataFrame
DataFrame->>DataFrame: is_polars_eager_dataframe?
DataFrame->>PolarsEagerImpl: construct(polars_df)
activate PolarsEagerImpl
PolarsEagerImpl->>PolarsEagerImpl: store native reference
deactivate PolarsEagerImpl
DataFrame->>Client: wrapped DataFrame
deactivate DataFrame
Client->>DataFrame: filter()/sort()/paginate()
activate DataFrame
DataFrame->>PolarsEagerImpl: call operation
activate PolarsEagerImpl
PolarsEagerImpl->>PolarsEagerImpl: build/execute polars expressions
PolarsEagerImpl->>PolarsEagerImpl: return new wrapper
deactivate PolarsEagerImpl
DataFrame->>Client: new DataFrame
Client->>DataFrame: to_records()/to_csv()
activate DataFrame
DataFrame->>PolarsEagerImpl: prepare_for_serialization()
activate PolarsEagerImpl
PolarsEagerImpl->>Serializer: serialize/truncate/format
deactivate PolarsEagerImpl
Serializer->>Client: records / csv
deactivate DataFrame
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
|
📦 Python package built successfully!
|
There was a problem hiding this comment.
Actionable comments posted: 5
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@deepnote_toolkit/ocelots/polars/implementation.py`:
- Around line 282-288: The branch checking string types duplicates aliases
(pl.String and pl.Utf8); update the string-type condition in the block that
builds select_exprs (the clause that currently does "elif dtype == pl.String or
dtype == pl.Utf8") to use a single canonical check (e.g., "dtype == pl.Utf8" or
"dtype in (pl.Utf8, pl.String)") so the slice-and-alias expression
(col.str.slice(0, MAX_STRING_CELL_LENGTH).alias(name)) is only added once; keep
the surrounding integer/float check (dtype.is_integer() or dtype.is_float()) and
subsequent nested/binary/temporal checks unchanged.
- Around line 223-228: The IS_ON branch currently compares only calendar dates
which differs from pandas' exact-timestamp (timezone-normalized) semantics;
update the logic in the FilterOperator.IS_ON handling (around
_resolve_temporal_col and the condition variable) to normalize both the column
and the comparative literal to a common timezone (e.g. convert/replace to UTC)
and then compare full datetimes for equality instead of using .dt.date(), so the
polars condition mirrors pandas exact-timestamp matching.
- Line 234: The assignment now = datetime.now().astimezone() can yield
inconsistent results when the system timezone is unset; replace the call by
producing a consistent UTC-aware timestamp (e.g., use datetime.now(timezone.utc)
or datetime.now(tz=timezone.utc)) and import timezone from datetime as needed,
updating the occurrence in implementation.py where now is set (the now variable
assigned via datetime.now().astimezone()) so downstream logic always receives a
deterministic UTC-aware datetime.
In `@deepnote_toolkit/ocelots/utils.py`:
- Around line 45-50: The is_polars_dataframe function uses a redundant if/return
pattern; simplify by returning the boolean expression directly. Replace the
conditional block that checks sys.modules.get("polars") and isinstance(df,
polars_module.DataFrame) with a single return of that combined condition
(keeping the TypeGuard[PolarsDF] signature and references to polars_module and
sys.modules intact).
In `@tests/unit/test_ocelots.py`:
- Around line 55-57: The docstring in tests/unit/test_ocelots.py incorrectly
references a nonexistent parameter include_polars; remove that reference and
update the docstring parameter list and descriptions to match the current
function/test signature (replace or delete include_polars and ensure any
remaining params are described accurately), updating the explanatory text so it
aligns with the actual arguments used by the tests.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 04838624-ac27-42d0-b988-98fa469829ca
⛔ Files ignored due to path filters (1)
poetry.lockis excluded by!**/*.lock
📒 Files selected for processing (14)
deepnote_toolkit/chart/deepnote_chart.pydeepnote_toolkit/dataframe_utils.pydeepnote_toolkit/ocelots/__init__.pydeepnote_toolkit/ocelots/dataframe.pydeepnote_toolkit/ocelots/pandas/implementation.pydeepnote_toolkit/ocelots/polars/__init__.pydeepnote_toolkit/ocelots/polars/implementation.pydeepnote_toolkit/ocelots/pyspark/implementation.pydeepnote_toolkit/ocelots/types.pydeepnote_toolkit/ocelots/utils.pypyproject.tomltests/unit/helpers/testing_dataframes.pytests/unit/test_ocelots.pytests/unit/test_sql_execution.py
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #82 +/- ##
==========================================
+ Coverage 73.71% 74.32% +0.60%
==========================================
Files 93 94 +1
Lines 5284 5534 +250
Branches 766 824 +58
==========================================
+ Hits 3895 4113 +218
- Misses 1144 1155 +11
- Partials 245 266 +21
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@deepnote_toolkit/chart/deepnote_chart.py`:
- Around line 31-33: The checks for eager backends using the literal tuple
("pandas", "polars-eager") are duplicated; create a single shared constant
(e.g., EAGER_BACKENDS = ("pandas", "polars-eager")) or a small helper like
is_eager_backend(native_type) and replace direct tuple checks on
oc_df.native_type in deepnote_chart.py (places referencing oc_df.native_type and
the code paths that return vegafusion.VegaFusionRuntime()) with the
constant/helper, and update the other occurrence(s) (the similar check around
the second branch) to use the same constant/helper so backend support is
centralized and consistent.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 8d86af49-06fc-4621-a7f5-0a9ea570d524
📒 Files selected for processing (1)
deepnote_toolkit/chart/deepnote_chart.py
|
🚀 Review App Deployment Started
|
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@deepnote_toolkit/ocelots/polars/implementation_eager.py`:
- Around line 280-298: The loop that builds select_exprs treats boolean columns
like other types and casts them to string, breaking JSON boolean values; update
the branch logic in the for name, dtype in zip(self._df.columns,
self._df.dtypes) loop to explicitly detect boolean dtypes (e.g., dtype ==
pl.Boolean or dtype.is_boolean()) and append the column expression unchanged
(select_exprs.append(col)) before the string-cast fallback so boolean columns
remain boolean in JSON output; keep the existing handling for integers/floats,
string slicing, nested/binary/temporal (with _stringify_element), and the final
cast-to-string fallback for other types.
- Around line 242-251: The past-window branches for relative values ("week-ago",
"month-ago", "quarter-ago", "half-year-ago", "year-ago") only apply a lower
bound using dt_col >= pl.lit(now - ...), which allows future dates to match;
update each branch in implementation_eager.py so the condition uses both a lower
and an upper bound (e.g., dt_col >= pl.lit(now - delta) AND dt_col <=
pl.lit(now)) so rows dated after now are excluded—use the existing dt_col and
now symbols and combine with pl.lit(now) for the upper bound in the conditions.
- Around line 157-160: The boolean branch (when dtype == pl.Boolean) currently
assumes comparative_values are strings and does v.lower(), which fails for real
Python booleans; change the mapping over filter_obj.comparative_values to handle
types safely: for each v, if isinstance(v, str) use v.lower() == "true", elif
isinstance(v, bool) use v directly, else convert with bool(v) or skip/raise as
your semantics require; also update the surrounding error handling to catch
AttributeError in addition to ValueError/TypeError so non-string inputs don't
crash the IS_ONE_OF boolean filter logic.
In `@deepnote_toolkit/ocelots/utils.py`:
- Around line 45-50: Add an explicit typed parameter and docstring to the
exported predicate is_polars_eager_dataframe: change the signature to annotate
the parameter (e.g., df: Any) and keep the TypeGuard[PolarsEagerDF] return, and
add a one- or two-sentence docstring explaining that the function detects a
polars eager DataFrame by checking sys.modules for "polars" and using isinstance
against polars.DataFrame; also mention that it returns True if df is a polars
eager DataFrame and False otherwise, and update imports if needed (Any,
TypeGuard) so the predicate is fully typed and documented.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 0349830b-c4e1-4de0-b1d1-1e484766e6b3
📒 Files selected for processing (6)
deepnote_toolkit/ocelots/__init__.pydeepnote_toolkit/ocelots/dataframe.pydeepnote_toolkit/ocelots/polars/implementation_eager.pydeepnote_toolkit/ocelots/types.pydeepnote_toolkit/ocelots/utils.pytests/unit/test_ocelots.py
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (3)
deepnote_toolkit/ocelots/polars/implementation_eager.py (3)
242-251:⚠️ Potential issue | 🟠 MajorPast-window filters match future dates.
week-ago,month-ago, etc. only set a lower bound. A row dated tomorrow passes these filters.🛠️ Suggested fix
elif relative == "week-ago": - condition = dt_col >= pl.lit(now - timedelta(weeks=1)) + condition = dt_col.is_between( + pl.lit(now - timedelta(weeks=1)), pl.lit(now) + ) elif relative == "month-ago": - condition = dt_col >= pl.lit(now - timedelta(days=30)) + condition = dt_col.is_between( + pl.lit(now - timedelta(days=30)), pl.lit(now) + ) elif relative == "quarter-ago": - condition = dt_col >= pl.lit(now - timedelta(days=90)) + condition = dt_col.is_between( + pl.lit(now - timedelta(days=90)), pl.lit(now) + ) elif relative == "half-year-ago": - condition = dt_col >= pl.lit(now - timedelta(days=180)) + condition = dt_col.is_between( + pl.lit(now - timedelta(days=180)), pl.lit(now) + ) elif relative == "year-ago": - condition = dt_col >= pl.lit(now - timedelta(days=365)) + condition = dt_col.is_between( + pl.lit(now - timedelta(days=365)), pl.lit(now) + )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deepnote_toolkit/ocelots/polars/implementation_eager.py` around lines 242 - 251, The relative "past-window" branches (for values like "week-ago", "month-ago", "quarter-ago", "half-year-ago", "year-ago") only apply a lower bound (dt_col >= now - delta) so future dates still pass; update each branch in implementation_eager.py where condition is set (the clauses referencing dt_col and relative) to enforce a two-sided interval by also requiring dt_col <= pl.lit(now) (i.e., set condition to dt_col >= pl.lit(now - timedelta(...)) & dt_col <= pl.lit(now)) so rows dated in the future are excluded.
157-160:⚠️ Potential issue | 🟠 MajorBoolean membership values can crash on non-string input.
filter_obj.comparative_valuesmay contain Pythonboolvalues. Calling.lower()onTrueraisesAttributeError.🛠️ Suggested fix
if dtype == pl.Boolean: values = [ - v.lower() == "true" for v in filter_obj.comparative_values + (v if isinstance(v, bool) else str(v).lower() == "true") + for v in filter_obj.comparative_values ] condition = col.is_in(values)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deepnote_toolkit/ocelots/polars/implementation_eager.py` around lines 157 - 160, In the dtype == pl.Boolean branch in implementation_eager.py where values are computed from filter_obj.comparative_values, currently v.lower() assumes each comparative value is a string and will raise AttributeError for Python bools; update the mapping to safely coerce non-string inputs (e.g., use str(v).lower() == "true" or check isinstance(v, bool) and use v directly) so both bool and string representations ("true"/"false") are handled without exceptions when building values for the boolean filter.
284-287: 🧹 Nitpick | 🔵 Trivial
pl.Utf8alias not handled.
pl.Stringandpl.Utf8are aliases. Older Polars versions useUtf8.♻️ Suggested fix
- elif dtype == pl.String: + elif dtype in (pl.String, pl.Utf8): select_exprs.append( col.str.slice(0, MAX_STRING_CELL_LENGTH).alias(name) )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@deepnote_toolkit/ocelots/polars/implementation_eager.py` around lines 284 - 287, The dtype check only handles pl.String but not the alias pl.Utf8, causing issues on older Polars; update the branch that builds select_exprs (the one using col.str.slice(0, MAX_STRING_CELL_LENGTH).alias(name)) to also trigger when dtype is pl.Utf8 (e.g., check dtype in (pl.String, pl.Utf8) or add an OR condition for pl.Utf8) so both aliases are handled consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@deepnote_toolkit/ocelots/polars/implementation_eager.py`:
- Around line 326-344: The broad except in get_columns_distinct_values silently
swallows all errors; replace it by catching only expected Polars errors (e.g.
import and catch polars.exceptions.ComputeError) and log the failure with the
column name and exception (using the class logger, e.g. self._logger.error or
logging.exception) while letting unexpected exceptions propagate (re-raise) so
real bugs are not hidden; reference get_columns_distinct_values, unique_values
and result[column_name] when implementing the safer except ComputeError as e:
logging and handling logic.
---
Duplicate comments:
In `@deepnote_toolkit/ocelots/polars/implementation_eager.py`:
- Around line 242-251: The relative "past-window" branches (for values like
"week-ago", "month-ago", "quarter-ago", "half-year-ago", "year-ago") only apply
a lower bound (dt_col >= now - delta) so future dates still pass; update each
branch in implementation_eager.py where condition is set (the clauses
referencing dt_col and relative) to enforce a two-sided interval by also
requiring dt_col <= pl.lit(now) (i.e., set condition to dt_col >= pl.lit(now -
timedelta(...)) & dt_col <= pl.lit(now)) so rows dated in the future are
excluded.
- Around line 157-160: In the dtype == pl.Boolean branch in
implementation_eager.py where values are computed from
filter_obj.comparative_values, currently v.lower() assumes each comparative
value is a string and will raise AttributeError for Python bools; update the
mapping to safely coerce non-string inputs (e.g., use str(v).lower() == "true"
or check isinstance(v, bool) and use v directly) so both bool and string
representations ("true"/"false") are handled without exceptions when building
values for the boolean filter.
- Around line 284-287: The dtype check only handles pl.String but not the alias
pl.Utf8, causing issues on older Polars; update the branch that builds
select_exprs (the one using col.str.slice(0,
MAX_STRING_CELL_LENGTH).alias(name)) to also trigger when dtype is pl.Utf8
(e.g., check dtype in (pl.String, pl.Utf8) or add an OR condition for pl.Utf8)
so both aliases are handled consistently.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 0ea66e19-ab6b-4e07-9b33-c9893142b380
📒 Files selected for processing (3)
deepnote_toolkit/ocelots/dataframe.pydeepnote_toolkit/ocelots/polars/implementation_eager.pytests/unit/test_ocelots.py
|
@coderabbitai approve and pause |
|
✅ Actions performedComments resolved and changes approved. Reviews paused. |
Add support for eagerly evaluated Polars DataFrames via a new
PolarsImplementationclass for ocelots.Also adds polars[rtcompat] as a dev dependency and exposes a
lazyproperty on oc.DataFrame to make it easier to distinguish eager vs lazy backends (esp. in Deepnote Cloud).Contributes to BLU-5849
Summary by CodeRabbit
New Features
Tests
Chores