26 changes: 19 additions & 7 deletions docs/source/user-guide/data-sources.rst
@@ -224,25 +224,37 @@ A common technique for organizing tables is using a three level hierarchical app
supports this form of organizing using the :py:class:`~datafusion.catalog.Catalog`,
:py:class:`~datafusion.catalog.Schema`, and :py:class:`~datafusion.catalog.Table`. By default,
a :py:class:`~datafusion.context.SessionContext` comes with a single Catalog and a single Schema
with the names ``datafusion`` and ``default``, respectively.
with the names ``datafusion`` and ``public``, respectively.
Contributor


👍 verified locally

uv run --with datafusion python3 -c "
from datafusion import SessionContext
ctx = SessionContext()
print('Catalog names:', ctx.catalog_names())
catalog = ctx.catalog('datafusion')
print('Schema names:', catalog.names())
"

Catalog names: {'datafusion'}
Schema names: {'public'}


The default implementation uses an in-memory approach to the catalog and schema. We have support
for adding additional in-memory catalogs and schemas. This can be done like in the following
for adding additional in-memory catalogs and schemas. You can access tables registered in a schema
either through the Dataframe API or via sql commands. This can be done like in the following
example:

.. code-block:: python

import pyarrow as pa
from datafusion.catalog import Catalog, Schema
from datafusion import SessionContext

ctx = SessionContext()

my_catalog = Catalog.memory_catalog()
my_schema = Schema.memory_schema()
my_schema = Schema.memory_schema()
my_catalog.register_schema('my_schema_name', my_schema)
ctx.register_catalog_provider('my_catalog_name', my_catalog)

my_catalog.register_schema("my_schema_name", my_schema)
# Create an in-memory table
table = pa.table({
'name': ['Bulbasaur', 'Charmander', 'Squirtle'],
'type': ['Grass', 'Fire', 'Water'],
'hp': [45, 39, 44],
})
df = ctx.create_dataframe([table.to_batches()], name='pokemon')

ctx.register_catalog("my_catalog_name", my_catalog)
my_schema.register_table('pokemon', df)

You could then register tables in ``my_schema`` and access them either through the DataFrame
API or via sql commands such as ``"SELECT * from my_catalog_name.my_schema_name.my_table"``.
ctx.sql('SELECT * FROM my_catalog_name.my_schema_name.pokemon').show()

User Defined Catalog and Schema
-------------------------------