PyIceberg
The open-source PyIceberg library works natively with Tabular, providing access to warehouses without the need for a JVM or a query engine (such as Spark, Flink, or Trino).
Installing PyIceberg
PyIceberg is available from PyPI and can be installed via pip:
pip install "pyiceberg[s3fs,pyarrow]"
REST Catalog Configuration
Tabular can be configured using the native REST Catalog support in Iceberg via a config file, environment variables, or programmatically. Best practice is to keep credentials in config files or environment variables so that they are not unintentionally exposed.
PyIceberg Config
Using a ~/.pyiceberg.yaml configuration file allows you to define multiple catalogs that correspond to your Tabular warehouses. Each catalog can be referenced directly by name; if no name is given, the default catalog is used.
# ~/.pyiceberg.yaml
catalog:
  default:
    uri: https://api.tabular.io/ws/
    warehouse: sandbox
    credential: <credential>
  production:
    uri: https://api.tabular.io/ws/
    warehouse: production
    credential: <credential>
  development:
    uri: https://api.tabular.io/ws/
    warehouse: development
    credential: <credential>
Environment Variable Configuration
As an alternative to a config file, you can specify the catalog configuration using environment variables like the following (__<CATALOG>__ denotes the catalog name; for example, __PROD__ defines a catalog named prod):
export PYICEBERG_CATALOG__DEFAULT__URI=https://api.tabular.io/ws
export PYICEBERG_CATALOG__DEFAULT__WAREHOUSE=sandbox
export PYICEBERG_CATALOG__DEFAULT__CREDENTIAL=<credential>
export PYICEBERG_CATALOG__PRODUCTION__URI=https://api.tabular.io/ws
export PYICEBERG_CATALOG__PRODUCTION__WAREHOUSE=production
export PYICEBERG_CATALOG__PRODUCTION__CREDENTIAL=<credential>
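The naming convention above can be illustrated with a short sketch. It is not PyIceberg's own parsing code; the helper name catalogs_from_env is hypothetical, and the sketch assumes only the PYICEBERG_CATALOG__<CATALOG>__<KEY> pattern shown in the exports, with names and keys lowercased.

```python
from typing import Dict, Mapping

PREFIX = "PYICEBERG_CATALOG__"


def catalogs_from_env(env: Mapping[str, str]) -> Dict[str, Dict[str, str]]:
    """Group PYICEBERG_CATALOG__<CATALOG>__<KEY> variables by catalog name.

    Hypothetical helper for illustration only; PyIceberg reads these
    variables itself when a catalog is loaded.
    """
    catalogs: Dict[str, Dict[str, str]] = {}
    for name, value in env.items():
        if not name.startswith(PREFIX):
            continue
        # Split "<CATALOG>__<KEY>" on the first double underscore.
        catalog, sep, key = name[len(PREFIX):].partition("__")
        if not sep or not catalog or not key:
            continue  # not of the form <CATALOG>__<KEY>
        catalogs.setdefault(catalog.lower(), {})[key.lower()] = value
    return catalogs


example_env = {
    "PYICEBERG_CATALOG__PRODUCTION__URI": "https://api.tabular.io/ws",
    "PYICEBERG_CATALOG__PRODUCTION__WAREHOUSE": "production",
    "HOME": "/home/user",  # unrelated variables are ignored
}
print(catalogs_from_env(example_env))
```

Passing os.environ instead of example_env gives a quick way to check which catalog names the current shell defines before calling load_catalog.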
Programmatic Configuration
A Tabular catalog can also be configured programmatically when it is instantiated:
# With a ~/.pyiceberg.yaml or environment variable configuration (preferred)
from pyiceberg.catalog import load_catalog
catalog = load_catalog('production')  # the name of a catalog defined in your configuration
catalog.list_namespaces()
# Without a ~/.pyiceberg.yaml
from pyiceberg.catalog import load_catalog
catalog = load_catalog(
    'default',
    uri='https://api.tabular.io/ws/',
    warehouse='sandbox',
    credential='<credential>'
)
catalog.list_namespaces()
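When a catalog is configured programmatically, the credential can still be kept out of the source code, in line with the best practice noted earlier. The following is a minimal sketch under stated assumptions: the environment variable names TABULAR_CREDENTIAL and TABULAR_URI and the helpers catalog_properties and connect are hypothetical, and only load_catalog comes from PyIceberg.

```python
import os
from typing import Dict, Mapping


def catalog_properties(env: Mapping[str, str], warehouse: str) -> Dict[str, str]:
    """Build load_catalog keyword arguments from environment variables.

    TABULAR_CREDENTIAL and TABULAR_URI are hypothetical variable names
    chosen for this sketch; they are not read by PyIceberg itself.
    """
    credential = env.get("TABULAR_CREDENTIAL")
    if not credential:
        raise KeyError("TABULAR_CREDENTIAL is not set")
    return {
        "uri": env.get("TABULAR_URI", "https://api.tabular.io/ws/"),
        "warehouse": warehouse,
        "credential": credential,
    }


def connect(warehouse: str = "sandbox"):
    """Load the catalog using properties taken from the process environment."""
    # Imported lazily so the helper above can be used without PyIceberg installed.
    from pyiceberg.catalog import load_catalog

    return load_catalog("default", **catalog_properties(os.environ, warehouse))
```

Calling connect("sandbox").list_namespaces() then behaves like the example above, but the credential never appears in the script.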
Tabular Python Library
The TabularIO Python project is available from PyPI and provides a number of utilities and features for working with Tabular. The library depends on PyIceberg and uses the PyIceberg catalog configuration to reference tables.
Installing TabularIO
pip install tabulario
Commandline
Installing tabulario provides access to the tab command, which is useful for token exchange operations.
Example:
tab request-token <credential> | jq
{
  "access_token": "eyJ0eXAiOiJKV...",
  "issued_token_type": "urn:ietf:params:oauth:token-type:access_token",
  "token_type": "Bearer",
  "expires_in": 86400,
  "scope": null,
  "warehouse_id": "8bcb0838-50fc-472d-9ddb-8feb89ef5f1e",
  "region": "us-west-2"
}
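A response like the one above can be consumed from Python as well as from jq. The sketch below assumes only the field names visible in the sample output (access_token, token_type, expires_in); the Token type and parse_token_response helper are hypothetical and do not invoke the tab command themselves.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import NamedTuple


class Token(NamedTuple):
    access_token: str
    token_type: str
    expires_at: datetime


def parse_token_response(raw: str, issued_at: datetime) -> Token:
    """Parse the JSON text of a token response and compute its expiry time.

    expires_in is treated as a lifetime in seconds counted from issued_at.
    """
    payload = json.loads(raw)
    return Token(
        access_token=payload["access_token"],
        token_type=payload["token_type"],
        expires_at=issued_at + timedelta(seconds=payload["expires_in"]),
    )


# Placeholder response with the same shape as the sample output.
sample = '{"access_token": "<token>", "token_type": "Bearer", "expires_in": 86400}'
token = parse_token_response(sample, issued_at=datetime.now(timezone.utc))
print(token.token_type, token.expires_at.isoformat())
```

With expires_in set to 86400, the computed expiry falls one day after the time the token was issued.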
File Loader
File loading is enabled on a table by using the loader package.
First enable loading:
from tabular import loader
loader.enable_loading(
    identifier="prod.default.invoices_table",
    file_type="csv",
    mode="append",
    override=True
)
Full documentation:
Signature:
loader.enable_loading(
    identifier: Union[Tuple[str, ...], str],
    file_type: Literal['csv', 'json', 'parquet'],
    mode: Literal['append', 'replace'],
    delim: str = ',',
    catalog: Optional[pyiceberg.catalog.Catalog] = None,
    override: bool = False,
)
Docstring:
Enable data loader for a given table. See https://docs.tabular.io/tables
    :param identifier: table identifier string or tuple
    :param file_type: csv, json, parquet
    :param mode: append or replace
    :param delim: delimiter for csv files
    :param catalog: optional catalog ('default' if not provided in table identifier)
    :param override: override the loader configuration if already enabled on the target table
Then load data into the table (the loading process may take a few minutes):
loader.ingest("acme.default.f3", "/tmp/data.json")
Full documentation:
Signature:
loader.ingest(
    identifier: Union[Tuple[str, ...], str],
    file: str,
    catalog: Optional[pyiceberg.catalog.Catalog] = None,
)
Docstring:
Ingest data into the provided Iceberg table by copying the provided file
into the configured loader path in S3.
    :param identifier: target table for loading data
    :param file: path to the file to be loaded
    :param catalog: optional catalog ('default' if not provided in table identifier)
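The two steps can be combined into one helper that checks its arguments before contacting Tabular. This is a sketch based only on the signatures documented above: check_loader_args and enable_and_ingest are hypothetical names, the allowed values are copied from the Literal types in the enable_loading signature, and the loader calls assume the tabulario package is installed and a catalog is configured.

```python
# Allowed values, copied from the enable_loading signature above.
FILE_TYPES = ("csv", "json", "parquet")
MODES = ("append", "replace")


def check_loader_args(file_type: str, mode: str) -> None:
    """Raise ValueError early for values the loader signature does not list."""
    if file_type not in FILE_TYPES:
        raise ValueError(f"file_type must be one of {FILE_TYPES}, got {file_type!r}")
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}, got {mode!r}")


def enable_and_ingest(identifier: str, file: str, file_type: str, mode: str = "append") -> None:
    """Enable loading on a table, then ingest one file into it."""
    check_loader_args(file_type, mode)
    # Imported lazily: requires `pip install tabulario` and a configured catalog.
    from tabular import loader

    loader.enable_loading(identifier=identifier, file_type=file_type, mode=mode)
    loader.ingest(identifier, file)
```

For example, enable_and_ingest("prod.default.invoices_table", "/tmp/data.csv", "csv") would enable CSV loading in append mode and then submit the file. Validating first means a mistyped file type fails locally before any request is made.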