Python

PyIceberg

The open-source PyIceberg library works natively with Tabular, providing access to warehouses without the need for a JVM or a query engine (such as Spark, Flink, or Trino).

Installing PyIceberg

PyIceberg is available from PyPI and can be installed via pip:

pip install "pyiceberg[s3fs,pyarrow]"

REST Catalog Configuration

Tabular can be configured through PyIceberg's native REST catalog support using a config file, environment variables, or programmatic configuration. The best practice is to keep credentials in config files or environment variables so that they are not unintentionally exposed in code.

PyIceberg Config

A ~/.pyiceberg.yaml configuration file lets you define multiple catalogs, each corresponding to one of your Tabular warehouses. Each catalog can be referenced by name; if no name is given, the catalog named default is used.

# ~/.pyiceberg.yaml

catalog:
  default:
    uri: https://api.tabular.io/ws/
    warehouse: sandbox
    credential: <credential>

  production:
    uri: https://api.tabular.io/ws/
    warehouse: production
    credential: <credential>

  development:
    uri: https://api.tabular.io/ws/
    warehouse: development
    credential: <credential>

Environment Variable Configuration

As an alternative to a config file, you can specify the catalog configuration using environment variables like the following. The __<CATALOG>__ segment denotes the catalog name; for example, __PROD__ defines a catalog named prod:

export PYICEBERG_CATALOG__DEFAULT__URI=https://api.tabular.io/ws
export PYICEBERG_CATALOG__DEFAULT__WAREHOUSE=sandbox
export PYICEBERG_CATALOG__DEFAULT__CREDENTIAL=<credential>

export PYICEBERG_CATALOG__PRODUCTION__URI=https://api.tabular.io/ws
export PYICEBERG_CATALOG__PRODUCTION__WAREHOUSE=production
export PYICEBERG_CATALOG__PRODUCTION__CREDENTIAL=<credential>
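
The naming convention above can be illustrated with a short sketch. The helper catalogs_from_env is hypothetical and is not how PyIceberg itself reads its configuration; it only shows how variable names of the form PYICEBERG_CATALOG__<CATALOG>__<KEY> map to catalog names and properties, under the simplifying assumption that both parts are lower-cased.

```python
from typing import Dict, Mapping

PREFIX = "PYICEBERG_CATALOG__"


def catalogs_from_env(environ: Mapping[str, str]) -> Dict[str, Dict[str, str]]:
    """Group PYICEBERG_CATALOG__<CATALOG>__<KEY> variables by catalog name.

    Hypothetical helper for illustration only; it assumes the catalog name
    and the property key are simply lower-cased.
    """
    catalogs: Dict[str, Dict[str, str]] = {}
    for name, value in environ.items():
        if not name.startswith(PREFIX):
            continue
        # Split e.g. "PRODUCTION__URI" into the catalog name and the key.
        catalog, sep, key = name[len(PREFIX):].partition("__")
        if not sep or not catalog or not key:
            continue  # Not of the form <CATALOG>__<KEY>; skip it.
        catalogs.setdefault(catalog.lower(), {})[key.lower()] = value
    return catalogs


example_env = {
    "PYICEBERG_CATALOG__DEFAULT__URI": "https://api.tabular.io/ws",
    "PYICEBERG_CATALOG__DEFAULT__WAREHOUSE": "sandbox",
    "PYICEBERG_CATALOG__PRODUCTION__WAREHOUSE": "production",
    "HOME": "/home/user",  # unrelated variables are ignored
}
parsed = catalogs_from_env(example_env)
print(sorted(parsed))  # catalog names found in the mapping
```

Passing a plain dictionary instead of os.environ keeps the sketch easy to try without changing the real environment.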

Programmatic Configuration

A Tabular catalog can also be configured programmatically when it is instantiated:

# With a ~/.pyiceberg.yaml or environment variable configuration (preferred)

from pyiceberg.catalog import load_catalog

# The name must match a catalog defined in the configuration
catalog = load_catalog('production')
catalog.list_namespaces()

# Without a ~/.pyiceberg.yaml

from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    'default',
    uri='https://api.tabular.io/ws/',
    warehouse='sandbox',
    credential='<credential>'
)
catalog.list_namespaces()
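
When programmatic configuration is unavoidable, the credential can still be kept out of the source code. The following is a minimal sketch, assuming the credential is stored in an environment variable; the name TABULAR_CREDENTIAL and the helpers catalog_properties and connect are hypothetical, while load_catalog is used the same way as in the example above.

```python
import os
from typing import Dict, Mapping


def catalog_properties(environ: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build REST catalog properties without hard-coding the credential.

    TABULAR_CREDENTIAL is a hypothetical variable name chosen for this sketch.
    """
    credential = environ.get("TABULAR_CREDENTIAL")
    if not credential:
        raise RuntimeError("Set TABULAR_CREDENTIAL before connecting")
    return {
        "uri": "https://api.tabular.io/ws/",
        "warehouse": "sandbox",
        "credential": credential,
    }


def connect():
    """Load the 'default' catalog with properties taken from the environment."""
    # Imported here so catalog_properties works without PyIceberg installed.
    from pyiceberg.catalog import load_catalog

    return load_catalog("default", **catalog_properties())
```

Calling connect() requires PyIceberg to be installed and a valid credential; catalog_properties can be checked on its own by passing a dictionary.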

Tabular Python Library

The TabularIO Python project is available from PyPI and provides a number of utilities and features for working with Tabular. The library depends on PyIceberg and uses its catalog configuration to reference tables.

Installing TabularIO

pip install tabulario

Command Line

Installing tabulario provides access to the tab command, which is useful for token exchange operations.

Example:

tab request-token <credential> | jq
{
    "access_token": "eyJ0eXAiOiJKV...",
    "issued_token_type": "urn:ietf:params:oauth:token-type:access_token",
    "token_type": "Bearer",
    "expires_in": 86400,
    "scope": null,
    "warehouse_id": "8bcb0838-50fc-472d-9ddb-8feb89ef5f1e",
    "region": "us-west-2"
}
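
Because the response is plain JSON, a script can read it directly. The sketch below assumes the output of tab request-token has already been captured as a string and that expires_in is a lifetime in seconds, as is standard for OAuth 2.0 token responses. The helper parse_token_response is hypothetical, and the sample payload is an abbreviated copy of the response above.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Tuple


def parse_token_response(raw: str, issued_at: datetime) -> Tuple[str, datetime]:
    """Return the access token and the time at which it expires."""
    payload = json.loads(raw)
    if payload.get("token_type", "").lower() != "bearer":
        raise ValueError("expected a Bearer token")
    # expires_in is treated as a lifetime in seconds from the time of issue.
    expires_at = issued_at + timedelta(seconds=payload["expires_in"])
    return payload["access_token"], expires_at


sample = """
{
    "access_token": "eyJ0eXAiOiJKV...",
    "token_type": "Bearer",
    "expires_in": 86400
}
"""
issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
token, expires_at = parse_token_response(sample, issued)
print(expires_at.isoformat())  # 2024-01-02T00:00:00+00:00
```

With expires_in set to 86400, the token in the example is valid for 24 hours after it is issued.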

File Loader

File loading is enabled on a table through the loader module.

First enable loading:

from tabular import loader

loader.enable_loading(
    identifier="prod.default.invoices_table", 
    file_type="csv", 
    mode="append", 
    override=True
)

Full documentation:

Signature:
loader.enable_loading(
    identifier: Union[Tuple[str, ...], str],
    file_type: Literal['csv', 'json', 'parquet'],
    mode: Literal['append', 'replace'],
    delim: str = ',',
    catalog: Optional[pyiceberg.catalog.Catalog] = None,
    override: bool = False,
)
Docstring:
Enable data loader for a given table. See https://docs.tabular.io/tables

:param identifier: table identifier string or tuple
:param file_type: csv, json, parquet
:param mode: append or replace
:param delim: delimiter for csv files
:param catalog: optional catalog ('default' if not provided in table identifier)
:param override: override the loader configuration if already enabled on the target table
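
Since file_type and mode accept only the literal values listed in the signature, it can be useful to check the arguments before making the call. The helper loader_arguments below is hypothetical and not part of tabulario; it is a sketch that assumes the tuple form of identifier carries the same parts as the dotted string.

```python
from typing import Any, Dict, Tuple, Union

# Allowed values, copied from the enable_loading signature above.
FILE_TYPES = ("csv", "json", "parquet")
MODES = ("append", "replace")


def loader_arguments(
    identifier: Union[Tuple[str, ...], str],
    file_type: str,
    mode: str,
    delim: str = ",",
    override: bool = False,
) -> Dict[str, Any]:
    """Validate the arguments and return them as keyword arguments."""
    if file_type not in FILE_TYPES:
        raise ValueError(f"file_type must be one of {FILE_TYPES}")
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}")
    return {
        "identifier": identifier,
        "file_type": file_type,
        "mode": mode,
        "delim": delim,
        "override": override,
    }


# A pipe-delimited CSV loader, with the identifier given as a tuple.
kwargs = loader_arguments(("prod", "default", "invoices_table"), "csv", "append", delim="|")

# With tabulario installed and a catalog configured, the call would then be:
#   from tabular import loader
#   loader.enable_loading(**kwargs)
```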

Then load data into the table (the loading process may take a few minutes):

loader.ingest("acme.default.f3", "/tmp/data.json")

Full documentation:

Signature:
loader.ingest(
    identifier: Union[Tuple[str, ...], str],
    file: str,
    catalog: Optional[pyiceberg.catalog.Catalog] = None,
)
Docstring:
Ingest data into the provided Iceberg table by copying the provided file
into the configured loader path in S3.

:param identifier: target table for loading data
:param file: path to the file to be loaded
:param catalog: optional catalog ('default' if not provided in table identifier)
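
When several files need to be loaded into the same table, the call can be wrapped in a loop. The sketch below is one possible approach, not part of tabulario: the helper ingest_files is hypothetical, and it takes the ingest function as a parameter so that it can be tried without the library installed. In practice, loader.ingest would be passed in.

```python
import os
from typing import Callable, Iterable, List


def ingest_files(
    identifier: str,
    paths: Iterable[str],
    ingest: Callable[[str, str], object],
) -> List[str]:
    """Ingest each file into the table, failing early if any file is missing."""
    files = list(paths)
    missing = [path for path in files if not os.path.isfile(path)]
    if missing:
        # Check everything up front so no partial load is started.
        raise FileNotFoundError(f"files not found: {missing}")
    for path in files:
        ingest(identifier, path)
    return files


# With tabulario installed and loading enabled on the table:
#   from tabular import loader
#   ingest_files("acme.default.f3", ["/tmp/data.json"], loader.ingest)
```

Checking that every file exists before the first call avoids starting a load that would stop partway through the list.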