Spark

Requirements

To connect Apache Spark to a Tabular warehouse, add the catalog to your Spark session using the following configuration:

spark.sql.catalog.<your-warehouse>               org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.<your-warehouse>.catalog-impl  org.apache.iceberg.rest.RESTCatalog
spark.sql.catalog.<your-warehouse>.uri           https://api.tabular.io/ws
spark.sql.catalog.<your-warehouse>.credential    <your-tabular-credential>
spark.sql.catalog.<your-warehouse>.warehouse     <your-warehouse-name>

Your databases and tables will be addressable as <your-warehouse-name>.<db>.<table>.
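
If you build a Spark session programmatically rather than through spark-defaults, the same properties can be set on the session builder. The following is a minimal PySpark sketch, assuming the Iceberg Spark runtime and Tabular client jars are on the classpath (see Running Locally below); the catalog name, credential, and warehouse name are placeholders.

from pyspark.sql import SparkSession

# Placeholder values -- substitute your own catalog name, credential, and warehouse.
catalog = "my_warehouse"
credential = "<your-tabular-credential>"
warehouse = "<your-warehouse-name>"

spark = (
    SparkSession.builder
    .appName("tabular-example")
    .config(f"spark.sql.catalog.{catalog}", "org.apache.iceberg.spark.SparkCatalog")
    .config(f"spark.sql.catalog.{catalog}.catalog-impl", "org.apache.iceberg.rest.RESTCatalog")
    .config(f"spark.sql.catalog.{catalog}.uri", "https://api.tabular.io/ws")
    .config(f"spark.sql.catalog.{catalog}.credential", credential)
    .config(f"spark.sql.catalog.{catalog}.warehouse", warehouse)
    .getOrCreate()
)

# Databases and tables are then addressable as <catalog>.<db>.<table>.
spark.sql(f"SHOW NAMESPACES IN {catalog}").show()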

Using Docker

The easiest way to get up and running with Spark is to use our provided Docker image, which bundles Spark, Iceberg, and the Tabular catalog client. The following command starts a Spark environment and a Jupyter notebook server, available at http://localhost:8888, with some example notebooks.

docker run -it -p 8888:8888 -p 8080:8080 -p 18080:18080 tabulario/spark-tabular notebook <your-tabular-credential> <your-warehouse-name>

You can also replace “notebook” in the above command with any of the following values to get an interactive Spark shell:

  • spark-sql
  • spark-shell
  • pyspark

Example:

docker run -it -p 8888:8888 -p 8080:8080 -p 18080:18080 tabulario/spark-tabular spark-sql <your-tabular-credential> <your-warehouse-name>
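
Whichever shell you choose, the warehouse catalog is already configured when the shell starts. As a rough example, in the pyspark shell you could confirm connectivity like this; the database and table names are placeholders for objects in your own warehouse.

# The Docker image starts pyspark with the Tabular catalog preconfigured,
# so the built-in `spark` session can query the warehouse immediately.
# Placeholders -- substitute a database and table that exist in your warehouse.
spark.table("<your-warehouse-name>.<db>.<table>").show(10)

# Or run SQL directly:
spark.sql("SELECT COUNT(*) FROM <your-warehouse-name>.<db>.<table>").show()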

Running Locally

Tabular works with any Spark runtime environment, including open source Spark running on your local machine.

1. Download the latest versions of the Iceberg Spark runtime and Tabular client jars from the Download Resources page.

2. Download the latest supported version of Spark (currently 3.4.0) from https://spark.apache.org/downloads.

3. Untar the tgz file:

tar -xvzf spark-3.4.0-bin-hadoop3.tgz

4. Copy the two jar files you downloaded in step 1 into the Spark jars directory: ./spark-3.4.0-bin-hadoop3/jars/

5. Change to the Spark home directory:

cd spark-3.4.0-bin-hadoop3

6. Start a spark-sql shell by running the following:

./bin/spark-sql \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.defaultCatalog=tabular \
--conf spark.sql.catalog.tabular=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.tabular.catalog-impl=org.apache.iceberg.rest.RESTCatalog \
--conf spark.sql.catalog.tabular.uri=https://api.tabular.io/ws \
--conf spark.sql.catalog.tabular.credential=<your-tabular-credential> \
--conf spark.sql.catalog.tabular.warehouse=<your-warehouse-name>

Once the spark-sql shell has launched, you can verify that you have access to your Tabular data warehouse by running SHOW CURRENT NAMESPACE.

spark-sql> show current namespace;
tabular
Time taken: 0.217 seconds, Fetched 1 row(s)
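
The same configuration works for the other interactive shells. For example, starting ./bin/pyspark with the identical --conf flags gives you a preconfigured Python session, and the equivalent check looks roughly like this:

# In a pyspark shell launched with the same --conf flags as above,
# the built-in `spark` session already points at the Tabular catalog.
spark.sql("SHOW CURRENT NAMESPACE").show()
spark.sql("SHOW NAMESPACES").show()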

AWS EMR

You can access your Tabular data warehouse from Spark running on AWS EMR clusters using the same jars and configurations as you would for any other Spark deployment.

We recommend using EMR 6.11 or later with Spark 3.3 or later.

Example EMR Configuration

[
  {
    "classification":"spark-defaults",
    "properties":{
      "spark.sql.extensions":"org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
      "spark.sql.defaultCatalog":"tabular",
      "spark.sql.catalog.tabular":"org.apache.iceberg.spark.SparkCatalog",
      "spark.sql.catalog.tabular.catalog-impl":"org.apache.iceberg.rest.RESTCatalog",
      "spark.sql.catalog.tabular.uri":"https://api.tabular.io/ws",
      "spark.sql.catalog.tabular.credential": "<your-tabular-credential>",
      "spark.sql.catalog.tabular.warehouse": "<your-warehouse-name>"
    }
  }
]
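
With this classification in place, jobs submitted to the cluster can use the warehouse without any per-job configuration. Below is a rough PySpark sketch of a script you might run with spark-submit on such a cluster; the database and table names are placeholders, not objects that necessarily exist in your warehouse.

from pyspark.sql import SparkSession

# The spark-defaults classification above already points the session at the
# Tabular catalog, so no extra configuration is needed in the job itself.
spark = SparkSession.builder.appName("tabular-emr-example").getOrCreate()

# Placeholders -- substitute a database and table from your warehouse.
df = spark.table("tabular.<db>.<table>")
print(df.count())
df.show(10)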