This document describes the steps required to configure Athena PySpark notebooks to authenticate with Tabular and execute queries against Tabular-managed Iceberg tables.
Athena PySpark notebooks are hosted by AWS. But they leverage standard open source Apache Spark and Iceberg to connect to Tabular’s REST catalog as a data source for queries. This integration makes it easy to leverage the full capabilities of Spark, Iceberg, and notebooks without the need to manage any infrastructure.
On this page:
Requirements
- Athena workgroup using the PySpark engine.
- Service account or AWS IAM role mapping configured in Tabular for authentication.
Configuration
Navigate to the Athena Notebook Explorer.
- Click on Create notebook.
- Enter a Name for notebook.
- Open Apache Spark properties and click Custom.
- Click the Edit in Table toggle to set it to Edit in JSON.
- Copy and paste the spark configuration below, but replace
<warehouse>
with your Tabular warehouse name. - Click Create notebook.
{
"spark.sql.catalog.<warehouse>": "org.apache.iceberg.spark.SparkCatalog",
"spark.sql.catalog.<warehouse>.catalog-impl": "org.apache.iceberg.rest.RESTCatalog",
"spark.sql.catalog.<warehouse>.credential": "<credential>",
"spark.sql.catalog.<warehouse>.uri": "https://api.tabular.io/ws/",
"spark.sql.catalog.<warehouse>.warehouse": "<warehouse>",
"spark.sql.defaultCatalog": "<warehouse>",
"spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
}
Query your Iceberg tables with Athena PySpark
In addition to using standard pyspark syntax for interacting with your warehouse, in the Athena Pyspark notebook environment you can also use cell magics.
%%sql
SHOW DATABASES;
%%sql
DESCRIBE TABLE examples.nyc_taxi_yellow;
%%sql
create table if not exists default.taxi_tip_rate_per_passenger_spark as
select
passenger_count,
count(*) as trip_count,
avg(tip_amount / fare_amount) as tip_rate
from examples.nyc_taxi_yellow
where
pickup_time between '2020-03-01' and '2020-03-31'
and passenger_count > 0
and fare_amount > 0
group by passenger_count;
%%sql
select
passenger_count,
trip_count,
format_number(tip_rate * 100, 2) as tip_percentage
from default.taxi_tip_rate_per_passenger_spark
order by passenger_count;