Snapshots

In Apache Iceberg, every change to the data in a table creates a new version of the table, called a snapshot. Iceberg metadata keeps track of multiple snapshots at the same time, which allows for:

  • isolation, because readers have enough time to finish using old snapshots
  • incremental consumption
  • time travel queries
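
As a rough illustration of the time travel item above, the following Python sketch builds Spark SQL time travel queries as strings. The `VERSION AS OF` and `TIMESTAMP AS OF` clauses are Iceberg's Spark SQL syntax; the helper names, table name, and snapshot ID are hypothetical, and the helpers do no escaping of their inputs.

```python
def query_as_of_snapshot(table: str, snapshot_id: int) -> str:
    """Build a Spark SQL query that reads the table as of a snapshot ID."""
    return f"SELECT * FROM {table} VERSION AS OF {snapshot_id}"


def query_as_of_timestamp(table: str, timestamp: str) -> str:
    """Build a Spark SQL query that reads the table as of a point in time."""
    return f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'"


# Hypothetical table name and snapshot ID, for illustration only.
print(query_as_of_snapshot("examples.db.orders", 1234567890))
print(query_as_of_timestamp("examples.db.orders", "2024-01-01 00:00:00"))
```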

Only changed files are rewritten to produce a new snapshot. The majority of the existing data and metadata is reused across snapshots to greatly reduce write amplification.

Snapshots are the unit Iceberg uses to keep track of changes to a table. That makes them an important aspect of table maintenance (expiring historical snapshots, which Tabular does automatically, improves performance) and fundamental to time travel queries. Typically, people work with Iceberg snapshots via the Iceberg CLI layer or via the Java API, but Tabular enables you to perform many of the same functions via the UI, including branching and tagging snapshots.

Understanding tagging and branching

Branches are independent lineages of snapshots and point to the head of the lineage. Branching enables you to write to a table and validate data before making it visible for others to consume. For example, you can create a test branch; update the test branch with new data; and query the test branch to validate that the changes pass quality checks before publishing those changes back to the main branch for downstream consumers.

Another example: you can use branches as part of a write/audit/publish workflow. That is, you write to the branch, perform data quality checks, and, once you have confirmed the data is clean, fast forward the main table state to that branch.

Branches work much like branches in a Git repository: they are lightweight named references that point to a snapshot ID.
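
To make the "named reference" idea concrete, here is a minimal toy model in Python. It is a sketch of the concept only, not Iceberg's or Tabular's implementation; the `ToyTable` class and all of its method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ToyTable:
    """Toy model: snapshots form parent chains; refs are names for snapshot IDs."""
    parents: Dict[int, Optional[int]] = field(default_factory=dict)
    refs: Dict[str, int] = field(default_factory=dict)
    next_id: int = 1

    def commit(self, branch: str) -> int:
        """Add a snapshot on top of the branch head and move the branch to it."""
        snapshot_id = self.next_id
        self.next_id += 1
        self.parents[snapshot_id] = self.refs.get(branch)
        self.refs[branch] = snapshot_id
        return snapshot_id

    def create_branch(self, name: str, from_branch: str) -> None:
        """A new branch is just a new name for an existing snapshot ID."""
        if name in self.refs:
            raise ValueError(f"branch {name!r} already exists")
        self.refs[name] = self.refs[from_branch]

    def is_ancestor(self, ancestor: int, snapshot_id: Optional[int]) -> bool:
        """Walk parent links from snapshot_id looking for ancestor."""
        while snapshot_id is not None:
            if snapshot_id == ancestor:
                return True
            snapshot_id = self.parents[snapshot_id]
        return False

    def fast_forward(self, target: str, source: str) -> None:
        """Move target to source's head; allowed only if target is an ancestor."""
        if not self.is_ancestor(self.refs[target], self.refs[source]):
            raise ValueError("target has diverged; cannot fast forward")
        self.refs[target] = self.refs[source]


table = ToyTable()
table.commit("main")                 # snapshot 1 on main
table.create_branch("test", "main")  # test points at snapshot 1
table.commit("test")                 # snapshot 2, visible only on test
assert table.refs == {"main": 1, "test": 2}
table.fast_forward("main", "test")   # publish once quality checks pass
assert table.refs["main"] == 2
```

Creating the branch copies no data in this model; only the `refs` mapping changes, which is why branches are cheap.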

Tags are named references to snapshots that have their own specific retention requirements. For example, you may need to delete PII for a customer as part of a GDPR “right to be forgotten” request, but also need to keep one “end of period” snapshot for longer to comply with financial audit requirements. Tags also differ from branches in that you cannot write to a tag.

Tagging a snapshot can make it easier to locate that snapshot and/or keep it distinct for a specific business purpose or workflow – for example, as part of period-end financial reporting or for training a specific ML model.

In Tabular, each branch and each tag has its own retention policy, which you can set via the Tabular UI. You can also set this policy via engines such as Spark or via the Iceberg Java API.

This is critical for maintaining compliance with regulations such as GDPR, which may require you to delete customer data that you would otherwise need to retain for a financial audit. For example, you can tag a single snapshot with its own retention period (say, one year for financial auditing), while all other snapshots expire more frequently under their own retention policies. In this way, you can use branching and tagging to expire customer data within the regulatory timeline that applies to it, while preserving only the minimum amount of data, for the minimum length of time, needed to satisfy a separate requirement.
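
The interaction between a table-wide retention period and a tag's own retention period can be sketched with a small Python model. This is a simplified illustration under stated assumptions, not Iceberg's expiration algorithm: it ignores branches and the minimum-snapshots-to-keep setting, it measures a tag's age from the commit time of the snapshot it points to, and the function and variable names are hypothetical.

```python
from typing import Dict, List, Tuple

DAY_MS = 24 * 60 * 60 * 1000


def snapshots_to_expire(
    snapshots: Dict[int, int],         # snapshot ID -> commit time (ms)
    tags: Dict[str, Tuple[int, int]],  # tag name -> (snapshot ID, tag max age in ms)
    now_ms: int,
    table_max_age_ms: int,
) -> List[int]:
    """Return snapshot IDs that no retention rule in this toy model protects."""
    # A tag protects its snapshot until the tag's own retention period elapses.
    protected = {
        snapshot_id
        for snapshot_id, max_age_ms in tags.values()
        if now_ms - snapshots[snapshot_id] <= max_age_ms
    }
    return sorted(
        snapshot_id
        for snapshot_id, committed_ms in snapshots.items()
        if now_ms - committed_ms > table_max_age_ms and snapshot_id not in protected
    )


now = 400 * DAY_MS
snapshots = {1: now - 90 * DAY_MS, 2: now - 60 * DAY_MS, 3: now - 1 * DAY_MS}
tags = {"end-of-period": (1, 365 * DAY_MS)}  # keep snapshot 1 for one year
expired = snapshots_to_expire(snapshots, tags, now, table_max_age_ms=30 * DAY_MS)
assert expired == [2]  # snapshot 1 is old but tagged; snapshot 3 is recent
```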

Managing snapshots

To work with snapshots, including branching them and tagging them, navigate to the table you want, and in the table overview page click Snapshots.

The Snapshots overview page displays comprehensive information about each snapshot. From here you can branch and tag snapshots and drill down to view granular details about each snapshot.

The top portion of the snapshot page displays table information in chart form, including time-stamped snapshot details over a period of time:

  • total record count
  • added records
  • deleted records

Hover over any point in the graph to see table information for the corresponding snapshot. You can hide any of these series by toggling its label to the right of the graph: click the label to hide the series, and click it again to redisplay it.

Note    You do not configure snapshot retention from this page. Instead, access retention settings by navigating to the table overview page and clicking Settings.

In the lower portion of the Snapshots overview is a list of every snapshot of this table.

To swiftly view a large number of snapshots

  • Click the Show per page box and select from 10, 25, 50, or 100 snapshots per page.

To view details of an individual snapshot

  • Next to the snapshot you want, click the down arrow. The row expands to display a comprehensive range of information, including operation type, file size, delete types, and more. To collapse the details, click the up arrow.

To copy a snapshot ID

  • Click the copy icon.

To delete old snapshots

  • Contact your organization’s Tabular Security Admin. Tabular provides an easy method for manually deleting old snapshots, but only Tabular Security Admins have permission to use it. You can also consult the Iceberg documentation for more details on expiring snapshots.
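
For reference, outside the Tabular UI, Iceberg exposes snapshot expiration to Spark as the `expire_snapshots` stored procedure. The Python sketch below only builds the `CALL` statement as a string; the helper name, catalog, and table are hypothetical, and whether your role is permitted to run the procedure against a Tabular-managed table is an assumption you should verify.

```python
def expire_snapshots_call(catalog: str, table: str, older_than: str, retain_last: int) -> str:
    """Build a Spark SQL CALL to Iceberg's expire_snapshots procedure."""
    return (
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', "
        f"older_than => TIMESTAMP '{older_than}', "
        f"retain_last => {retain_last})"
    )


# Hypothetical catalog and table names, for illustration only.
print(expire_snapshots_call("examples", "db.orders", "2024-01-01 00:00:00.000", 5))
```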

Creating and tagging branches

Using the Tabular UI to work with branches can save time over, for instance, performing the same function in Spark. In addition, some compute engines don’t necessarily possess this functionality at all; Trino, for example, can read from branches but cannot create them.

Note    You can also create a tag or branch using the Iceberg Table API, or using extended table DDL in Spark. Details in the Tabular Apache Iceberg Cookbook.
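
As a sketch of the extended Spark DDL that the note refers to, the Python helpers below build the statements for creating a branch and for reading from it. `ALTER TABLE ... CREATE BRANCH` and `VERSION AS OF '<branch>'` are Iceberg's Spark SQL syntax; the helper names, table name, and branch name are hypothetical, and inputs are not escaped.

```python
from typing import Optional


def create_branch_sql(table: str, branch: str, snapshot_id: Optional[int] = None) -> str:
    """Build the Spark DDL that creates a branch, optionally at a given snapshot."""
    sql = f"ALTER TABLE {table} CREATE BRANCH `{branch}`"
    if snapshot_id is not None:
        sql += f" AS OF VERSION {snapshot_id}"
    return sql


def read_branch_sql(table: str, branch: str) -> str:
    """Build a Spark SQL query that reads the head of a branch."""
    return f"SELECT * FROM {table} VERSION AS OF '{branch}'"


# Hypothetical names, for illustration only.
print(create_branch_sql("examples.db.orders", "test", snapshot_id=1234567890))
print(read_branch_sql("examples.db.orders", "test"))
```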

To create a branch

  1. Navigate to the table you want and click Snapshots.

  2. In the list of snapshots that displays, next to the snapshot you want, click the branching icon.

  3. When prompted, enter a branch name and click Create Branch. Your new branch displays in the list, and is also added to the Branch name drop-down menu just above the branch list.

    Note    Each branch name must be unique.

[Image: Viewing events]

You can also create a sub-branch (that is, a branch of a branch).

It’s not unusual, over time, to accrue a large number of branches. But you can quickly filter the view down to display only those snapshots in a specific branch.

To view only the snapshots in a specific branch

  • Above the timestamp column, click the branch selector down arrow (by default it displays as main). Then from the list that displays, select the branch you wish to view.

Rolling back a branch

At any time, you can roll back the branch you are on to return your table and branch to the state immediately prior.

Note   Rolling back a branch acts only on end branches. You cannot roll back the most recent version of the main branch. In the image below, note that the rollback function for the most recent main branch is unavailable.

[Image: Rollback function]

To roll back a branch

  • Next to the branch you want, click the refresh icon.
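
Conceptually, a rollback moves a branch reference back to the parent of its current snapshot. The following toy Python model illustrates that idea only; it is not how Tabular implements the button, and the `roll_back` function and its data structures are hypothetical.

```python
from typing import Dict, Optional


def roll_back(refs: Dict[str, int], parents: Dict[int, Optional[int]], branch: str) -> int:
    """Point the branch at the parent of its current head and return the new head."""
    parent = parents[refs[branch]]
    if parent is None:
        raise ValueError(f"branch {branch!r} has no earlier snapshot to roll back to")
    refs[branch] = parent
    return parent


parents = {1: None, 2: 1, 3: 2}  # snapshot ID -> parent snapshot ID
refs = {"main": 1, "test": 3}    # branch name -> head snapshot ID
assert roll_back(refs, parents, "test") == 2
assert refs == {"main": 1, "test": 2}
```

In this model the rolled-back snapshot (3) still exists; only the reference moved.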

Deleting a branch

Unlike with snapshots, you cannot use the Tabular UI to delete a branch; for that, you must use an engine such as Spark. For details, see the Apache Iceberg documentation and the Branches and Tags chapter of the Tabular Iceberg Cookbook.
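
A minimal sketch of the Spark route, assuming Iceberg's `ALTER TABLE ... DROP BRANCH` DDL: the Python helper below builds the statement as a string. The helper name and the table and branch names are hypothetical, and the check that refuses `main` is a precaution added for this example.

```python
def drop_branch_sql(table: str, branch: str) -> str:
    """Build the Spark DDL that drops a branch from an Iceberg table."""
    if branch == "main":
        # Precaution in this helper: never emit a statement that targets main.
        raise ValueError("refusing to drop the main branch")
    return f"ALTER TABLE {table} DROP BRANCH `{branch}`"


# Hypothetical names, for illustration only.
print(drop_branch_sql("examples.db.orders", "test"))
```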

Creating tags

You can tag either a snapshot or a branch. When you create a tag you also have the option of setting a retention period that applies only to the tagged snapshot or branch.

To add a tag to a branch and set a discrete retention policy for the tagged data

  1. Navigate to the table you want and click Snapshots.
  2. In the list that displays, next to the snapshot or branch you want, click +.
  3. When prompted, enter a tag name.
  4. Enter a retention period for the branch or snapshot you’re tagging.
  5. Click Create tag.

Important    Tag retention policies override the table defaults for retention. This is how you can, for example, create a snapshot branch for monthly financial data and tag and preserve it for one year, even though the retention policy for that table overall may only be 30 days.
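
Outside the UI, the same idea can be expressed with Iceberg's Spark DDL, where a tag's retention period is given by a `RETAIN ... DAYS` clause. The Python sketch below builds that statement as a string; the helper name, table name, tag name, and snapshot ID are hypothetical, and inputs are not escaped.

```python
from typing import Optional


def create_tag_sql(table: str, tag: str, snapshot_id: int, retain_days: Optional[int] = None) -> str:
    """Build the Spark DDL that tags a snapshot, optionally with its own retention."""
    sql = f"ALTER TABLE {table} CREATE TAG `{tag}` AS OF VERSION {snapshot_id}"
    if retain_days is not None:
        sql += f" RETAIN {retain_days} DAYS"
    return sql


# Keep a hypothetical end-of-period snapshot for one year.
print(create_tag_sql("examples.db.orders", "end-of-period", 1234567890, retain_days=365))
```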

Note    Any one snapshot can be the root of a branch.

Table Properties

In addition to the standard Apache Iceberg table properties, Tabular supports additional properties to enable and configure Tabular’s automated services.

Snapshot management properties


| Property | Default | Description |
| --- | --- | --- |
| history.expire.max-snapshot-age-ms | 432000000 (5 days) | Default max age of snapshots to keep while expiring snapshots |
| history.expire.min-snapshots-to-keep | 1 | Default min number of snapshots to keep while expiring snapshots |
| history.expire.max-ref-age-ms | Long.MAX_VALUE (forever) | For snapshot references except the main branch, default max age of snapshot references to keep while expiring snapshots. The main branch never expires. |
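
Because these properties are expressed in milliseconds, a small conversion helper can reduce mistakes. The Python sketch below converts a duration to milliseconds and builds a Spark `ALTER TABLE ... SET TBLPROPERTIES` statement for the two snapshot properties above. The helper names and table name are hypothetical, and whether you set these properties through Spark or through the Tabular UI depends on your setup.

```python
from datetime import timedelta


def to_ms(duration: timedelta) -> int:
    """Convert a duration to whole milliseconds using integer arithmetic."""
    return duration // timedelta(milliseconds=1)


def set_snapshot_retention_sql(table: str, max_age: timedelta, min_to_keep: int) -> str:
    """Build Spark DDL that sets the snapshot expiration properties on a table."""
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        f"'history.expire.max-snapshot-age-ms'='{to_ms(max_age)}', "
        f"'history.expire.min-snapshots-to-keep'='{min_to_keep}')"
    )


# The documented default of 5 days corresponds to 432000000 ms.
assert to_ms(timedelta(days=5)) == 432000000
print(set_snapshot_retention_sql("examples.db.orders", timedelta(days=30), 1))
```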