Implementing Change Data Capture (CDC) in Tabular


In CDC, an update stream captures changes to relational tables as they happen, keeping downstream copies in sync. As new database events occur, data is moved and processed continuously in real time or near-real time.

Tabular’s Change Data Capture (CDC) is a fully automated service for merging a set of change events from sources such as Debezium or AWS Database Migration Service into a Tabular-managed Iceberg table.

The Tabular CDC implementation uses table mirroring, avoiding the more complex operations that traditional data warehouses require to perform CDC.

  1. You send CDC data into Iceberg.
  2. Tabular automatically assembles the raw Iceberg table and materializes the merge table that mirrors the source.
    • This typically happens with less than 5 minutes of latency.
  3. Tabular splits the merge into inserts and updates/deletes.
  4. It executes the merge with the updates/deletes and appends the inserts.
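The split in steps 3 and 4 can be sketched in Spark SQL. This is only an illustration of the equivalent work Tabular performs automatically; the table and column names (db.changes, db.mirror, id, col1, col2) are hypothetical:

```sql
-- Apply updates and deletes with a single MERGE (names are hypothetical)
MERGE INTO db.mirror AS t
USING (SELECT * FROM db.changes WHERE op IN ('U', 'D')) AS s
ON t.id = s.id
WHEN MATCHED AND s.op = 'D' THEN DELETE
WHEN MATCHED AND s.op = 'U' THEN UPDATE SET t.col1 = s.col1, t.col2 = s.col2;

-- Append new rows separately; a plain append avoids the cost of merging them
INSERT INTO db.mirror
SELECT id, col1, col2 FROM db.changes WHERE op = 'I';
```

Splitting the work this way means only rows that actually modify existing data pay the cost of a merge.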


CDC is lightweight on the source database: it is comparatively simple to copy the database and then keep that copy up to date. Nevertheless, getting CDC right is challenging.

CDC in Tabular requires two tables:

  1. A change event table that materializes all change events from your upstream transactional database.
  2. A target table into which you merge the change events.

Broadly, here’s how they work together:

  1. Each commit to the changelog table queues up a CDC merge job for all dependent tables.
  2. The merge job runs, provided there isn’t one already in progress and the target-lag amount of time has passed since the previous job.
    • Each merge job incrementally reads the most recent set of events from the changelog and merges them into the target table.
    • If schema changes are detected, Tabular automatically propagates the schema from the change event table to the target table.

Note The change event table will substantially outgrow the source table, but Tabular’s automated compaction services maintain the health of the change event table.

Creating a CDC Pipeline via the Tabular UI

Creating a CDC pipeline via the Tabular UI establishes the connection between the two tables and begins running the update process automatically. No further intervention on your part is required. Along the way, Tabular ensures there is no missing or duplicate data.

Below are the steps for creating this pipeline entirely within the Tabular UI. The process works much like a wizard, guiding you through each successive step.

Need or prefer to establish this pipeline manually? Following the procedure below, we provide the basic parameters and settings you need.

To create a pipeline via the Tabular UI

  1. Navigate to the database you want to contain your CDC tables.
  2. Click the Plus icon, then click Create CDC Pipeline in the list that displays. The Create CDC Pipeline page displays.
    • This leads you to choose CDC data from a Tabular storage profile and then create a target table.
  3. Select the folder containing the CDC files that represent changes to your table. It may require several clicks to get to the data you want. When you’re done, the folder path should look something like this:

[Screenshot: the selected folder path]

  1. Click Select Path.
  2. Select the type of change files found in your selected path. To do this, click the File format down arrow and select from the list that displays.
    • Typically these will be Apache Parquet files.
  3. Click Configure Pipeline Settings.
  4. Configure the changelog settings. To do this, first verify that the Changelog database name is correct. Then click Specify Changelog Order Columns and select the column(s) to order by.
  5. Enter the merge frequency you wish – that is, the length of time, in minutes, between merge operations.
    • The default is 5.
  6. Click Review Target Schema.
  7. When the schema page displays, verify the name of the target table, and add an optional description, if you wish.
  8. Specify the primary key for your data using the ID checkboxes within your target schema.
    • Any required fields display an empty ID checkbox. You must check this box before you can create the pipeline.
  9. Optionally, add partitioning fields, write order fields, and any custom table properties. You can also add more columns.
  10. When you’re done, click Create Pipeline. The Create CDC Pipeline page displays, indicating the pipeline has been created and also specifying the changelog (source) table and the target (merge) table.
    • This configures Tabular File Loader to load any pre-existing data and autoload any new files that land in the future.
    • Details at the bottom indicate existing files are loading, how many files are loading, and how much data is loading.
  11. Click Close.

Configuring CDC Manually

You can use any tool (such as PyIceberg or Apache Spark) that enables you to create Iceberg tables and set table properties.

When creating the changelog table:

  1. You must include a column or struct field named “op”. This column indicates the type of operation performed – insert, update, or delete – so Tabular knows what to do with the incoming record.
  2. Specify which column(s) will act as the unique row identifier(s).
  3. Set the ordering of the change events to put into the target table.
    • Transaction ID is best. Timestamp also works well.

Partition or order this table using the column(s) that represent the ordering of the change events. If you are using a timestamp column, we recommend using an Iceberg partition transform (such as hour or day).
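A minimal changelog table along these lines might be created with Spark SQL as follows. The table and column names are hypothetical; only the “op” column, a row-identifier column, and an ordering column are required by the conventions above:

```sql
-- Hypothetical changelog table for an "orders" source table
CREATE TABLE db.orders_changelog (
  op         string,         -- operation type: insert, update, or delete
  id         bigint,         -- unique row identifier from the source table
  updated_at timestamp,      -- ordering column for change events
  total      decimal(10,2)
)
USING iceberg
PARTITIONED BY (hours(updated_at));  -- partition transform on the ordering column
```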

When creating the target (mirror) table:

This is a copy of the transactional table; its schema must match the schema of the source table for the change events, except for the CDC “op” column, which you omit here.

  1. Indicate the column(s) that will act as the unique row identifier(s). Row ID is used as the primary key on which to merge new records.
  2. Create a write order for this table.
    • We recommend ordering by the same columns that you set as row identifiers, starting with the column with the lowest expected cardinality.

Configure the table properties.
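Continuing the hypothetical example above, the target (mirror) table could be created like this. The property values shown are assumptions for illustration; set them to match your own key and ordering columns:

```sql
-- Hypothetical mirror table: same schema as the source, minus the "op" column
CREATE TABLE db.orders (
  id         bigint,
  updated_at timestamp,
  total      decimal(10,2)
)
USING iceberg
TBLPROPERTIES (
  'etl.job-type'   = 'cdc',
  'cdc.ts-column'  = 'updated_at',  -- column(s) used to order the change events
  'cdc.key-column' = 'id'           -- column(s) that uniquely identify a row
);

-- Write order on the row-identifier column(s)
ALTER TABLE db.orders WRITE ORDERED BY id;
```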

Merge Table Properties

  • etl.job-type – Must be set to cdc. Required.
  • etl.target-lag-minutes – Positive integer representing the number of minutes to wait between merge operations. When not set, data is merged as fast as possible. Default: 0. Optional.
  • cdc.ts-column – Comma-separated list of columns used to order the change events. Required.
  • cdc.key-column – Comma-separated list of columns that uniquely identify a row in the table. If the table’s Identifier Field IDs are set, those are preferred.

Change Event Table Properties

  • dependent-tables – Comma-separated list of table identifiers (including database) that may need to be updated whenever this table is updated. These are the target tables into which you are merging the change events – for example, db.mirror1,db.mirror2. Required.

Note You can merge into multiple tables – for example, because they’re differently partitioned, or are sorted for different use cases or performance reasons.
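For example, pointing a hypothetical changelog table at the two mirrors named above:

```sql
-- One changelog feeding two differently organized target tables
ALTER TABLE db.orders_changelog SET TBLPROPERTIES (
  'dependent-tables' = 'db.mirror1,db.mirror2'
);
```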