File Loader

File Loader provides a simple way to load data into a Tabular table by automatically loading data files from a designated s3 location.

On this page:

How does File Loader work?

Currently, the File Loader uses S3 notifications configured for a specific path within your bucket to “listen” for new files being created. By using S3 notifications to signal files that need to be loaded, the File Loader enables low latencies for data from newly created files to be available within the Tabular table, typically within 5 minutes.

Initial Setup

In order to use an s3 bucket as a source for File Loader, you must configure the AWS IAM role for a Tabular warehouse with appropriate permissions to the s3 bucket and create an s3 bucket notification event to notify Tabular when new data files are available to load.

If the File Loader location is in the same bucket as the table being loaded, then no additional IAM privileges are required; just the s3 bucket notification.

IAM Policy

Tabular requires the following actions on an s3 bucket as a File Loader source:

  • s3:ListBucket
  • s3:GetBucketLocation
  • s3:GetBucketNotification
  • s3:PutBucketNotification
  • s3:GetObject

Example IAM policy (substitute <s3-bucket-name> with your s3 bucket name)

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:GetBucketNotification",
        "s3:PutBucketNotification"
      ],
     "Resource": [
        "arn:aws:s3:::<s3-bucket-name>"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": [
         "arn:aws:s3:::<s3-bucket-name>/<file-loader-prefix>/*"
      ]
    }
  ]
}

Event Notification for File Loader

When creating the s3 bucket event notification, use the following settings:

  • Prefix (optional): <file-loader-prefix>/
  • Event Types: All object create events
  • Destination: SQS topic
  • SQS topic ARN: arn:aws:sqs:us-east-1:237881912361:tabular-loader-s3-notifications-queue

The <file-loader-prefix> is optional. If provided, only files that are created under that prefix will be eligible for auto-loading. The default prefix is tabular/staged/.

Configuring File Loader for a Table

In the Tabular app, navigate to the table you want to enable File Loader for and click on the SETTINGS tab. Scroll down to the File Loader section that should look like below:

button

File Loader Settings

Click EDIT in the upper right-hand corner,

You must select a file format for the input data files. See below for supported file formats.

Choose a write-mode that fits your use case:

  • append - File Loader inserts data from files created into the table.
  • replace - File Loader fully replaces the existing data in the table with data from files created within the 5-minute window.

By default, the File Loader looks in a predefined path for files to load into the table with the following format:

s3://{warehouse-bucket}/tabular/staged/{warehouse-name}/{database-name}/{table-name}

You can also use the Tabular Python Library to enable and use the File Loader.

Supported File Formats

File Loader currently supports loading data from the following file formats:

  • CSV (*)
  • JSON
  • Parquet

*For CSV file format, you can set the column delimiter using the table property fileloader.csv.column-delimiter (e.g. for TSV, this is set to \t). This table property is only applicable if the file format is CSV, which the default expects comma-delimited.

Excluding Files From Loading

By default, the File Loader processes and loads any file into the table that is created within the configured File Loader path for that table. However, there may be cases where files are created within the table’s file load path that you do not want to have loaded into the table (e.g. temporary or intermediate files created by the process creating the files). You can set the table property fileloader.file-exclude-glob-filter to a glob pattern, and the File Loader will ignore any files it finds that match that pattern.

For example, setting fileloader.file-exclude-glob-filter to **/_* will ignore any file name that starts with an underscore. If you wanted to ignore all files within a certain “folder” within the table’s file load path (e.g. /tmp/), you would set that table property to /tmp/* or something similar.

Loading Hive Partition Values

For use cases that load data files dropped into a Hive directory structure (e.g. mirroring a frequently updated Hive table), the File Loader allows loading Hive-like partition columns and values into your Tabular table by setting the table property fileloader.parse-path-values to true (default is not set false). If this is set to true, the File Loader expects these partition columns to be in your Tabular table’s schema.

For example, for the following file:

<table_file_load_path>/dt=2023-01-21/cat=foo/file1.csv

If fileloader.parse-path-values is true, the File Loader sets column dt to 2023-01-21 and cat to foo for all rows when loading this data file.

File Loader FAQ

How does the File Loader handle schema?

For the most part, the File Loader expects files it tries to load in to match the table schema. However, the File Loader will try and make a best effort to coerce the incoming data values to match the expected types for each field (this can be dependent on the file-format of the input files). The File Loader will also ignore extra fields and insert null values for missing fields.