File Loader

File Loader provides a simple way to load data into a Tabular table by automatically loading data files from a designated S3 location.

On this page:

How does File Loader work?

Currently, the File Loader uses S3 notifications configured for a specific path within your bucket to “listen” for new files being created. By using S3 notifications to signal files that need to be loaded, the File Loader enables low latencies for data from newly created files to be available within the Tabular table, typically within 5 minutes.

Initial Setup

In order to use an S3 bucket as a source for File Loader, you must configure the AWS IAM role for a Tabular warehouse with appropriate permissions to the S3 bucket and create an S3 bucket notification event to notify Tabular when new data files are available to load.

If the File Loader location is in the same bucket as the table being loaded, then no additional IAM privileges are required; just the S3 bucket notification.

IAM Policy

Tabular requires the following actions on an S3 bucket as a File Loader source:

  • s3:ListBucket
  • s3:GetBucketLocation
  • s3:GetBucketNotification
  • s3:PutBucketNotification
  • s3:GetObject

Example IAM policy (substitute <s3-bucket-name> with your S3 bucket name)

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:GetBucketNotification",
        "s3:PutBucketNotification"
      ],
      "Resource": ["arn:aws:s3:::<s3-bucket-name>"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3:::<s3-bucket-name>/<file-loader-prefix>/*"]
    }
  ]
}

Event Notification for File Loader

When creating the S3 bucket event notification, use the following settings:

  • Prefix (optional): <file-loader-prefix>/
  • Event Types: All object create events
  • Destination: SQS topic
  • SQS topic ARN: arn:aws:sqs:<region>:237881912361:tabular-loader-s3-notifications-queue

The <file-loader-prefix> is optional. If provided, only files that are created under that prefix will be eligible for auto-loading. The default prefix is tabular/staged/.

Replace <region> with the relevant AWS region name – for example, us-west-2, us-east-1, and so on

Configuring File Loader for a Table

In the Tabular app, navigate to the table you want to enable File Loader for and click on the SETTINGS tab. Scroll down to the File Loader section that should look like below:

button

File Loader Settings

Click EDIT in the upper right-hand corner,

You must select a file format for the input data files. See below for supported file formats.

Choose a write-mode that fits your use case:

  • append - File Loader inserts data from files created into the table.
  • replace - File Loader fully replaces the existing data in the table with data from files created within the 5-minute window.

By default, the File Loader looks in a predefined path for files to load into the table with the following format:

s3://{warehouse-bucket}/tabular/staged/{warehouse-name}/{database-name}/{table-name}

You can also use the Tabular Python Library to enable and use the File Loader.

Supported File Formats

File Loader currently supports loading data from the following file formats:

  • CSV (*)
  • JSON
  • Parquet
  • XML (**)

*For CSV file format, you can set the column delimiter using the table property fileloader.csv.column-delimiter (e.g. for TSV, this is set to \t). This table property is only applicable if the file format is CSV, which the default expects comma-delimited.

**For XML file format, you can set the tag that represents a row in your data using the table property fileloader.xml.rowTag. The property value should be the name of the tag without angle braces, the default is ROW which would match a <ROW> tag in the source XML file.

Schema Resolution

By default, the File Loader expects files it tries to load in to match the table schema. However, the File Loader will try and make a best effort to coerce the incoming data values to match the expected types for each field (this can be dependent on the file-format of the input files). The File Loader will also ignore extra fields and insert null values for missing fields.

Automatic Schema Evolution

The File Loader can be configured to automatically evolve the table schema to match the schema found in the incoming data files. You can set the table property fileloader.evolve-schema to true (if not set, default is false).

Case sensitivity

By default, the File Loader treats column names as case-sensitive and will only load values into the table when the column name in the source file matches exactly. You can configure the File Loader to use case-insensitive column name matching by setting the table property fileloader.case-insensitive to true (if not set, default is false).

Including File Metadata

The File Loader can automatically include metadata about the source files as part of each row in the table.

The metadata will be included if the table has a column named by _load_metadata. You can also customize the name of this column by setting the table property fileloader.metadata-field-name.

The target column should be of type struct with the following nested fields:

nametypedescription
source_file_namestringname of the source file
source_file_creation_tstimestamps3 creation time for the source file
load_tstimestamptime the file was loaded
load_numlongrow position in the file

Excluding Files From Loading

By default, the File Loader processes and loads any file into the table that is created within the configured File Loader path for that table. However, there may be cases where files are created within the table’s file load path that you do not want to have loaded into the table (e.g. temporary or intermediate files created by the process creating the files). You can set the table property fileloader.file-exclude-glob-filter to a glob pattern, and the File Loader will ignore any files it finds that match that pattern.

For example, setting fileloader.file-exclude-glob-filter to **/_* will ignore any file name that starts with an underscore. If you wanted to ignore all files within a certain “folder” within the table’s file load path (e.g. /tmp/), you would set that table property to /tmp/* or something similar.

Loading Hive Partition Values

For use cases that load data files dropped into a Hive directory structure (e.g. mirroring a frequently updated Hive table), the File Loader allows loading Hive-like partition columns and values into your Tabular table by setting the table property fileloader.parse-path-values to true (if not set, default is false). If this is set to true, the File Loader expects these partition columns to be in your Tabular table’s schema.

For example, for the following file:

<table_file_load_path>/dt=2023-01-21/cat=foo/file1.csv

If fileloader.parse-path-values is true, the File Loader sets column dt to 2023-01-21 and cat to foo for all rows when loading this data file.