File Loader

File Loader provides a simple way to load data into a Tabular table by automatically loading data files from a designated S3 location.

On this page:

How does File Loader work?

Currently, the File Loader uses S3 notifications configured for a specific path within your bucket to “listen” for new files being created. By using S3 notifications to signal files that need to be loaded, the File Loader enables low latencies for data from newly created files to be available within the Tabular table, typically within 5 minutes.

Initial Setup

In order to use an S3 bucket as a source for File Loader, you must configure the AWS IAM role for a Tabular warehouse with appropriate permissions to the S3 bucket and create an S3 bucket notification event to notify Tabular when new data files are available to load.

If the File Loader location is in the same bucket as the table being loaded, then no additional IAM privileges are required; just the S3 bucket notification.

IAM Policy

Tabular requires the following actions on an S3 bucket as a File Loader source:

  • s3:ListBucket
  • s3:GetBucketLocation
  • s3:GetBucketNotification
  • s3:PutBucketNotification
  • s3:GetObject

Example IAM policy (substitute <s3-bucket-name> with your S3 bucket name)

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:GetBucketNotification",
        "s3:PutBucketNotification"
      ],
      "Resource": ["arn:aws:s3:::<s3-bucket-name>"]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": ["arn:aws:s3:::<s3-bucket-name>/<file-loader-prefix>/*"]
    }
  ]
}

Event Notification for File Loader

When creating the S3 bucket event notification, use the following settings:

  • Prefix (optional): <file-loader-prefix>/
  • Event Types: All object create events
  • Destination: SQS topic
  • SQS topic ARN: arn:aws:sqs:<region>:237881912361:tabular-loader-s3-notifications-queue

The <file-loader-prefix> is optional. If provided, only files that are created under that prefix will be eligible for auto-loading. The default prefix is tabular/staged/.

Replace <region> with the relevant AWS region name – for example, us-west-2, us-east-1, and so on

Configuring File Loader for a Table

In the Tabular app, navigate to the table you want to enable File Loader for and click on the SETTINGS tab. Scroll down to the File Loader section that should look like below:

button

File Loader Settings

Click EDIT in the upper right-hand corner,

You must select a file format for the input data files. See below for supported file formats.

Choose a write-mode that fits your use case:

  • append - File Loader inserts data from files created into the table.
  • replace - File Loader fully replaces the existing data in the table with data from files created within the 5-minute window.

By default, the File Loader looks in a predefined path for files to load into the table with the following format:

s3://{warehouse-bucket}/tabular/staged/{warehouse-name}/{database-name}/{table-name}

You can also use the Tabular Python Library to enable and use the File Loader.

Supported File Formats

File Loader currently supports loading data from the following file formats:

  • CSV (*)
  • JSON
  • Parquet
  • XML (**)

*For CSV file format, you can set the column delimiter using the table property fileloader.csv.column-delimiter (e.g. for TSV, this is set to \t). This table property is only applicable if the file format is CSV, which the default expects comma-delimited.

**For XML file format, you can set the tag that represents a row in your data using the table property fileloader.xml.rowTag. The property value should be the name of the tag without angle braces, the default is ROW which would match a <ROW> tag in the source XML file.

Schema Resolution

By default, the File Loader expects files it tries to load in to match the table schema. However, the File Loader will try and make a best effort to coerce the incoming data values to match the expected types for each field (this can be dependent on the file-format of the input files). The File Loader will also ignore extra fields and insert null values for missing fields.

Automatic Schema Evolution

The File Loader can be configured to automatically evolve the table schema to match the schema found in the incoming data files. You can set the table property fileloader.evolve-schema to true (if not set, default is false).

Case sensitivity

By default, the File Loader treats column names as case-sensitive and will only load values into the table when the column name in the source file matches exactly. You can configure the File Loader to use case-insensitive column name matching by setting the table property fileloader.case-insensitive to true (if not set, default is false).

Including File Metadata

The File Loader can automatically include metadata about the source files as part of each row in the table.

The metadata will be included if the table has a column named by _load_metadata. You can also customize the name of this column by setting the table property fileloader.metadata-field-name.

The target column should be of type struct with the following nested fields:

nametypedescription
source_file_namestringname of the source file
source_file_creation_tstimestamps3 creation time for the source file
load_tstimestamptime the file was loaded
load_numlongrow position in the file

Excluding Files From Loading

By default, the File Loader processes and loads any file into the table that is created within the configured File Loader path for that table. However, there may be cases where files are created within the table’s file load path that you do not want to have loaded into the table (e.g. temporary or intermediate files created by the process creating the files). You can set the table property fileloader.file-exclude-glob-filter to a glob pattern, and the File Loader will ignore any files it finds that match that pattern.

For example, setting fileloader.file-exclude-glob-filter to **/_* will ignore any file name that starts with an underscore. If you wanted to ignore all files within a certain “folder” within the table’s file load path (e.g. /tmp/), you would set that table property to /tmp/* or something similar.

Loading Hive Partition Values

For use cases that load data files dropped into a Hive directory structure (e.g. mirroring a frequently updated Hive table), the File Loader allows loading Hive-like partition columns and values into your Tabular table by setting the table property fileloader.parse-path-values to true (if not set, default is false). If this is set to true, the File Loader expects these partition columns to be in your Tabular table’s schema.

For example, for the following file:

<table_file_load_path>/dt=2023-01-21/cat=foo/file1.csv

If fileloader.parse-path-values is true, the File Loader sets column dt to 2023-01-21 and cat to foo for all rows when loading this data file.

Loading Into a Branch

File Loader supports loading data into a specific branch of the table. By default, File Loader will load into the main branch of the table. To change this, set fileloader.table-branch to the name of the desired branch to be loaded into. This pattern is particularly useful if you want to perform any data quality checks or data cleanup activities like deduplication prior to the loaded data being available for downstream consumption from the main branch.

The target branch must exist otherwise the File Loader will fail.

File Loader Configuration Reference

PropertyDescription
fileloader.enabledtrue/false - determines if we try and match and load new files for this table
fileloader.pathpath we use to match newly created files to table (for example, ‘S3://bucket-name/path’)
fileloader.file-formatjson, csv, parquet, orc, xml
fileloader.write-modeappend, replace
fileloader.parse-path-valuestrue/false - if ’true’, any ’name=value’ path parts will be parsed and added as columns/values to the input data
fileloader.file-exclude-glob-filterany files that match the glob pattern set will not be loaded into the table
fileloader.target-branchname of the table branch for data to be loaded into
fileloader.evolve-schematrue/false - determines if file loader should attempt to evolve the table schema based on the incoming file data
fileloader.case-insensitivetrue/false - determines if file loader should use case-insensitive matching for field names between the file schema and the table schema
fileloader.load-seriallytrue/false - determines if file loader should process a single incoming batch of files at a time (default: false)
fileloader.csv.column-delimiterset a custom field/value delimiter – only applicable for file-format = ‘csv’ (default: ‘,’)
fileloader.csv.headertrue/false – if ’true’, file loader will assume the first row is a header row instead of a data row (default: false)
fileloader.csv.escapeset a custom escape character (default: ‘')
fileloader.csv.dateFormatsets the string that indicates a date format. Custom date formats follow the Datetime Patterns. (default: ‘yyyy-MM-dd’)
fileloader.csv.timestampFormatsets the string that indicates a timestamp format. Custom date formats follow the Datetime Patterns. (default: ‘yyyy-MM-dd’T’HH:mm:ss[.SSS][XXX]’)
fileloader.csv.timestampNTZFormatsets the string that indicates a timestamp without timezone format. Custom date formats follow the Datetime Patterns. (default: ‘yyyy-MM-dd’T’HH:mm:ss[.SSS]’)
fileloader.xml.rowTagname of the tag without angle braces to be used as the source for each row in the table, the default is ROW which would match a <ROW> tag in the source XML file