S3 sink connector by Aiven naming and data formats
The Apache Kafka Connect® S3 sink connector by Aiven enables you to move data from an Aiven for Apache Kafka® cluster to Amazon S3 for long-term storage. This article describes the advanced parameters that define object naming and data formats.
Aiven provides two versions of the S3 sink connector: one developed by Aiven and one developed by Confluent.
This article covers the Aiven version. Documentation for the Confluent version is available on its dedicated page.
S3 naming format
The Apache Kafka Connect® S3 sink connector by Aiven stores a series of files as objects in the specified S3 bucket. By default, each object is named using the pattern:
<AWS_S3_PREFIX><TOPIC_NAME>-<PARTITION_NUMBER>-<START_OFFSET>.<FILE_EXTENSION>
The placeholders are the following:
AWS_S3_PREFIX
: Can be any string and can include placeholders such as {{ utc_date }} and {{ local_date }} to create different files for each date.

TOPIC_NAME
: Name of the topic pushed to S3.

PARTITION_NUMBER
: Number of the topic partition.

START_OFFSET
: Starting offset of the file.

FILE_EXTENSION
: The file extension, which depends on the compression defined in the file.compression.type parameter. For example, the gz extension is generated when using gzip compression.
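As a quick illustration, the following sketch assembles an object key from the placeholders above. The prefix, topic, partition, and offset values are illustrative assumptions; the connector fills in the real values from the data it writes.

```python
from datetime import datetime, timezone

# Illustrative values; the connector derives these from the actual topic,
# partition, starting offset, and the file.compression.type setting.
prefix = f"backup/{datetime.now(timezone.utc).date()}/"  # what a {{ utc_date }} prefix could expand to
topic_name = "orders"
partition_number = 0
start_offset = 4265
file_extension = "gz"  # gzip compression produces the gz extension

object_key = f"{prefix}{topic_name}-{partition_number}-{start_offset}.{file_extension}"
print(object_key)  # e.g. backup/2024-05-01/orders-0-4265.gz (date varies)
```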
You can customise the S3 object naming and how records are grouped into files; see the file name format documentation in the connector's GitHub repository for further details.
The Apache Kafka Connect® connector creates one file per period defined
by the offset.flush.interval.ms
parameter. The file is generated for
every partition that has received at least one new message during the
period. The setting defaults to 60 seconds.
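To get a feel for how many objects this produces, here is a back-of-the-envelope sketch, assuming every partition receives at least one message in every interval (the partition count is an illustrative assumption):

```python
# Rough upper bound on the number of objects written per day for one topic.
flush_interval_ms = 60_000   # offset.flush.interval.ms, default 60 seconds
partitions = 6               # example partition count

intervals_per_day = 24 * 60 * 60 * 1000 // flush_interval_ms
objects_per_day = intervals_per_day * partitions
print(objects_per_day)  # at most 8640 objects per day in this example
```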
S3 data format
By default, data is stored in S3 in CSV format, with one record per line.
You can change the output data format to JSON or Parquet by setting the
format.output.type parameter. More details can be found in the
documentation in the GitHub connector repository.
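For example, here is a minimal sketch of registering the connector through the Kafka Connect REST API and switching the output format to JSON. The connector name, class, endpoint, bucket, and AWS parameter names are illustrative assumptions; check the connector documentation for the exact settings your version expects.

```python
import requests

# Hypothetical connector configuration; parameter names other than
# format.output.type and file.compression.type are assumptions for illustration.
connector = {
    "name": "s3-sink-orders",
    "config": {
        "connector.class": "io.aiven.kafka.connect.s3.AivenKafkaConnectS3SinkConnector",
        "topics": "orders",
        "aws.s3.bucket.name": "my-kafka-backups",
        "aws.s3.region": "eu-west-1",
        "file.compression.type": "gzip",
        "format.output.type": "json",
    },
}

# POST the configuration to the Kafka Connect REST API (default port 8083).
response = requests.post("http://localhost:8083/connectors", json=connector)
response.raise_for_status()
print(response.json())
```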
You can define the output data fields with the format.output.fields
connector configuration parameter. The message key and value, if included in the
output, are encoded in Base64.
For example, setting format.output.fields
to value,key,timestamp
results in rows in the S3 files like the following:
bWVzc2FnZV9jb250ZW50,cGFydGl0aW9uX2tleQ==,1511801218777
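To check what such a row contains, you can decode the Base64 fields, for example with Python's standard library:

```python
import base64

row = "bWVzc2FnZV9jb250ZW50,cGFydGl0aW9uX2tleQ==,1511801218777"
value_b64, key_b64, timestamp = row.split(",")

print(base64.b64decode(value_b64).decode())  # message_content
print(base64.b64decode(key_b64).decode())    # partition_key
print(timestamp)                             # 1511801218777
```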
You can disable the Base64 encoding by setting the
format.output.fields.value.encoding parameter to none.
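Building on the configuration sketch above, the relevant fragment would look like the following; the surrounding settings are unchanged.

```python
# Fragment of the connector "config" section: include the value, key and
# timestamp fields, and keep the value as plain text instead of Base64.
config_fragment = {
    "format.output.fields": "value,key,timestamp",
    "format.output.fields.value.encoding": "none",
}
```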