Use DSBulk to load, unload and count data on Aiven service for Cassandra®

DSBulk is a highly configurable tool used to load, unload and count data in Apache Cassandra®. It has configurable consistency levels for loading and unloading and offers the most accurate way to count records in Cassandra.

Prerequisites

To install the latest release of DSBulk, download the .zip or .tar.gz file from the DSBulk GitHub repository.

tip

You can read more about the different DSBulk use cases and its manual pages in the dedicated documentation.

Variables

These are the placeholders you need to replace in the code samples:

Variable            Description
PASSWORD            Password of the avnadmin user
HOST                Host name for the connection
PORT                Port number to use for the Cassandra service
SSL_CERTFILE        Path of the CA certificate for the Cassandra service
KEYSTORE_PASSWORD   Password to secure your keystore
tip

Most of the above variables and the CA Certificate file can be found in Aiven Console, in the service detail page.

Preparation of the environment

For dsbulk to read the security certificate used to connect to Aiven service for Cassandra, the certificate must be imported into a truststore.

  1. Go to Aiven Console and download the certificate from the Overview page of your Aiven for Apache Cassandra service. Save the CA certificate in a file called cassandra-certificate.pem in a directory on the Linux system where dsbulk runs.

  2. Create a truststore file and import the certificate into it:

    keytool -import -v \
    -trustcacerts \
    -alias CARoot \
    -file cassandra-certificate.pem \
    -keystore client.truststore \
    -storepass KEYSTORE_PASSWORD

    A truststore file called client.truststore is created in the directory where you ran the keytool command.

    The keytool command assumes that cassandra-certificate.pem is in the directory where you run keytool. If it is not, provide the full path to cassandra-certificate.pem.

  3. Create a configuration file with the connection information.

    A configuration file makes the dsbulk command line more readable and keeps passwords off the command line. Without a configuration file, every option must be provided explicitly on the command line.

  4. Add the connection configuration to the file, for example:

    datastax-java-driver {
      advanced {
        ssl-engine-factory {
          keystore-password = "KEYSTORE_PASSWORD"
          keystore-path = "/home/user1/client.truststore"
          class = DefaultSslEngineFactory
          truststore-password = "KEYSTORE_PASSWORD"
          truststore-path = "/home/user1/client.truststore"
        }
        auth-provider {
          username = avnadmin
          password = "PASSWORD"
        }
      }
    }

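    The DSBulk configuration file can contain several blocks for different configurations. In the example above, only the datastax-java-driver block is filled in. The ssl-engine-factory block contains the path of the truststore and its password, which must match the password given to keytool with -storepass. Replace /home/user1/client.truststore with the actual location of your truststore file.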
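Because the configuration file holds both the avnadmin password and the truststore password, it is worth restricting who can read it. The following is a minimal sketch, not a required step. It assumes the file is called dsbulk.conf (an illustrative name, this guide does not prescribe one) and that the PASSWORD and KEYSTORE_PASSWORD placeholders are supplied as environment variables. It writes the same block as the example above and makes the file readable by its owner only:

```shell
#!/bin/sh
# Sketch: write the connection configuration and restrict access to it.
# CONF_FILE, TRUSTSTORE_PATH and the environment variable names are
# assumptions made for this example; the fallback values are placeholders.
CONF_FILE="${CONF_FILE:-./dsbulk.conf}"
TRUSTSTORE_PATH="${TRUSTSTORE_PATH:-/home/user1/client.truststore}"
PASSWORD="${PASSWORD:-PASSWORD}"
KEYSTORE_PASSWORD="${KEYSTORE_PASSWORD:-KEYSTORE_PASSWORD}"

umask 077   # files created from here on get no group/other permissions

# Note: a password that contains a double quote or a backslash would need
# escaping before being written between the quotes below.
cat > "$CONF_FILE" <<EOF
datastax-java-driver {
  advanced {
    ssl-engine-factory {
      keystore-password = "$KEYSTORE_PASSWORD"
      keystore-path = "$TRUSTSTORE_PATH"
      class = DefaultSslEngineFactory
      truststore-password = "$KEYSTORE_PASSWORD"
      truststore-path = "$TRUSTSTORE_PATH"
    }
    auth-provider {
      username = avnadmin
      password = "$PASSWORD"
    }
  }
}
EOF

chmod 600 "$CONF_FILE"   # owner read/write only, also if the file already existed
```

The path of the generated file is the value to pass to dsbulk with the -f option.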

Run a dsbulk command to count records in a Cassandra table

Once the configuration file is created, you can run dsbulk.

  1. Go to the bin subdirectory of the downloaded dsbulk package.

  2. Run the following command:

    ./dsbulk count \
    -f /full/path/to/conf.file \
    -k baselines \
    -t keyvalue \
    -h HOST \
    -port PORT \
    --log.verbosity 2

    where:

    • baselines and keyvalue are the names of the example keyspace and table in the Cassandra database.
    • log.verbosity controls the amount of logging that is sent to standard output when dsbulk runs. A value of 2 is intended for troubleshooting only. To reduce verbosity, lower the value to 1 or remove the option altogether.
    • -f specifies the path to the configuration file.
    • -h and -port are the hostname and port number used to connect to Cassandra.
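The connection options used above (-f, -h and -port) do not depend on the operation, so they can be kept in one place when dsbulk is run repeatedly against the same service. The following is a rough sketch of such a wrapper. The function name run_dsbulk and the DRY_RUN switch are inventions of this example and not part of dsbulk; the remaining variables are the placeholders from this guide.

```shell
#!/bin/sh
# Sketch: keep the shared connection options in one place.
# DSBULK, CONF_FILE, HOST and PORT are placeholders to replace.
# DRY_RUN is a switch of this wrapper only, not a dsbulk option.
DSBULK="${DSBULK:-./dsbulk}"
CONF_FILE="${CONF_FILE:-/full/path/to/conf.file}"
HOST="${HOST:-HOST}"
PORT="${PORT:-PORT}"
DRY_RUN="${DRY_RUN:-1}"

# run_dsbulk OPERATION [extra options...]
run_dsbulk() {
  operation="$1"
  shift
  # Prepend the executable, the operation and the shared options
  # to whatever extra options the caller passed.
  set -- "$DSBULK" "$operation" -f "$CONF_FILE" -h "$HOST" -port "$PORT" "$@"
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"   # print the command instead of running it
  else
    "$@"
  fi
}

# With DRY_RUN=1 this only prints the assembled command line.
run_dsbulk count -k baselines -t keyvalue
```

Setting DRY_RUN=0 runs the assembled command instead of printing it.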

Extract data from a Cassandra table in CSV format

To extract the data from a table, you can use the following command:

./dsbulk unload \
-f /full/path/to/conf.file \
-k baselines \
-t keyvalue \
-h HOST \
-port PORT \
-url /directory_for_output

This command extracts all records from the table and writes them in CSV format to the directory specified with the -url parameter.
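As a quick sanity check, the number of rows written to the output directory can be compared with the result of dsbulk count. The sketch below assumes that every CSV file in the directory starts with a single header line (which is believed to be the dsbulk default for unload) and that no field contains a line break; if either assumption does not hold, the total will be off. The function name count_csv_rows is hypothetical.

```shell
#!/bin/sh
# count_csv_rows DIR
# Print the number of data rows found in DIR/*.csv.
# Assumes one header line per file and no line breaks inside fields.
count_csv_rows() {
  total=0
  for f in "$1"/*.csv; do
    [ -f "$f" ] || continue                 # skip when the glob matched nothing
    lines=$(awk 'END { print NR }' "$f")    # also counts a last line without newline
    if [ "$lines" -gt 0 ]; then
      total=$((total + lines - 1))          # subtract the header line
    fi
  done
  echo "$total"
}
```

For the unload command above, the call would be count_csv_rows /directory_for_output.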

Load data into a Cassandra table from a CSV file

To load data into a Cassandra table, use a command line very similar to the previous one:

./dsbulk load \
-f /full/path/to/conf.file \
-k baselines \
-t keyvalue \
-h HOST \
-port PORT \
-url data.csv

where data.csv is the file that contains the data to load into Cassandra.
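To try the load command without an existing export, a small input file can be created by hand. The sketch below assumes that the example table has two columns named key and value; this guide does not show the table schema, so the column names are a guess and must be adjusted to your table. It also assumes that the first line of the file is treated as a header naming the columns, which is believed to be the dsbulk default for CSV input.

```shell
#!/bin/sh
# Sketch: create a small CSV file to test the load command with.
# The column names "key" and "value" are assumptions about the example
# table baselines.keyvalue; replace them with your table's columns.
cat > data.csv <<'EOF'
key,value
1,first value
2,second value
3,"a value, with a comma"
EOF
```

The third data row shows a field wrapped in double quotes because it contains the delimiter.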