Use DSBulk to load, unload and count data on Aiven for Apache Cassandra®
DSBulk is a highly configurable tool used to load, unload and count data in Apache Cassandra®. It has configurable consistency levels for loading and unloading and offers the most accurate way to count records in Cassandra.
Prerequisites
To install the latest release of DSBulk, download the .zip or .tar.gz file from the DSBulk GitHub repository.
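As a minimal sketch of the unpacking step, the function below extracts a .tar.gz archive that has already been downloaded and prints the path of the dsbulk launcher. The function name unpack_dsbulk is hypothetical, and the sketch assumes that the archive contains a bin/dsbulk file somewhere below its top-level directory; check the layout of the release you downloaded.

```shell
# Unpack an already-downloaded DSBulk .tar.gz archive and print the path of
# the dsbulk launcher. Assumption: the archive contains a bin/dsbulk file.
# Usage: unpack_dsbulk ARCHIVE [DESTINATION_DIRECTORY]
unpack_dsbulk() {
  archive=$1
  dest=${2:-.}
  mkdir -p "$dest" || return 1
  tar -xzf "$archive" -C "$dest" || return 1
  # Locate the launcher inside the extracted tree.
  find "$dest" -type f -path '*/bin/dsbulk' | head -n 1
}
```

For example, `unpack_dsbulk ./dsbulk.tar.gz "$HOME/tools"` would print the launcher path under `$HOME/tools`, where dsbulk.tar.gz stands for whatever file name the release archive has.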
You can read more about the different DSBulk use cases and manual pages in the dedicated documentation.
Variables
These are the placeholders you will need to replace in the code sample:
Variable | Description |
---|---|
PASSWORD | Password of the avnadmin user |
HOST | Host name for the connection |
PORT | Port number to use for the Cassandra service |
SSL_CERTFILE | Path of the CA Certificate for the Cassandra service |
KEYSTORE_PASSWORD | Password to secure your keystore. |
Most of the above variables and the CA Certificate file can be found in Aiven Console, in the service detail page.
Preparation of the environment
For dsbulk to read the security certificate used to connect to Aiven for Apache Cassandra, the certificate must be imported into a truststore.
- Go to Aiven Console and download the certificate from the Overview page of your Aiven for Apache Cassandra service. Save the CA certificate in a file called cassandra-certificate.pem in a directory on the Linux system where dsbulk runs.
- Create a truststore file and import the certificate into it:

    keytool -import -v \
      -trustcacerts \
      -alias CARoot \
      -file cassandra-certificate.pem \
      -keystore client.truststore \
      -storepass KEYSTORE_PASSWORD

  A truststore file called client.truststore is created in the directory where the keytool command was launched.

  The keytool command assumes that the file cassandra-certificate.pem is in the directory where you run keytool. If that is not the case, provide the full path to cassandra-certificate.pem.
- The next step is to create a configuration file with the connection information. With a configuration file, the dsbulk command line is more readable and doesn't show passwords in clear text. If you don't create a configuration file, every option must be explicitly provided on the command line.
- Create a file that contains the connection configuration like the following:

    datastax-java-driver {
      advanced {
        ssl-engine-factory {
          keystore-password = "KEYSTORE_PASSWORD"
          keystore-path = "/home/user1/client.truststore"
          class = DefaultSslEngineFactory
          truststore-password = "KEYSTORE_PASSWORD"
          truststore-path = "/home/user1/client.truststore"
        }
        auth-provider {
          username = avnadmin
          password = PASSWORD
        }
      }
    }

  The DSBulk configuration file can contain many different blocks for different configurations. In the above example, only the datastax-java-driver block is filled in. The ssl-engine-factory block contains the path of the truststore and the related password, which must match the password passed to keytool with -storepass.

  Tip: The Cassandra documentation has both a full reference and templates of the application configuration file, and a full reference of the driver configuration file.
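Because the configuration file holds both passwords in clear text, it can help to create it with owner-only permissions. The following is a sketch under that assumption, not a required step: the function name write_dsbulk_conf and its argument order are hypothetical, and the generated content simply mirrors the fields of the example above. Passwords containing double quotes or backslashes would need escaping, which this sketch does not handle.

```shell
# Write a DSBulk configuration file that mirrors the example on this page,
# readable and writable by the owner only.
# Usage: write_dsbulk_conf CONF_PATH TRUSTSTORE_PATH KEYSTORE_PASSWORD PASSWORD
write_dsbulk_conf() {
  conf=$1
  truststore=$2
  store_pass=$3
  db_pass=$4
  # Create (or truncate) the file with restrictive permissions before
  # any secret is written to it.
  ( umask 077 && : > "$conf" ) || return 1
  chmod 600 "$conf" || return 1
  cat > "$conf" <<EOF
datastax-java-driver {
  advanced {
    ssl-engine-factory {
      keystore-password = "$store_pass"
      keystore-path = "$truststore"
      class = DefaultSslEngineFactory
      truststore-password = "$store_pass"
      truststore-path = "$truststore"
    }
    auth-provider {
      username = avnadmin
      password = "$db_pass"
    }
  }
}
EOF
}
```

A call such as `write_dsbulk_conf /full/path/to/conf.file /home/user1/client.truststore KEYSTORE_PASSWORD PASSWORD` produces a file that can be passed to dsbulk with -f.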
Run a dsbulk command to count records in a Cassandra table
Once the configuration file is created, you can run dsbulk.
- Go to the bin subdirectory of the downloaded dsbulk package.
- Run the following command:

    ./dsbulk count \
      -f /full/path/to/conf.file \
      -k baselines \
      -t keyvalue \
      -h HOST \
      -port PORT \
      --log.verbosity 2

  where:

  - baselines and keyvalue are the names of the example keyspace and table in the Cassandra database.
  - log.verbosity controls the amount of logging that is sent to standard output when dsbulk runs. verbosity=2 is used only to troubleshoot problems. To reduce verbosity, lower the number to 1 or remove the option altogether.
  - -f specifies the path to the configuration file.
  - -h and -port are the hostname and port number to connect to Cassandra.
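The count, unload and load commands on this page share the same connection options, so one way to avoid repeating them is a small wrapper function. This is only a sketch: the function name run_dsbulk and the variables DSBULK_BIN, DSBULK_CONF, KEYSPACE, TABLE, HOST and PORT are hypothetical names chosen for illustration, while the flags are the ones used in the command above.

```shell
# Run a dsbulk operation (count, unload or load) with the shared connection
# options. All variable names below are hypothetical and must be set by you.
# Usage: run_dsbulk OPERATION [EXTRA_OPTIONS...]
run_dsbulk() {
  operation=$1
  shift
  "${DSBULK_BIN:-./dsbulk}" "$operation" \
    -f "$DSBULK_CONF" \
    -k "$KEYSPACE" \
    -t "$TABLE" \
    -h "$HOST" \
    -port "$PORT" \
    "$@"
}
```

With the variables set, `run_dsbulk count --log.verbosity 2` corresponds to the command above. Setting `DSBULK_BIN=echo` prints the assembled command line instead of running it, which is a convenient way to check the options first.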
Extract data from a Cassandra table in CSV format
To extract the data from a table, you can use the following command:
./dsbulk unload \
-f /full/path/to/conf.file \
-k baselines \
-t keyvalue \
-h HOST \
-port PORT \
-url /directory_for_output
This command extracts all records from the table and writes them in CSV format to the directory specified in the -url parameter.
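As a quick sanity check after an unload, you can compare the number of exported rows with the result of dsbulk count. The sketch below is an illustration only: the function name count_unloaded_rows is hypothetical, and it assumes that the output directory contains files ending in .csv, that each file starts with a header line, and that no field contains an embedded newline. Verify these assumptions against your own output before relying on the number.

```shell
# Count the data rows in all .csv files of an unload directory.
# Assumptions: every file starts with one header line and no field
# contains a line break.
# Usage: count_unloaded_rows OUTPUT_DIRECTORY
count_unloaded_rows() {
  dir=$1
  set -- "$dir"/*.csv
  # If the pattern matched nothing, "$1" is the literal pattern, not a file.
  [ -f "$1" ] || { echo 0; return 0; }
  # FNR restarts at 1 for each file, so the first line of every file is skipped.
  awk 'FNR > 1 { rows++ } END { print rows + 0 }' "$@"
}
```

For example, `count_unloaded_rows /directory_for_output` prints the total number of data rows found in that directory.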
Load data into a Cassandra table from a CSV file
To load data into a Cassandra table, the command line is very similar to the previous command:
./dsbulk load \
-f /full/path/to/conf.file \
-k baselines \
-t keyvalue \
-h HOST \
-port PORT \
-url data.csv
where data.csv is the file that contains the data to load into Cassandra.
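Before loading, it can be useful to confirm that the CSV file exists and that its first line is the header you expect. The sketch below assumes that data.csv starts with a header line naming the table columns; the function name check_csv_header and the header `key,value` used in the example are hypothetical, since this page does not list the columns of the keyvalue table.

```shell
# Check that a CSV file exists and that its first line equals the expected
# header. A trailing carriage return (Windows line ending) is ignored.
# Usage: check_csv_header CSV_FILE EXPECTED_HEADER
check_csv_header() {
  file=$1
  expected=$2
  if [ ! -f "$file" ]; then
    echo "missing file: $file" >&2
    return 1
  fi
  actual=$(head -n 1 "$file" | tr -d '\r')
  if [ "$actual" != "$expected" ]; then
    echo "unexpected header in $file: $actual" >&2
    return 1
  fi
}
```

For example, `check_csv_header data.csv "key,value" && ./dsbulk load ...` runs the load command above only when the check passes.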