Use DSBulk to load, unload, and count data on Aiven for Apache Cassandra®
DSBulk is a highly configurable tool used to load, unload and count data in Apache Cassandra®. It has configurable consistency levels for loading and unloading and offers the most accurate way to count records in Cassandra.
Prerequisites
To install the latest release of DSBulk, download the .zip or
.tar.gz file from the DSBulk GitHub
repository.
You can read more about DSBulk's use cases and manual pages in the dedicated documentation.
Variables
These are the placeholders you need to replace in the code samples:
| Variable | Description |
|---|---|
| PASSWORD | Password of the avnadmin user |
| HOST | Host name for the connection |
| PORT | Port number to use for the Cassandra service |
| SSL_CERTFILE | Path of the CA Certificate for the Cassandra service |
| KEYSTORE_PASSWORD | Password to secure your keystore |
Most of the above variables and the CA Certificate file can be found on the service details page in Aiven Console.
Preparation of the environment
For dsbulk to read the security certificate when it connects to
Aiven for Apache Cassandra, the certificate must be imported into a
truststore.
- Go to Aiven Console and download the certificate from the Overview page of your Aiven for Apache Cassandra service. Save the CA certificate in a file called cassandra-certificate.pem in a directory on the Linux system where dsbulk runs.
- Create a truststore file and import the certificate into it:
    keytool -import -v \
      -trustcacerts \
      -alias CARoot \
      -file cassandra-certificate.pem \
      -keystore client.truststore \
      -storepass KEYSTORE_PASSWORD
  A truststore file called client.truststore is created in the directory where the keytool command was run.
  The keytool command assumes that cassandra-certificate.pem is in the directory where you run keytool. If it is not, provide the full path to cassandra-certificate.pem.
- The next step is to create a configuration file with the connection information.
  With a configuration file, the dsbulk command line is more readable and does not show passwords in clear text. Without a configuration file, every option must be provided explicitly on the command line.
- Create a file that contains the connection configuration, like the following:
    datastax-java-driver {
      advanced {
        ssl-engine-factory {
          keystore-password = "KEYSTORE_PASSWORD"
          keystore-path = "/home/user1/client.truststore"
          class = DefaultSslEngineFactory
          truststore-password = "KEYSTORE_PASSWORD"
          truststore-path = "/home/user1/client.truststore"
        }
        auth-provider {
          username = avnadmin
          password = "PASSWORD"
        }
      }
    }
  Replace /home/user1/client.truststore with the full path of your truststore, and use the same KEYSTORE_PASSWORD that you passed to keytool.
  The DSBulk configuration file can contain many different blocks for different configurations. In the above example, only the datastax-java-driver block is filled in. The ssl-engine-factory block contains the path of the truststore and its password.
  Tip: The Cassandra documentation has a full reference and templates of the application configuration file, and a full reference of the driver configuration file.
Run a dsbulk command to count records in a Cassandra table
Once the configuration file is created, you can run dsbulk.
- Go to the bin subdirectory of the downloaded dsbulk package.
- Run the following command:
    ./dsbulk count \
      -f /full/path/to/conf.file \
      -k baselines \
      -t keyvalue \
      -h HOST \
      -port PORT \
      --log.verbosity 2
  where:
  - baselines and keyvalue are the names of the example keyspace and table in the Cassandra database.
  - log.verbosity controls the amount of logging that dsbulk prints to the console when it runs. verbosity=2 is used only to troubleshoot problems. To reduce verbosity, lower the number to 1 or remove the option altogether.
  - -f specifies the path to the configuration file.
  - -h and -port are the hostname and port number used to connect to Cassandra.
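The count, unload, and load commands on this page share the same connection options, so they could be wrapped in a small shell function to avoid repeating them. This is only a sketch: run_dsbulk, DSBULK_BIN, and DSBULK_CONF are hypothetical names, HOST and PORT are expected as environment variables holding the values from the Variables table, and baselines and keyvalue are the example keyspace and table used on this page.

```shell
#!/bin/sh
# Sketch: wrap the connection options shared by the dsbulk commands.
# run_dsbulk, DSBULK_BIN and DSBULK_CONF are hypothetical names.
DSBULK_BIN="${DSBULK_BIN:-./dsbulk}"                    # path to the dsbulk launcher
DSBULK_CONF="${DSBULK_CONF:-/full/path/to/conf.file}"   # configuration file

run_dsbulk() {
    if [ "$#" -lt 1 ]; then
        echo "usage: run_dsbulk count|unload|load [extra options]" >&2
        return 2
    fi
    operation="$1"
    shift

    # HOST and PORT must be set in the environment; the shell stops
    # with the given message if either one is missing.
    : "${HOST:?set HOST to the service host name}"
    : "${PORT:?set PORT to the service port}"

    # Any remaining arguments are passed through unchanged.
    "$DSBULK_BIN" "$operation" \
        -f "$DSBULK_CONF" \
        -k baselines \
        -t keyvalue \
        -h "$HOST" \
        -port "$PORT" \
        "$@"
}
```

With this in place, run_dsbulk count --log.verbosity 2 would be equivalent to the count command above, and extra options such as -url can be appended for the other operations.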
Extract data from a Cassandra table in CSV format
To extract the data from a table, you can use the following command:
./dsbulk unload \
-f /full/path/to/conf.file \
-k baselines \
-t keyvalue \
-h HOST \
-port PORT \
-url /directory_for_output
This command extracts all records from the table and writes them in CSV
format to the directory specified by the -url parameter.
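To sanity-check an unload, you can compare the number of data rows written to the output directory with the result of dsbulk count. The sketch below makes three assumptions: the output directory contains only .csv files produced by the unload, each file starts with a header line (believed to be the DSBulk default, but verify this for your version), and no field contains an embedded newline. count_unloaded_rows is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: count data rows in the CSV files of an unload directory.
# count_unloaded_rows is a hypothetical helper name.
# Assumes one header line per file and no newlines inside fields.
count_unloaded_rows() {
    out_dir="$1"
    total=0
    for f in "$out_dir"/*.csv; do
        # Skip the unexpanded pattern when the directory has no CSV files.
        [ -f "$f" ] || continue
        # Arithmetic expansion strips any padding that wc adds.
        lines=$(( $(wc -l < "$f") ))
        if [ "$lines" -gt 0 ]; then
            # Do not count the header line.
            total=$(( total + lines - 1 ))
        fi
    done
    echo "$total"
}
```

Running count_unloaded_rows /directory_for_output prints a single number. If it differs from the dsbulk count result, check whether the table was written to during the unload or whether one of the assumptions above does not hold.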
Load data into a Cassandra table from a CSV file
To load data into a Cassandra table, use a command similar to the previous one:
./dsbulk load \
-f /full/path/to/conf.file \
-k baselines \
-t keyvalue \
-h HOST \
-port PORT \
-url data.csv
where data.csv is the file that contains the data to load
into Cassandra.
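Before loading a large file, a quick structural check can catch truncated or malformed rows early. The following sketch verifies that the file exists, is not empty, and that every line has the same number of comma-separated fields as the first line. It is a naive check: it treats the first line as the reference row, flags blank lines, and does not understand quoted fields that contain commas or newlines. check_csv is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: naive structural check of a CSV file before loading it.
# check_csv is a hypothetical helper name. Quoted fields that contain
# commas or newlines are not handled.
check_csv() {
    csv_file="$1"
    if [ ! -s "$csv_file" ]; then
        echo "missing or empty file: $csv_file" >&2
        return 1
    fi
    # Compare the field count of every line with that of the first line.
    # awk exits with status 1 if any line differs; messages go to stderr.
    awk -F',' '
        BEGIN { bad = 0 }
        NR == 1 { expected = NF; next }
        NF != expected {
            printf "line %d: expected %d fields, found %d\n", NR, expected, NF
            bad = 1
        }
        END { exit bad }
    ' "$csv_file" >&2
}
```

A typical use would be to run check_csv data.csv and start the dsbulk load command only if it succeeds.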