How can I install Apache Cassandra on a CentOS 8 Linux machine? Apache Cassandra is a free and open-source NoSQL database management system designed to be distributed and highly available. Cassandra can handle large amounts of data across many commodity servers with no single point of failure.

This guide walks you through the installation of Cassandra on CentOS 8. Once installation is done, we’ll configure and tune Cassandra to work on machines with minimal resources available.

Features of Cassandra

Cassandra provides the Cassandra Query Language (CQL), an SQL-like language,
to create and update database schema and access data. CQL allows users to
organize data within a cluster of Cassandra nodes using:

  • Keyspace: defines how a dataset is replicated, for example in which
    datacenters and how many copies. Keyspaces contain tables.
  • Table: defines the typed schema for a collection of partitions. New
    columns can be added to Cassandra tables flexibly, with zero downtime.
    Tables contain partitions, which contain rows, which contain columns.
  • Partition: defines the mandatory part of the primary key all rows in
    Cassandra must have. All performant queries supply the partition key in
    the query.
  • Row: contains a collection of columns identified by a unique primary key
    made up of the partition key and optionally additional clustering keys.
  • Column: a single datum with a type, which belongs to a row.
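
To make these terms concrete, here is a minimal sketch of how they map onto CQL statements. The keyspace name demo, the table demo.users, its columns, and the file name demo_schema.cql are hypothetical and chosen only for illustration. The statements are written to a file and executed only if cqlsh is already available on the machine.

```shell
# Hypothetical file name and schema, used only for illustration.
cql_file="${TMPDIR:-/tmp}/demo_schema.cql"

cat > "$cql_file" <<'EOF'
-- Keyspace: keeps a single copy of the data (enough for a one-node test).
CREATE KEYSPACE IF NOT EXISTS demo
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

-- Table: country is the partition key, user_id is a clustering key.
CREATE TABLE IF NOT EXISTS demo.users (
  country text,
  user_id uuid,
  name    text,
  PRIMARY KEY ((country), user_id)
);

-- Row: a set of columns identified by the full primary key.
INSERT INTO demo.users (country, user_id, name)
  VALUES ('KE', uuid(), 'Alice');

-- A query that supplies the partition key.
SELECT user_id, name FROM demo.users WHERE country = 'KE';
EOF

# Run the statements only when cqlsh is installed; otherwise keep the file.
if command -v cqlsh >/dev/null 2>&1; then
  cqlsh -f "$cql_file"
else
  echo "cqlsh not found; statements saved to $cql_file"
fi
```

SimpleStrategy with a replication factor of 1 is only a reasonable choice for a single test node; a real cluster would normally use a different replication setup.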

Cassandra has support for the following client drivers:

  • Java
  • Python
  • Ruby
  • C# / .NET
  • Node.js
  • PHP
  • C
  • Scala
  • Clojure
  • Erlang
  • Go
  • Haskell
  • Rust
  • Perl
  • Elixir
  • Dart

Install Apache Cassandra on CentOS 8

Cassandra requires Java to run on CentOS 8. As of this writing, the required version is Java 8. If you want to use cqlsh, you also need the latest version of Python 2.7.

Step 1: Install Java 8 and Python

sudo yum -y install epel-release python2 java-1.8.0-openjdk-devel

Confirm the installation of Java and Python.

$ java -version
openjdk version "1.8.0_242"
OpenJDK Runtime Environment (build 1.8.0_242-b08)
OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)

$ python2.7 --version
Python 2.7.16
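
If you’d rather script this check, the sketch below tests whether the default java on the PATH reports version 1.8. The helper name is_java8 is our own, and the pattern assumes the version string format shown in the output above.

```shell
# Succeed if the given `java -version` output reports Java 8 (1.8.x).
is_java8() {
  printf '%s\n' "$1" | grep -Eq 'version "1\.8\.'
}

# `java -version` prints to stderr, so redirect it to stdout before capturing.
if command -v java >/dev/null 2>&1; then
  if is_java8 "$(java -version 2>&1)"; then
    echo "Java 8 detected"
  else
    echo "Java found, but it is not version 8" >&2
  fi
else
  echo "java not found on PATH" >&2
fi
```

If several Java versions are installed, sudo alternatives --config java lets you pick which one is the default.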

Step 2: Install Apache Cassandra on CentOS 8

Now that Java and Python are installed, let’s add the Cassandra repository to our CentOS system.

sudo tee  /etc/yum.repos.d/cassandra.repo <<EOF
[cassandra]
name=Apache Cassandra
baseurl=https://www.apache.org/dist/cassandra/redhat/311x/
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://www.apache.org/dist/cassandra/KEYS
EOF

Install Apache Cassandra with the command below.

sudo yum -y install cassandra

Create a systemd unit file for the Cassandra service.

sudo tee /etc/systemd/system/cassandra.service<<EOF
[Unit]
Description=Apache Cassandra
After=network.target

[Service]
PIDFile=/var/run/cassandra/cassandra.pid
User=cassandra
Group=cassandra
ExecStart=/usr/sbin/cassandra -f -p /var/run/cassandra/cassandra.pid
Restart=always

[Install]
WantedBy=multi-user.target
EOF

Start the service and enable it to start at boot.

sudo systemctl daemon-reload
sudo systemctl start cassandra.service
sudo systemctl enable cassandra

Check service status:

$ systemctl status cassandra.service
● cassandra.service - Apache Cassandra
   Loaded: loaded (/etc/systemd/system/cassandra.service; disabled; vendor preset: disabled)
   Active: active (running) since Wed 2020-03-04 22:24:31 EAT; 2s ago
 Main PID: 8758 (java)
    Tasks: 10 (limit: 26213)
   Memory: 3.9G
   CGroup: /system.slice/cassandra.service
           └─8758 java -Xloggc:/var/log/cassandra/gc.log -ea -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=1000003 -XX:+AlwaysPreTouch -XX:-Us>

Mar 04 22:24:31 cent8.localdomain systemd[1]: Started Apache Cassandra.

You can also verify that Cassandra is running with the command below.

$ nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  70 KiB     256          100.0%            0daf41fa-22e5-4471-bc00-9aed6f566235  rack1
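
For use in scripts, the same output can be reduced to a single number. The sketch below counts the nodes whose status column reads UN (Up and Normal). The helper name count_up_normal is our own, and it assumes the column layout shown above.

```shell
# Count nodes reported as Up/Normal ("UN") in `nodetool status` output on stdin.
count_up_normal() {
  awk '$1 == "UN" { n++ } END { print n + 0 }'
}

# Only call nodetool when it is installed on this machine.
if command -v nodetool >/dev/null 2>&1; then
  up=$(nodetool status | count_up_normal)
  echo "Nodes up and normal: $up"
fi
```

On the single-node setup in this guide, the expected count is 1.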

To run a query against Cassandra, invoke the CQL shell with the command below.

$ cqlsh
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.6 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh> 

  • The default location of configuration files is /etc/cassandra.
  • The default locations of the log and data directories are /var/log/cassandra/ and /var/lib/cassandra/, respectively.

Configuring Cassandra

To run Cassandra on a single node, the default configuration file at /etc/cassandra/conf/cassandra.yaml is sufficient. For a multi-node cluster, you may need to modify this file to ensure your cluster is tuned properly.

At a minimum, you should consider setting the following properties:

  • cluster_name: the name of your cluster.
  • seeds: a comma-separated list of the IP addresses of your cluster seeds.
  • storage_port: you don’t necessarily need to change this, but make sure that no firewall is blocking this port.
  • listen_address: the IP address of your node. This is what allows other nodes to communicate with this node, so it is important that you change it.
  • native_transport_port: as for storage_port, make sure this port is not blocked by firewalls as clients will communicate with Cassandra on this port.
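
As a rough sketch of how two of these properties could be changed from the command line, the helper below rewrites simple top-level keys with sed. The helper name set_yaml_key, the cluster name 'Demo Cluster', and the address 192.168.1.10 are hypothetical. The script edits a working copy, not the live file, and falls back to a two-line sample when cassandra.yaml is not present. The nested seeds list is not handled here and is easier to edit by hand.

```shell
# Set a top-level "key: value" line in a YAML file (hypothetical helper).
# Only simple scalars are handled; values must not contain "|" or "&".
set_yaml_key() {
  file=$1 key=$2 value=$3
  sed -i "s|^${key}:.*|${key}: ${value}|" "$file"
}

conf=/etc/cassandra/conf/cassandra.yaml
work=$(mktemp)
if [ -r "$conf" ]; then
  cp "$conf" "$work"
else
  # Tiny sample so the sketch still runs on a machine without Cassandra.
  printf '%s\n' "cluster_name: 'Test Cluster'" "listen_address: localhost" > "$work"
fi

set_yaml_key "$work" cluster_name "'Demo Cluster'"
set_yaml_key "$work" listen_address 192.168.1.10

# Show the result; review it before copying it over the real file.
grep -E '^(cluster_name|listen_address):' "$work"

# With firewalld, the default ports (7000 for storage_port, 9042 for
# native_transport_port) could be opened like this:
#   sudo firewall-cmd --permanent --add-port=7000/tcp
#   sudo firewall-cmd --permanent --add-port=9042/tcp
#   sudo firewall-cmd --reload
```

Working on a copy keeps the running node untouched until you have checked the changes. Applying them means copying the file back as root and restarting the cassandra service.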

Changing the location of directories

The cassandra.yaml configuration file controls the following data directories:

  • data_file_directories: one or more directories where data files are located.
  • commitlog_directory: the directory where commitlog files are located.
  • saved_caches_directory: the directory where saved caches are located.
  • hints_directory: the directory where hints are located.

For performance reasons, if you have multiple disks, consider putting commitlog and data files on different disks.
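
A quick way to see whether two directories already sit on different filesystems is to compare their device IDs. The sketch below assumes the data and commitlog directories are /var/lib/cassandra/data and /var/lib/cassandra/commitlog, so adjust the paths if your cassandra.yaml says otherwise. The helper names device_of and same_device are our own.

```shell
# Print the ID of the device that holds a path (GNU stat, as shipped on CentOS).
device_of() {
  stat -c '%d' "$1"
}

# Succeed if both paths live on the same device.
same_device() {
  [ "$(device_of "$1")" = "$(device_of "$2")" ]
}

# Assumed default locations; adjust to match your cassandra.yaml.
data_dir=/var/lib/cassandra/data
commitlog_dir=/var/lib/cassandra/commitlog

if [ -d "$data_dir" ] && [ -d "$commitlog_dir" ]; then
  if same_device "$data_dir" "$commitlog_dir"; then
    echo "data and commitlog share a device"
  else
    echo "data and commitlog are on different devices"
  fi
fi
```

Note that a device ID identifies a filesystem, not a physical disk: two partitions on the same disk report different IDs, so treat the result as a hint only.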

Setting Environment variables

JVM-level settings such as heap size are set in cassandra-env.sh. You can add any additional JVM command-line arguments to the JVM_OPTS environment variable; they are passed to the JVM when Cassandra starts.
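
For a machine with little memory, the heap can be capped explicitly. The sketch below assumes cassandra-env.sh lives in /etc/cassandra/conf and still carries the commented-out MAX_HEAP_SIZE and HEAP_NEWSIZE lines. The helper name tune_env_file and the values 1G and 256M are illustrative, not recommendations. As before, it edits a working copy and falls back to a small sample when the real file is absent.

```shell
# Uncomment and set the heap variables, then append one extra JVM argument.
# Assumes the file contains "#MAX_HEAP_SIZE=..." and "#HEAP_NEWSIZE=..." lines.
tune_env_file() {
  file=$1 max_heap=$2 new_size=$3
  sed -i \
    -e "s|^#MAX_HEAP_SIZE=.*|MAX_HEAP_SIZE=\"${max_heap}\"|" \
    -e "s|^#HEAP_NEWSIZE=.*|HEAP_NEWSIZE=\"${new_size}\"|" \
    "$file"
  # Single quotes keep $JVM_OPTS literal so it expands when Cassandra starts.
  echo 'JVM_OPTS="$JVM_OPTS -Djava.net.preferIPv4Stack=true"' >> "$file"
}

env_file=/etc/cassandra/conf/cassandra-env.sh
work=$(mktemp)
if [ -r "$env_file" ]; then
  cp "$env_file" "$work"
else
  # Tiny sample so the sketch still runs on a machine without Cassandra.
  printf '%s\n' '#MAX_HEAP_SIZE="4G"' '#HEAP_NEWSIZE="800M"' > "$work"
fi

tune_env_file "$work" 1G 256M

# Show the lines that changed; review them before replacing the real file.
grep -E '^(MAX_HEAP_SIZE|HEAP_NEWSIZE|JVM_OPTS=.*preferIPv4Stack)' "$work"
```

The two heap variables are changed together because cassandra-env.sh expects them to be set as a pair. The extra argument is appended at the end of the file, where JVM_OPTS has already been built up.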

Cassandra Logging

The logger in use is logback. You can change logging properties by editing logback.xml. By default, it logs at INFO level into a file called system.log and at DEBUG level into a file called debug.log. When running in the foreground, it also logs at INFO level to the console.
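
When looking through these files, it often helps to filter by level. The sketch below assumes that each log line starts with its level (INFO, WARN, ERROR and so on), as in the default logback pattern. The helper name log_lines_at_level is our own.

```shell
# Print the lines of a log file whose first field equals the given level.
log_lines_at_level() {
  level=$1 file=$2
  awk -v lvl="$level" '$1 == lvl' "$file"
}

# Show the 20 most recent warnings, if the log file is readable.
log=/var/log/cassandra/system.log
if [ -r "$log" ]; then
  log_lines_at_level WARN "$log" | tail -n 20
fi
```

Stack traces and other continuation lines do not start with a level, so this filter shows only the first line of each multi-line entry.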

Refer to the official guide for client configuration.