Auth Ā· TLS Ā· Snapshots Ā· Medusa Ā· ScyllaDB Ā· C++ Rewrite

Security, Backup & ScyllaDB

Production Cassandra requires authentication, encryption, and backup strategies. Understanding the security model and disaster recovery options is essential for operating at scale.

01

Authentication & Authorization

By default, Cassandra ships with authentication and authorization disabled — anyone can connect and do anything. Production clusters must enable both. Cassandra uses a role-based access control (RBAC) model where roles can be granted permissions on specific resources.

cassandra.yaml
# Enable authentication (default: AllowAllAuthenticator)
authenticator: PasswordAuthenticator

# Enable authorization (default: AllowAllAuthorizer)
authorizer: CassandraAuthorizer

# Role management
role_manager: CassandraRoleManager

# Credentials cache: how long authentication results are reused
# (longer = fewer expensive bcrypt checks, but revocations take longer to apply)
credentials_validity_in_ms: 2000
credentials_update_interval_in_ms: 1000
rbac-setup.cql
-- Default superuser (change password immediately!)
-- Username: cassandra, Password: cassandra
ALTER ROLE cassandra WITH PASSWORD = 'new_secure_password';

-- Create application roles
CREATE ROLE app_readwrite WITH PASSWORD = 'strong_pass_123'
  AND LOGIN = true;

CREATE ROLE app_readonly WITH PASSWORD = 'strong_pass_456'
  AND LOGIN = true;

CREATE ROLE admin_role WITH PASSWORD = 'admin_pass_789'
  AND LOGIN = true AND SUPERUSER = true;

-- Grant permissions
GRANT SELECT ON KEYSPACE my_app TO app_readonly;
GRANT SELECT, MODIFY ON KEYSPACE my_app TO app_readwrite;
GRANT ALL PERMISSIONS ON ALL KEYSPACES TO admin_role;

-- Fine-grained permissions
GRANT SELECT ON TABLE my_app.users TO app_readonly;
GRANT MODIFY ON TABLE my_app.sessions TO app_readwrite;

-- Revoke permissions
REVOKE MODIFY ON KEYSPACE my_app FROM app_readonly;

-- List permissions
LIST ALL PERMISSIONS OF app_readwrite;
Permission      | Applies To      | Description
SELECT          | Keyspace, Table | Read data
MODIFY          | Keyspace, Table | INSERT, UPDATE, DELETE
CREATE          | Keyspace, Table | Create new keyspaces/tables
ALTER           | Keyspace, Table | Modify schema
DROP            | Keyspace, Table | Delete keyspaces/tables
AUTHORIZE       | Any resource    | Grant/revoke permissions to others
ALL PERMISSIONS | Any resource    | All of the above

Change Default Credentials First

The default superuser (cassandra/cassandra) is well-known. The first step after enabling authentication is changing this password and creating a new superuser role. Then disable or restrict the default cassandra user. Automated scanners actively probe for default Cassandra credentials.
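The credentials cache settings shown in cassandra.yaml above can be illustrated with a toy TTL cache. This is a sketch of the idea only, not Cassandra's actual implementation: successful and failed authentication results are reused until they expire, so the expensive bcrypt comparison runs far less often than once per request.

```python
import time

# Toy TTL cache illustrating credentials_validity_in_ms: auth results are
# reused until they expire. Illustrative sketch only -- not Cassandra's code.
class CredentialsCache:
    def __init__(self, validity_ms: int):
        self.validity_s = validity_ms / 1000.0
        self._cache = {}   # (user, password) -> (result, cached_at)

    def check(self, user, password, verify):
        key = (user, password)
        entry = self._cache.get(key)
        now = time.monotonic()
        if entry is not None and now - entry[1] < self.validity_s:
            return entry[0]          # cache hit: skip the expensive check
        ok = verify(user, password)  # cache miss: do the real (slow) check
        self._cache[key] = (ok, now)
        return ok

calls = []                           # count how often the slow path runs
def slow_verify(user, password):     # stand-in for a bcrypt comparison
    calls.append(user)
    return password == "s3cret"

cache = CredentialsCache(validity_ms=2000)
assert cache.check("app", "s3cret", slow_verify) is True
assert cache.check("app", "s3cret", slow_verify) is True   # served from cache
assert len(calls) == 1
```

The trade-off is the same one the yaml comment describes: a longer validity window means cheaper logins, but a revoked credential can keep working until its cache entry expires.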

02

Encryption (TLS & At Rest)

Cassandra supports two types of TLS encryption: client-to-node (protecting data in transit from applications) and node-to-node (protecting inter-cluster communication including gossip, streaming, and replication).

cassandra.yaml
# Client-to-node encryption (application → Cassandra)
client_encryption_options:
  enabled: true
  optional: false          # Require TLS (reject unencrypted connections)
  keystore: /etc/cassandra/keystore.jks
  keystore_password: changeit
  truststore: /etc/cassandra/truststore.jks
  truststore_password: changeit
  protocol: TLS            # TLSv1.2 minimum recommended
  cipher_suites:
    - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
    - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
  require_client_auth: false  # Set true for mutual TLS

# Node-to-node encryption (Cassandra ↔ Cassandra)
server_encryption_options:
  internode_encryption: all  # Options: none, dc, rack, all
  keystore: /etc/cassandra/keystore.jks
  keystore_password: changeit
  truststore: /etc/cassandra/truststore.jks
  truststore_password: changeit
  require_client_auth: true  # Mutual TLS between nodes
Encryption Type    | Protects                                  | Configuration
Client-to-node TLS | App ↔ Cassandra communication             | client_encryption_options
Node-to-node TLS   | Inter-node gossip, streaming, replication | server_encryption_options
At-rest encryption | Data files on disk (SSTables, commit log) | Transparent Data Encryption (TDE)

Encryption at Rest

Cassandra 5.0+ supports Transparent Data Encryption (TDE) for SSTables and commit logs. For earlier versions, use filesystem-level encryption (LUKS, dm-crypt) or cloud provider encryption (AWS EBS encryption, GCP disk encryption).

Encryption Best Practices

  • āœ… Enable both client-to-node AND node-to-node TLS in production
  • āœ… Use TLSv1.2 or higher — disable SSLv3 and TLSv1.0/1.1
  • āœ… Rotate certificates before expiry — automate with cert-manager or similar
  • āœ… Use mutual TLS (mTLS) for node-to-node to prevent rogue nodes joining
  • āœ… For at-rest: use filesystem encryption (LUKS) or cloud-native disk encryption

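The "TLSv1.2 or higher" rule from the checklist above can be sketched with Python's standard ssl module, roughly what a client application would configure before connecting to the TLS port (9142). The CA path is a hypothetical placeholder.

```python
import ssl

# Client-side TLS context enforcing TLSv1.2+ -- SSLv3 and TLSv1.0/1.1 are
# rejected at the protocol level, matching the best-practice list above.
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ctx.minimum_version = ssl.TLSVersion.TLSv1_2   # refuse older protocols
ctx.verify_mode = ssl.CERT_REQUIRED            # verify the server certificate
ctx.check_hostname = True                      # and that it matches the host
# ctx.load_verify_locations("/etc/cassandra/ca.pem")  # hypothetical CA path

assert ctx.minimum_version == ssl.TLSVersion.TLSv1_2
```

A driver would be handed a context like this (the exact mechanism varies by driver); for mutual TLS you would additionally load a client certificate and key.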
03

Network Security

Cassandra uses several network ports for different purposes. Proper firewall configuration is essential — only expose the minimum required ports to the minimum required networks.

Port | Purpose                                      | Access Scope
9042 | CQL native transport (client connections)    | Application servers only
9142 | CQL native transport with TLS                | Application servers only
7000 | Inter-node communication (gossip, streaming) | Cassandra nodes only
7001 | Inter-node communication with TLS            | Cassandra nodes only
7199 | JMX monitoring (nodetool)                    | Admin/monitoring hosts only
9160 | Thrift (legacy, deprecated)                  | Disable entirely
firewall-rules.txt
Network Security Architecture:

ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│ Application Layer (VPC/Subnet A)                        │
│                                                         │
│  [App Server 1]  [App Server 2]  [App Server 3]        │
│       │               │               │                 │
│       ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜                 │
│                       │ Port 9042/9142 only             │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜
                        │
                   ā”Œā”€ā”€ā”€ā”€ā–¼ā”€ā”€ā”€ā”€ā” (Security Group / Firewall)
                   │ Allow:  │
                   │ 9042    │ from app subnet
                   │ 7000    │ from cassandra subnet
                   │ 7199    │ from admin subnet
                   ā””ā”€ā”€ā”€ā”€ā”¬ā”€ā”€ā”€ā”€ā”˜
                        │
ā”Œā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”¼ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”
│ Cassandra Layer (VPC/Subnet B)                          │
│                       │                                  │
│  [Cass Node 1] ◄────►[Cass Node 2] ◄────► [Cass Node 3]│
│       Port 7000/7001 (inter-node)                       │
ā””ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”€ā”˜

Rules:
  āœ… 9042: Only from application subnet
  āœ… 7000/7001: Only between Cassandra nodes
  āœ… 7199: Only from admin/monitoring hosts
  āŒ Never expose 7000/7199 to the internet
  āŒ Never expose 9042 to the internet (use VPN/bastion)

Bind Address Configuration

Configure listen_address (for inter-node) and rpc_address (for client connections) to bind to private IPs only. Never bind to 0.0.0.0 in production unless behind a firewall. For multi-DC, use broadcast_address for the public/cross-DC IP and listen_address for the private/intra-DC IP.
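The firewall intent in the diagram above can be written down as data and checked mechanically. This is an illustrative sketch (subnet names are invented labels, not a real firewall API), but encoding port policy this way makes the default-deny rule explicit.

```python
# Port -> subnets allowed to reach it; anything absent is denied everywhere.
# Subnet labels ("app", "cassandra", "admin") are illustrative.
RULES = {
    9042: {"app"},        # CQL: application subnet only
    9142: {"app"},        # CQL over TLS: application subnet only
    7000: {"cassandra"},  # inter-node: cluster subnet only
    7001: {"cassandra"},
    7199: {"admin"},      # JMX: admin/monitoring hosts only
    # 9160 (Thrift) intentionally absent: deny entirely
}

def allowed(port: int, source_subnet: str) -> bool:
    """Default-deny: a port/source pair is allowed only if listed."""
    return source_subnet in RULES.get(port, set())

assert allowed(9042, "app")
assert not allowed(9042, "internet")      # never expose CQL publicly
assert not allowed(7199, "app")           # JMX is admin-only
assert not allowed(9160, "cassandra")     # Thrift disabled everywhere
```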

04

Snapshots & Backup

Cassandra snapshots create instant, zero-copy backups using filesystem hardlinks. Because SSTables are immutable, a snapshot is just a set of hardlinks to existing files — it completes in milliseconds regardless of data size.

1

Flush Memtables

nodetool snapshot first flushes all memtables to SSTables (ensures all data is on disk)

2

Create Hardlinks

Creates hardlinks to all SSTable files in a snapshot directory — instant, no data copied

3

Snapshot Complete

Snapshot directory contains hardlinks. Original SSTables can be compacted without affecting the snapshot.

4

Upload to Remote Storage

Copy snapshot files to S3/GCS for off-node backup (this is the slow part)
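Steps 2-3 above rest on a filesystem property worth seeing directly: a hardlink is a second directory entry for the same inode, so creating one copies no data, and the snapshot entry survives even when the original name is removed (as compaction eventually does). A minimal demonstration with stand-in file names:

```python
import os
import tempfile

# Demonstrate the zero-copy snapshot mechanism: hardlink an "SSTable",
# then delete the original and show the snapshot link still has the data.
with tempfile.TemporaryDirectory() as d:
    sstable = os.path.join(d, "data-1-big-Data.db")   # stand-in SSTable name
    snapshot = os.path.join(d, "snapshot-Data.db")
    with open(sstable, "wb") as f:
        f.write(b"immutable sstable contents")

    os.link(sstable, snapshot)                 # the instant, zero-copy step
    assert os.stat(sstable).st_ino == os.stat(snapshot).st_ino  # same inode
    assert os.stat(snapshot).st_nlink == 2     # two names, one set of blocks

    os.remove(sstable)                         # e.g. compaction drops it
    with open(snapshot, "rb") as f:            # snapshot still readable
        assert f.read() == b"immutable sstable contents"
```

This is why snapshots complete in milliseconds regardless of data size, and why only step 4 (uploading off-node) costs real time and bandwidth.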

backup-commands.sh
# Take a snapshot of all keyspaces
nodetool snapshot -t daily_backup_2024_01_15

# Take a snapshot of a specific keyspace
nodetool snapshot -t my_snapshot my_keyspace

# Snapshot location:
# <data_dir>/<keyspace>/<table>/snapshots/<snapshot_name>/

# List snapshots
nodetool listsnapshots

# Clear a snapshot (free disk space)
nodetool clearsnapshot -t daily_backup_2024_01_15

# Restore from snapshot:
# 1. Stop Cassandra
# 2. Clear commitlog and data directories
# 3. Copy snapshot SSTable files to the table's data directory
# 4. Start Cassandra
# 5. Run nodetool repair to ensure consistency

Backup Tools

Tool                      | Type           | Features
nodetool snapshot         | Built-in       | Local hardlinks, manual upload to S3
Medusa (Spotify)          | Open source    | Automated S3/GCS backup, point-in-time restore, cluster-wide coordination
Priam (Netflix)           | Open source    | AWS-focused, automated backup/restore, token management
Instaclustr Shotover      | Commercial     | Continuous backup, minimal RPO
Cloud provider snapshots  | Infrastructure | EBS/disk snapshots (not Cassandra-aware)

Snapshots Are Per-Node

A snapshot only captures data on the local node. For a full cluster backup, you must snapshot every node. Tools like Medusa coordinate this across the cluster and upload to object storage. For restore, you need snapshots from enough nodes to cover all token ranges (at least RF nodes per range).
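The coverage requirement in the note above (at least one snapshotted replica per token range) can be sketched as a simple set check. The toy topology and RF=2 replica assignment here are invented for illustration; real tools like Medusa work from the cluster's actual token map.

```python
# Toy replica map: token range -> nodes holding a replica (RF=2, invented).
replicas = {
    0: {"node1", "node2"},
    1: {"node2", "node3"},
    2: {"node3", "node1"},
    3: {"node1", "node2"},
}

def covered(snapshotted: set) -> bool:
    """True if every token range has at least one snapshotted replica."""
    return all(nodes & snapshotted for nodes in replicas.values())

assert covered({"node1", "node2", "node3"})   # full cluster: always safe
assert covered({"node1", "node3"})            # subset can still cover all ranges
assert not covered({"node1"})                 # range 1 has no snapshot -> data loss
```

The same check explains why a restore needs snapshots from enough nodes: any range whose replicas were all skipped is simply gone.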

05

Disaster Recovery

Cassandra's multi-DC replication is the primary disaster recovery mechanism. With data replicated across DCs, a full DC failure is survivable without any restore process — the remaining DCs continue serving traffic immediately.

Scenario                    | Recovery Method                       | RPO                         | RTO
Single node failure         | Automatic (replicas serve traffic)    | 0                           | 0 (instant)
Rack failure                | Automatic (cross-rack replicas)       | 0                           | 0 (instant)
Full DC failure (multi-DC)  | Automatic (other DCs serve traffic)   | ~100ms                      | 0 (instant)
Full DC failure (single-DC) | Restore from backup                   | Minutes to hours            | Hours
Data corruption (logical)   | Point-in-time restore from backup     | Depends on backup frequency | Hours
Accidental deletion         | Restore specific table from snapshot  | Last snapshot time          | Minutes to hours
dr-strategy.txt
Disaster Recovery Strategy (Multi-DC):

Primary defense: Multi-DC replication
  - RF=3 in each of 2+ DCs
  - LOCAL_QUORUM ensures each DC is self-sufficient
  - DC failure = automatic failover (no action needed)

Secondary defense: Regular backups
  - Daily snapshots uploaded to S3/GCS (cross-region)
  - Medusa for automated, coordinated cluster backup
  - Retain 7-30 days of snapshots

Recovery procedures:
  1. Single node: replace with new node, run repair
  2. Multiple nodes: replace, rebuild from remaining replicas
  3. Full DC: add new DC, run nodetool rebuild
  4. Logical corruption: restore from last known-good snapshot
  5. Complete loss: restore all nodes from backup, repair

Testing:
  - Quarterly DR drills (restore from backup to test cluster)
  - Chaos engineering (kill nodes/racks, verify automatic recovery)
  - Validate backup integrity (restore and query sample data)

Multi-DC Is Your Best DR

If you can afford it, multi-DC replication is far superior to backup/restore for DR. Recovery is instant (zero RTO), data loss is minimal (sub-second RPO), and no manual intervention is needed. Backups are still necessary for logical corruption (bad data written to all replicas) but not for infrastructure failures.
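The "each DC is self-sufficient" claim behind LOCAL_QUORUM is just arithmetic, shown here as a small sketch: with RF replicas per DC, quorum is RF // 2 + 1, so a DC keeps serving LOCAL_QUORUM reads and writes while up to RF - quorum of its local replicas are down.

```python
# LOCAL_QUORUM math: how many local replica failures a DC tolerates.
def local_quorum(rf: int) -> int:
    return rf // 2 + 1          # majority of replicas within one DC

def tolerable_failures(rf: int) -> int:
    return rf - local_quorum(rf)

assert local_quorum(3) == 2
assert tolerable_failures(3) == 1   # RF=3: survive 1 down replica per DC
assert tolerable_failures(5) == 2   # RF=5: survive 2, at higher storage cost
```

Note that even RFs are a poor deal: RF=4 still only tolerates one failure per DC (quorum is 3), which is why RF=3 is the common choice.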

06

ScyllaDB — The C++ Alternative

ScyllaDB is a ground-up C++ rewrite of Cassandra built on the Seastar framework. It eliminates the JVM entirely, using a shard-per-core architecture where each CPU core operates independently with its own memory, I/O, and data. The result is 5-10x higher throughput with consistent, low tail latency.

ScyllaDB Key Advantages

  • āœ… No GC pauses — C++ with manual memory management, no stop-the-world events
  • āœ… Shard-per-core — each CPU core is independent, no locks, no shared state
  • āœ… 5-10x throughput per node — fewer nodes needed for same workload
  • āœ… Consistent tail latency — p99 stays low without GC spikes
  • āœ… Drop-in compatible — same CQL, same drivers, same tools (nodetool equivalent: scylla-tools)
  • āœ… Automatic tuning — self-optimizing I/O scheduler and memory allocation

scylladb-migration.txt
Migrating from Cassandra to ScyllaDB:

1. Schema: Export with cqlsh, import directly (100% CQL compatible)
2. Data: Use Spark migrator, sstableloader, or ScyllaDB Migrator
3. Drivers: Same Cassandra drivers work unchanged
4. Operations: scylla replaces cassandra process
              nodetool works the same (or use scylla-tools)
5. Configuration: scylla.yaml similar to cassandra.yaml

Key differences in operation:
  - No JVM tuning needed (no heap, no GC configuration)
  - CPU pinning is automatic (shard-per-core)
  - I/O scheduler is self-tuning
  - Compaction runs per-shard (parallel, no global lock)
  - Repair is faster (parallel, per-shard)

Typical migration result:
  Cassandra: 12 nodes, i3.2xlarge, 8 GB heap each
  ScyllaDB:  3 nodes, i3.2xlarge (same hardware, 4x fewer nodes)
  Same throughput, lower p99 latency
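The shard-per-core idea can be sketched in a few lines: every partition deterministically maps to exactly one shard (core), so shards never contend and need no locks. The hash and modulo mapping below are illustrative stand-ins; ScyllaDB's real algorithm works on Murmur3 token bits, and shard-aware drivers use it to send each request directly to the owning core.

```python
# Toy sketch of shard-per-core routing (illustrative, not Scylla's algorithm).
N_SHARDS = 8   # e.g. one shard per CPU core

def owning_shard(partition_key: str) -> int:
    token = hash(partition_key)   # stand-in for the Murmur3 token
    return token % N_SHARDS       # deterministic owner; no shared state

# The same key always routes to the same shard, so that shard can own the
# key's memtable, cache, and SSTables exclusively -- no locking required.
s = owning_shard("user:42")
assert s == owning_shard("user:42")
assert 0 <= s < N_SHARDS
```

This is also why compaction and repair parallelize so well in ScyllaDB: each shard compacts and repairs only its own slice of the data.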

Choose ScyllaDB When

  • Tail latency (p99) is critical
  • Want fewer nodes (lower infrastructure cost)
  • GC pauses are causing issues in Cassandra
  • Starting a new project (no migration needed)
  • Need higher throughput per node

Stay with Cassandra When

  • Large existing Cassandra investment
  • Team has deep Cassandra expertise
  • Need Apache governance / community
  • Using Cassandra-specific features (MVs, SASI)
  • Prefer fully open-source (no enterprise tier)
07

Cassandra vs Alternatives

Choosing the right database depends on your access patterns, consistency requirements, operational capabilities, and scale. Here's how Cassandra compares to common alternatives.

Aspect            | Cassandra                    | DynamoDB                   | PostgreSQL               | MongoDB
Architecture      | Masterless ring              | Managed (hidden)           | Primary/Replica          | Primary/Secondary
Consistency       | Tunable (AP default)         | Tunable (eventual/strong)  | Strong (ACID)            | Tunable (eventual/strong)
Scale model       | Horizontal (add nodes)       | Automatic (managed)        | Vertical + read replicas | Horizontal (sharding)
Write throughput  | Excellent (linear scale)     | Excellent (managed)        | Good (single primary)    | Good (sharded)
Query flexibility | Low (CQL, no joins)          | Low (key-value, no joins)  | High (full SQL, joins)   | Medium (rich queries, no joins)
Operations        | Complex (self-managed)       | Zero (fully managed)       | Moderate                 | Moderate
Multi-DC          | Native (active-active)       | Global Tables              | Logical replication      | Atlas Global Clusters
Cost at scale     | Low (open source + infra)    | High (per-request pricing) | Moderate                 | Moderate to high
Best for          | Write-heavy, known patterns  | Serverless, AWS-native     | Complex queries, ACID    | Flexible schema, moderate scale

When to Choose Cassandra

Cassandra Is the Right Choice When

  • āœ… Write-heavy workloads (IoT, time-series, event logging, messaging)
  • āœ… Multi-DC active-active is required (global presence, zero-downtime DR)
  • āœ… Linear scalability needed (double nodes = double throughput)
  • āœ… Access patterns are known and stable (query-first design is acceptable)
  • āœ… High availability is more important than strong consistency
  • āœ… You have operational expertise (or will invest in it)

Cassandra Is the Wrong Choice When

  • āŒAd-hoc queries and analytics are primary use case (use PostgreSQL + analytics DB)
  • āŒStrong consistency across entities is required (use PostgreSQL)
  • āŒTeam lacks distributed systems expertise (use managed service like DynamoDB)
  • āŒSmall scale where operational complexity isn't justified (use PostgreSQL)
  • āŒNeed full-text search (use Elasticsearch alongside Cassandra)
  • āŒFrequent schema changes and evolving access patterns (use MongoDB)
08

Interview Questions

Q: How does Cassandra handle authentication and authorization?

A: Authentication via PasswordAuthenticator (username/password stored in system_auth keyspace, bcrypt hashed). Authorization via CassandraAuthorizer with role-based access control (RBAC). Roles can be granted specific permissions (SELECT, MODIFY, CREATE, etc.) on specific resources (keyspaces, tables). Default credentials (cassandra/cassandra) must be changed immediately in production.

Q: What backup strategies are available for Cassandra?

A: (1) nodetool snapshot: instant hardlink-based local backup (zero-copy, milliseconds). Must be done on every node. (2) Medusa: automated cluster-wide backup to S3/GCS with point-in-time restore. (3) Incremental backup: copies each new SSTable as it's flushed (continuous but complex to manage). (4) Multi-DC replication: best DR strategy — instant failover, zero RPO for infrastructure failures.

Q: Compare Cassandra and DynamoDB — when would you choose each?

A: Choose Cassandra when: you need multi-DC active-active, want to avoid vendor lock-in, have operational expertise, or need to control costs at massive scale (open source). Choose DynamoDB when: you want zero operations (fully managed), are AWS-native, need automatic scaling, or lack distributed systems expertise. DynamoDB is simpler but more expensive at scale and locked to AWS.

Q: How does ScyllaDB achieve better performance than Cassandra?

A: Three key architectural differences: (1) C++ instead of Java — no garbage collection pauses. (2) Shard-per-core (Seastar framework) — each CPU core owns its data independently, no locks or shared state. (3) Userspace I/O scheduling — bypasses kernel I/O scheduler for predictable latency. Result: 5-10x throughput per node with consistent tail latency. Same CQL protocol — existing drivers work unchanged.

Q: What network ports does Cassandra use and how should they be secured?

A: 9042: CQL client connections (expose only to app servers). 7000/7001: inter-node communication (Cassandra nodes only, never external). 7199: JMX/nodetool (admin hosts only). 9160: Thrift (deprecated, disable). Security: use private subnets, security groups restricting each port to minimum required sources. Enable TLS on both client-to-node (9042) and node-to-node (7000) in production.

09

Common Mistakes

šŸ”“

Running production without authentication enabled

Cassandra ships with AllowAllAuthenticator — anyone can connect without credentials. Automated scanners find exposed Cassandra ports within hours and can read/delete all data.

āœ… Enable PasswordAuthenticator and CassandraAuthorizer in cassandra.yaml. Change default credentials immediately. Use network security (firewalls) as defense-in-depth, not the only protection.

šŸ“”

Exposing JMX port (7199) to the network

JMX allows full administrative control — decommission nodes, drop tables, read all data. If exposed without authentication, anyone can destroy the cluster.

āœ… Bind JMX to localhost only (or use JMX authentication). Access via SSH tunnel or bastion host. Never expose 7199 beyond the admin network.

šŸ’¾

Relying only on multi-DC replication for backup

Multi-DC protects against infrastructure failures but not logical corruption. A bad application deploy that writes corrupt data replicates the corruption to all DCs instantly.

āœ… Maintain regular snapshots (daily) uploaded to object storage in addition to multi-DC replication. Snapshots protect against logical corruption, accidental deletes, and bad deploys.

šŸ”

Not enabling node-to-node encryption

Without inter-node TLS, gossip, streaming, and replication traffic is plaintext. An attacker on the network can read all data in transit and potentially inject rogue nodes into the cluster.

āœ… Enable server_encryption_options with internode_encryption: all and require_client_auth: true (mutual TLS). This prevents both eavesdropping and unauthorized nodes joining the cluster.

šŸ“‹

Not testing backup restore procedures

Taking snapshots but never testing restore. When disaster strikes, teams discover their backups are incomplete, corrupted, or the restore process takes longer than expected.

āœ… Quarterly DR drills: restore from backup to a test cluster, verify data integrity by querying sample data, measure actual RTO. Automate the restore process and document it.