Security, Backup & ScyllaDB
Production Cassandra requires authentication, encryption, and backup strategies. Understanding the security model and disaster recovery options is essential for operating at scale.
Authentication & Authorization
By default, Cassandra ships with authentication and authorization disabled: anyone can connect and do anything. Production clusters must enable both. Cassandra uses a role-based access control (RBAC) model where roles can be granted permissions on specific resources.
```yaml
# Enable authentication (default: AllowAllAuthenticator)
authenticator: PasswordAuthenticator

# Enable authorization (default: AllowAllAuthorizer)
authorizer: CassandraAuthorizer

# Role management
role_manager: CassandraRoleManager

# How long cached credentials/permissions stay valid before revalidation
credentials_validity_in_ms: 2000
credentials_update_interval_in_ms: 1000
```
```sql
-- Default superuser (change password immediately!)
-- Username: cassandra, Password: cassandra
ALTER USER cassandra WITH PASSWORD 'new_secure_password';

-- Create application roles
CREATE ROLE app_readwrite WITH PASSWORD = 'strong_pass_123' AND LOGIN = true;
CREATE ROLE app_readonly  WITH PASSWORD = 'strong_pass_456' AND LOGIN = true;
CREATE ROLE admin_role    WITH PASSWORD = 'admin_pass_789'  AND LOGIN = true AND SUPERUSER = true;

-- Grant permissions (one permission per GRANT statement)
GRANT SELECT ON KEYSPACE my_app TO app_readonly;
GRANT SELECT ON KEYSPACE my_app TO app_readwrite;
GRANT MODIFY ON KEYSPACE my_app TO app_readwrite;
GRANT ALL PERMISSIONS ON ALL KEYSPACES TO admin_role;

-- Fine-grained permissions
GRANT SELECT ON TABLE my_app.users TO app_readonly;
GRANT MODIFY ON TABLE my_app.sessions TO app_readwrite;

-- Revoke permissions
REVOKE MODIFY ON KEYSPACE my_app FROM app_readonly;

-- List permissions
LIST ALL PERMISSIONS OF app_readwrite;
```
| Permission | Applies To | Description |
|---|---|---|
| SELECT | Keyspace, Table | Read data |
| MODIFY | Keyspace, Table | INSERT, UPDATE, DELETE |
| CREATE | Keyspace, Table | Create new keyspaces/tables |
| ALTER | Keyspace, Table | Modify schema |
| DROP | Keyspace, Table | Delete keyspaces/tables |
| AUTHORIZE | Any resource | Grant/revoke permissions to others |
| ALL PERMISSIONS | Any resource | All of the above |
Change Default Credentials First
The default superuser (cassandra/cassandra) is well-known. The first step after enabling authentication is changing this password and creating a new superuser role. Then disable or restrict the default cassandra user. Automated scanners actively probe for default Cassandra credentials.
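A minimal sketch of that rotation via cqlsh follows; the new role name (dba_admin) and the passwords are placeholders, not prescribed values.

```bash
# Log in once with the default credentials and create a replacement superuser.
# Role name and passwords below are placeholders.
cqlsh -u cassandra -p cassandra <<'CQL'
CREATE ROLE dba_admin WITH PASSWORD = 'use-a-long-random-secret'
    AND LOGIN = true AND SUPERUSER = true;
CQL

# Reconnect as the new superuser and neutralize the built-in account.
cqlsh -u dba_admin -p 'use-a-long-random-secret' <<'CQL'
ALTER ROLE cassandra WITH PASSWORD = 'another-long-random-secret'
    AND SUPERUSER = false AND LOGIN = false;
CQL
```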
Encryption (TLS & At Rest)
Cassandra supports two types of TLS encryption: client-to-node (protecting data in transit from applications) and node-to-node (protecting inter-cluster communication including gossip, streaming, and replication).
```yaml
# Client-to-node encryption (application -> Cassandra)
client_encryption_options:
  enabled: true
  optional: false                  # Require TLS (reject unencrypted connections)
  keystore: /etc/cassandra/keystore.jks
  keystore_password: changeit
  truststore: /etc/cassandra/truststore.jks
  truststore_password: changeit
  protocol: TLS                    # TLSv1.2 minimum recommended
  cipher_suites:
    - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
    - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
  require_client_auth: false       # Set true for mutual TLS

# Node-to-node encryption (Cassandra <-> Cassandra)
server_encryption_options:
  internode_encryption: all        # Options: none, dc, rack, all
  keystore: /etc/cassandra/keystore.jks
  keystore_password: changeit
  truststore: /etc/cassandra/truststore.jks
  truststore_password: changeit
  require_client_auth: true        # Mutual TLS between nodes
```
| Encryption Type | Protects | Configuration |
|---|---|---|
| Client-to-node TLS | App to Cassandra communication | client_encryption_options |
| Node-to-node TLS | Inter-node gossip, streaming, replication | server_encryption_options |
| At-rest encryption | Data files on disk (SSTables, commit log) | Transparent Data Encryption (TDE) |
Encryption at Rest
Cassandra 5.0+ supports Transparent Data Encryption (TDE) for SSTables and commit logs. For earlier versions, use filesystem-level encryption (LUKS, dm-crypt) or cloud provider encryption (AWS EBS encryption, GCP disk encryption).
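For the filesystem route, a minimal LUKS sketch is shown below; the device name (/dev/nvme1n1), mount point, and filesystem are assumptions, and managing where the passphrase lives is the hard part in practice.

```bash
# Encrypt the data volume with LUKS, then mount it as Cassandra's data directory.
# Device, mapper name, and mount point are illustrative.
cryptsetup luksFormat /dev/nvme1n1                  # prompts for a passphrase
cryptsetup open /dev/nvme1n1 cassandra_data         # unlock -> /dev/mapper/cassandra_data
mkfs.xfs /dev/mapper/cassandra_data                 # XFS or ext4 both work
mount /dev/mapper/cassandra_data /var/lib/cassandra
chown -R cassandra:cassandra /var/lib/cassandra
```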
Encryption Best Practices
- ✅ Enable both client-to-node AND node-to-node TLS in production
- ✅ Use TLSv1.2 or higher; disable SSLv3 and TLSv1.0/1.1
- ✅ Rotate certificates before expiry; automate with cert-manager or similar
- ✅ Use mutual TLS (mTLS) for node-to-node to prevent rogue nodes joining (see the keystore sketch after this list)
- ✅ For at-rest: use filesystem encryption (LUKS) or cloud-native disk encryption
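As referenced above, here is a hedged sketch of building a per-node keystore and a shared truststore with keytool. In production you would sign node certificates with an internal CA rather than trusting individual self-signed certificates; aliases, hostnames, and passwords below are placeholders.

```bash
# 1. Generate a key pair for this node (self-signed, for illustration only).
keytool -genkeypair -keyalg RSA -keysize 2048 -validity 365 \
    -alias node1 -dname "CN=node1.cassandra.internal, OU=db, O=example" \
    -keystore /etc/cassandra/keystore.jks -storepass changeit -keypass changeit

# 2. Export the node's certificate.
keytool -exportcert -alias node1 \
    -keystore /etc/cassandra/keystore.jks -storepass changeit \
    -file node1.cer

# 3. Import the certificate into the truststore that is distributed to every node
#    (and to clients, for client-to-node TLS).
keytool -importcert -noprompt -alias node1 -file node1.cer \
    -keystore /etc/cassandra/truststore.jks -storepass changeit
```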
Network Security
Cassandra uses several network ports for different purposes. Proper firewall configuration is essential: only expose the minimum required ports to the minimum required networks.
| Port | Purpose | Access Scope |
|---|---|---|
| 9042 | CQL native transport (client connections) | Application servers only |
| 9142 | CQL native transport with TLS | Application servers only |
| 7000 | Inter-node communication (gossip, streaming) | Cassandra nodes only |
| 7001 | Inter-node communication with TLS | Cassandra nodes only |
| 7199 | JMX monitoring (nodetool) | Admin/monitoring hosts only |
| 9160 | Thrift (legacy; removed in Cassandra 4.0) | Disable entirely |
Network Security Architecture:

```text
+----------------------------------------------------------+
|  Application Layer (VPC/Subnet A)                         |
|                                                           |
|  [App Server 1]    [App Server 2]    [App Server 3]       |
|         |                 |                 |             |
|         +-----------------+-----------------+             |
|                           |  Port 9042/9142 only          |
+---------------------------+-------------------------------+
                            |
                 (Security Group / Firewall)
                  Allow:
                    9042 <- from app subnet
                    7000 <- from cassandra subnet
                    7199 <- from admin subnet
                            |
+---------------------------+-------------------------------+
|  Cassandra Layer (VPC/Subnet B)                            |
|                                                            |
|  [Cass Node 1] <---> [Cass Node 2] <---> [Cass Node 3]     |
|             Port 7000/7001 (inter-node)                    |
+------------------------------------------------------------+

Rules:
  ✅ 9042: Only from application subnet
  ✅ 7000/7001: Only between Cassandra nodes
  ✅ 7199: Only from admin/monitoring hosts
  ❌ Never expose 7000/7199 to the internet
  ❌ Never expose 9042 to the internet (use VPN/bastion)
```
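One way to express those rules on a single node with ufw is sketched below; the subnet CIDRs are assumptions, and cloud security groups achieve the same result without host firewalls.

```bash
# Default deny, then open each port only to the subnet that needs it.
# CIDRs are illustrative: 10.0.1.0/24 = app subnet, 10.0.2.0/24 = Cassandra subnet,
# 10.0.3.0/24 = admin/monitoring subnet.
ufw default deny incoming
ufw allow from 10.0.1.0/24 to any port 9042 proto tcp   # CQL clients
ufw allow from 10.0.2.0/24 to any port 7000 proto tcp   # inter-node
ufw allow from 10.0.2.0/24 to any port 7001 proto tcp   # inter-node TLS
ufw allow from 10.0.3.0/24 to any port 7199 proto tcp   # JMX / nodetool
ufw enable
```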
Bind Address Configuration
Configure listen_address (for inter-node) and rpc_address (for client connections) to bind to private IPs only. Never bind to 0.0.0.0 in production unless behind a firewall. For multi-DC, use broadcast_address for the public/cross-DC IP and listen_address for the private/intra-DC IP.
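A quick sanity check on one node is sketched below; the IPs shown in the expected output are assumptions for a node with private address 10.0.2.11 and cross-DC address 203.0.113.11.

```bash
# Inspect the address settings; the commented lines show example output only.
grep -E '^(listen_address|broadcast_address|rpc_address|broadcast_rpc_address):' \
    /etc/cassandra/cassandra.yaml
# listen_address: 10.0.2.11            # private IP, inter-node traffic within the DC
# broadcast_address: 203.0.113.11      # address advertised to nodes in other DCs
# rpc_address: 10.0.2.11               # client (CQL) connections
# broadcast_rpc_address: 203.0.113.11  # client address advertised across DCs
```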
Snapshots & Backup
Cassandra snapshots create instant, zero-copy backups using filesystem hardlinks. Because SSTables are immutable, a snapshot is just a set of hardlinks to existing files, so it completes in milliseconds regardless of data size.
Flush Memtables
nodetool snapshot first flushes all memtables to SSTables (ensures all data is on disk)
Create Hardlinks
Creates hardlinks to all SSTable files in a snapshot directory: instant, no data copied
Snapshot Complete
Snapshot directory contains hardlinks. Original SSTables can be compacted without affecting the snapshot.
Upload to Remote Storage
Copy snapshot files to S3/GCS for off-node backup (this is the slow part)
```bash
# Take a snapshot of all keyspaces
nodetool snapshot -t daily_backup_2024_01_15

# Take a snapshot of a specific keyspace
nodetool snapshot my_keyspace -t my_snapshot

# Snapshot location:
#   <data_dir>/<keyspace>/<table>/snapshots/<snapshot_name>/

# List snapshots
nodetool listsnapshots

# Clear a snapshot (free disk space)
nodetool clearsnapshot -t daily_backup_2024_01_15

# Restore from snapshot:
#   1. Stop Cassandra
#   2. Clear commitlog and data directories
#   3. Copy snapshot SSTable files to the table's data directory
#   4. Start Cassandra
#   5. Run nodetool repair to ensure consistency
```
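The upload step is not built in. A minimal sketch using the AWS CLI follows; the bucket name, data directory, and snapshot tag are assumptions, and tools like Medusa automate this (plus manifests and restore) properly.

```bash
SNAPSHOT=daily_backup_2024_01_15
DATA_DIR=/var/lib/cassandra/data
HOST=$(hostname)

# Copy every table's snapshot directory for this tag to S3, preserving layout.
find "$DATA_DIR" -type d -path "*/snapshots/$SNAPSHOT" | while read -r dir; do
    # dir looks like .../<keyspace>/<table>/snapshots/<tag>
    rel=${dir#"$DATA_DIR"/}
    aws s3 sync "$dir" "s3://example-cassandra-backups/$HOST/$rel"
done
```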
Backup Tools
| Tool | Type | Features |
|---|---|---|
| nodetool snapshot | Built-in | Local hardlinks, manual upload to S3 |
| Medusa (Spotify) | Open source | Automated S3/GCS backup, point-in-time restore, cluster-wide coordination |
| Priam (Netflix) | Open source | AWS-focused, automated backup/restore, token management |
| Instaclustr Shotover | Commercial | Continuous backup, minimal RPO |
| Cloud provider snapshots | Infrastructure | EBS/disk snapshots (not Cassandra-aware) |
Snapshots Are Per-Node
A snapshot only captures data on the local node. For a full cluster backup, you must snapshot every node. Tools like Medusa coordinate this across the cluster and upload to object storage. For a restore, you need snapshots covering every token range (at least one replica per range; in practice, snapshot all nodes).
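A rough illustration of that coordination without Medusa, assuming SSH access to every node (hostnames are placeholders):

```bash
TAG="cluster_backup_$(date +%Y_%m_%d)"

# Take the same-named snapshot on every node. Note this is per-node point-in-time,
# not a globally consistent cut; run repair after any restore.
for host in cass-node-1 cass-node-2 cass-node-3; do
    ssh "$host" "nodetool snapshot -t $TAG"
done
```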
Disaster Recovery
Cassandra's multi-DC replication is the primary disaster recovery mechanism. With data replicated across DCs, a full DC failure is survivable without any restore process: the remaining DCs continue serving traffic immediately.
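The replication side of this is just keyspace configuration. A sketch is shown below, assuming two data centers named dc1 and dc2 (the names must match what the snitch reports) and a keyspace called my_app.

```bash
# Replicate the keyspace into both DCs.
cqlsh -e "ALTER KEYSPACE my_app WITH replication = {
            'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3};"

# Then, on each node in the newly added DC, stream its data from the existing DC:
nodetool rebuild -- dc1
```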
| Scenario | Recovery Method | RPO | RTO |
|---|---|---|---|
| Single node failure | Automatic (replicas serve traffic) | 0 | 0 (instant) |
| Rack failure | Automatic (cross-rack replicas) | 0 | 0 (instant) |
| Full DC failure (multi-DC) | Automatic (other DCs serve traffic) | ~100ms | 0 (instant) |
| Full DC failure (single-DC) | Restore from backup | Minutes to hours | Hours |
| Data corruption (logical) | Point-in-time restore from backup | Depends on backup frequency | Hours |
| Accidental deletion | Restore specific table from snapshot | Last snapshot time | Minutes to hours |
```text
Disaster Recovery Strategy (Multi-DC)

Primary defense: Multi-DC replication
  - RF=3 in each of 2+ DCs
  - LOCAL_QUORUM ensures each DC is self-sufficient
  - DC failure = automatic failover (no action needed)

Secondary defense: Regular backups
  - Daily snapshots uploaded to S3/GCS (cross-region)
  - Medusa for automated, coordinated cluster backup
  - Retain 7-30 days of snapshots

Recovery procedures:
  1. Single node: replace with new node, run repair
  2. Multiple nodes: replace, rebuild from remaining replicas
  3. Full DC: add new DC, run nodetool rebuild
  4. Logical corruption: restore from last known-good snapshot
  5. Complete loss: restore all nodes from backup, repair

Testing:
  - Quarterly DR drills (restore from backup to test cluster)
  - Chaos engineering (kill nodes/racks, verify automatic recovery)
  - Validate backup integrity (restore and query sample data)
```
Multi-DC Is Your Best DR
If you can afford it, multi-DC replication is far superior to backup/restore for DR. Recovery is instant (zero RTO), data loss is minimal (sub-second RPO), and no manual intervention is needed. Backups are still necessary for logical corruption (bad data written to all replicas) but not for infrastructure failures.
ScyllaDB ā The C++ Alternative
ScyllaDB is a ground-up C++ rewrite of Cassandra built on the Seastar framework. It eliminates the JVM entirely, using a shard-per-core architecture where each CPU core operates independently with its own memory, I/O, and data. The result is 5-10x higher throughput with consistent, low tail latency.
ScyllaDB Key Advantages
- ✅ No GC pauses: C++ with manual memory management, no stop-the-world events
- ✅ Shard-per-core: each CPU core is independent, no locks, no shared state
- ✅ 5-10x throughput per node: fewer nodes needed for the same workload
- ✅ Consistent tail latency: p99 stays low without GC spikes
- ✅ Drop-in compatible: same CQL, same drivers, same tools (nodetool equivalent: scylla-tools)
- ✅ Automatic tuning: self-optimizing I/O scheduler and memory allocation
```text
Migrating from Cassandra to ScyllaDB

1. Schema:        Export with cqlsh, import directly (100% CQL compatible)
2. Data:          Use Spark migrator, sstableloader, or ScyllaDB Migrator
3. Drivers:       Same Cassandra drivers work unchanged
4. Operations:    scylla replaces the cassandra process;
                  nodetool works the same (or use scylla-tools)
5. Configuration: scylla.yaml is similar to cassandra.yaml

Key differences in operation:
  - No JVM tuning needed (no heap, no GC configuration)
  - CPU pinning is automatic (shard-per-core)
  - I/O scheduler is self-tuning
  - Compaction runs per-shard (parallel, no global lock)
  - Repair is faster (parallel, per-shard)

Typical migration result:
  Cassandra: 12 nodes, i3.2xlarge, 8 GB heap each
  ScyllaDB:   3 nodes, i3.2xlarge (same hardware, 4x fewer nodes)
  Same throughput, lower p99 latency
```
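A hedged sketch of the sstableloader path mentioned above, assuming a snapshot named migration_snap of the table my_app.users taken on a Cassandra node, and a ScyllaDB node reachable at 10.0.5.10:

```bash
# sstableloader streams SSTables to whichever cluster owns the token ranges.
# It expects a directory laid out as <keyspace>/<table> containing SSTable files.
mkdir -p /tmp/load/my_app/users
cp /var/lib/cassandra/data/my_app/users-*/snapshots/migration_snap/* /tmp/load/my_app/users/

# Point the loader at the target (ScyllaDB) cluster.
sstableloader -d 10.0.5.10 /tmp/load/my_app/users
```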
Choose ScyllaDB When
- Tail latency (p99) is critical
- Want fewer nodes (lower infrastructure cost)
- GC pauses are causing issues in Cassandra
- Starting a new project (no migration needed)
- Need higher throughput per node
Stay with Cassandra When
- Large existing Cassandra investment
- Team has deep Cassandra expertise
- Need Apache governance / community
- Using Cassandra-specific features (MVs, SASI)
- Prefer fully open-source (no enterprise tier)
Cassandra vs Alternatives
Choosing the right database depends on your access patterns, consistency requirements, operational capabilities, and scale. Here's how Cassandra compares to common alternatives.
| Aspect | Cassandra | DynamoDB | PostgreSQL | MongoDB |
|---|---|---|---|---|
| Architecture | Masterless ring | Managed (hidden) | Primary/Replica | Primary/Secondary |
| Consistency | Tunable (AP default) | Tunable (eventual/strong) | Strong (ACID) | Tunable (eventual/strong) |
| Scale model | Horizontal (add nodes) | Automatic (managed) | Vertical + read replicas | Horizontal (sharding) |
| Write throughput | Excellent (linear scale) | Excellent (managed) | Good (single primary) | Good (sharded) |
| Query flexibility | Low (CQL, no joins) | Low (key-value, no joins) | High (full SQL, joins) | Medium (rich queries, no joins) |
| Operations | Complex (self-managed) | Zero (fully managed) | Moderate | Moderate |
| Multi-DC | Native (active-active) | Global Tables | Logical replication | Atlas Global Clusters |
| Cost at scale | Low (open source + infra) | High (per-request pricing) | Moderate | Moderate to high |
| Best for | Write-heavy, known patterns | Serverless, AWS-native | Complex queries, ACID | Flexible schema, moderate scale |
When to Choose Cassandra
Cassandra Is the Right Choice When
- ✅ Write-heavy workloads (IoT, time-series, event logging, messaging)
- ✅ Multi-DC active-active is required (global presence, zero-downtime DR)
- ✅ Linear scalability needed (double nodes = double throughput)
- ✅ Access patterns are known and stable (query-first design is acceptable)
- ✅ High availability is more important than strong consistency
- ✅ You have operational expertise (or will invest in it)
Cassandra Is the Wrong Choice When
- ❌ Ad-hoc queries and analytics are the primary use case (use PostgreSQL + an analytics DB)
- ❌ Strong consistency across entities is required (use PostgreSQL)
- ❌ Team lacks distributed systems expertise (use a managed service like DynamoDB)
- ❌ Small scale where operational complexity isn't justified (use PostgreSQL)
- ❌ Need full-text search (use Elasticsearch alongside Cassandra)
- ❌ Frequent schema changes and evolving access patterns (use MongoDB)
Interview Questions
Q: How does Cassandra handle authentication and authorization?
A: Authentication via PasswordAuthenticator (username/password stored in system_auth keyspace, bcrypt hashed). Authorization via CassandraAuthorizer with role-based access control (RBAC). Roles can be granted specific permissions (SELECT, MODIFY, CREATE, etc.) on specific resources (keyspaces, tables). Default credentials (cassandra/cassandra) must be changed immediately in production.
Q: What backup strategies are available for Cassandra?
A: (1) nodetool snapshot: instant hardlink-based local backup (zero-copy, milliseconds). Must be done on every node. (2) Medusa: automated cluster-wide backup to S3/GCS with point-in-time restore. (3) Incremental backup: copies each new SSTable as it's flushed (continuous but complex to manage). (4) Multi-DC replication: the best DR strategy, with instant failover and zero RPO for infrastructure failures.
Q: Compare Cassandra and DynamoDB. When would you choose each?
A: Choose Cassandra when: you need multi-DC active-active, want to avoid vendor lock-in, have operational expertise, or need to control costs at massive scale (open source). Choose DynamoDB when: you want zero operations (fully managed), are AWS-native, need automatic scaling, or lack distributed systems expertise. DynamoDB is simpler but more expensive at scale and locked to AWS.
Q: How does ScyllaDB achieve better performance than Cassandra?
A: Three key architectural differences: (1) C++ instead of Java, so no garbage collection pauses. (2) Shard-per-core (Seastar framework): each CPU core owns its data independently, with no locks or shared state. (3) Userspace I/O scheduling bypasses the kernel I/O scheduler for predictable latency. Result: 5-10x throughput per node with consistent tail latency. Same CQL protocol, so existing drivers work unchanged.
Q: What network ports does Cassandra use and how should they be secured?
A: 9042: CQL client connections (expose only to app servers). 7000/7001: inter-node communication (Cassandra nodes only, never external). 7199: JMX/nodetool (admin hosts only). 9160: Thrift (deprecated, disable). Security: use private subnets, security groups restricting each port to minimum required sources. Enable TLS on both client-to-node (9042) and node-to-node (7000) in production.
Common Mistakes
Running production without authentication enabled
Cassandra ships with AllowAllAuthenticator, so anyone can connect without credentials. Automated scanners find exposed Cassandra ports within hours and can read or delete all data.
✅ Enable PasswordAuthenticator and CassandraAuthorizer in cassandra.yaml. Change default credentials immediately. Use network security (firewalls) as defense-in-depth, not the only protection.
Exposing JMX port (7199) to the network
JMX allows full administrative control: decommissioning nodes, dropping tables, reading all data. If exposed without authentication, anyone can destroy the cluster.
✅ Bind JMX to localhost only (or use JMX authentication). Access via SSH tunnel or bastion host. Never expose 7199 beyond the admin network.
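A quick check that the JMX port is actually local-only is sketched below; the LOCAL_JMX switch exists in the stock cassandra-env.sh of recent versions, but treat the exact mechanism as version-dependent.

```bash
# cassandra-env.sh binds JMX to 127.0.0.1 when LOCAL_JMX is left at "yes".
grep -n 'LOCAL_JMX' /etc/cassandra/cassandra-env.sh

# Verify nothing is listening on 7199 on a routable interface.
ss -ltn | grep 7199     # expect 127.0.0.1:7199, not 0.0.0.0:7199 or a public IP
```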
Relying only on multi-DC replication for backup
Multi-DC protects against infrastructure failures but not logical corruption. A bad application deploy that writes corrupt data replicates the corruption to all DCs instantly.
✅ Maintain regular snapshots (daily) uploaded to object storage in addition to multi-DC replication. Snapshots protect against logical corruption, accidental deletes, and bad deploys.
Not enabling node-to-node encryption
Without inter-node TLS, gossip, streaming, and replication traffic is plaintext. An attacker on the network can read all data in transit and potentially inject rogue nodes into the cluster.
✅ Enable server_encryption_options with internode_encryption: all and require_client_auth: true (mutual TLS). This prevents both eavesdropping and unauthorized nodes joining the cluster.
Not testing backup restore procedures
Teams take snapshots but never test the restore. When disaster strikes, they discover their backups are incomplete or corrupted, or that the restore takes far longer than expected.
✅ Quarterly DR drills: restore from backup to a test cluster, verify data integrity by querying sample data, measure actual RTO. Automate the restore process and document it, as sketched below.
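One way to keep that drill honest, sketched under assumptions: an isolated test node, a previously downloaded snapshot for my_app.users staged under /restore, and a known expected row count to compare against.

```bash
# On the test node: drop the snapshot files into the live table directory,
# tell Cassandra to pick them up, and spot-check the data.
cp /restore/my_app/users/* /var/lib/cassandra/data/my_app/users-*/

nodetool refresh my_app users                     # load the newly placed SSTables
# Full-table counts can time out on large tables; use a bounded check in real drills.
cqlsh -e "SELECT count(*) FROM my_app.users;"     # compare against the expected count
```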