Security, Backup & ScyllaDB
Production Cassandra requires authentication, encryption, and backup strategies. Understanding the security model and disaster recovery options is essential for operating at scale.
Authentication & Authorization
By default, Cassandra ships with authentication and authorization disabled: anyone can connect and do anything. Production clusters must enable both. Cassandra uses a role-based access control (RBAC) model where roles can be granted permissions on specific resources.
```yaml
# Enable authentication (default: AllowAllAuthenticator)
authenticator: PasswordAuthenticator

# Enable authorization (default: AllowAllAuthorizer)
authorizer: CassandraAuthorizer

# Role management
role_manager: CassandraRoleManager

# How long cached credentials/permissions stay valid before revalidation
credentials_validity_in_ms: 2000
credentials_update_interval_in_ms: 1000
```
```sql
-- Default superuser (change password immediately!)
-- Username: cassandra, Password: cassandra
ALTER USER cassandra WITH PASSWORD 'new_secure_password';

-- Create application roles
CREATE ROLE app_readwrite WITH PASSWORD = 'strong_pass_123' AND LOGIN = true;
CREATE ROLE app_readonly  WITH PASSWORD = 'strong_pass_456' AND LOGIN = true;
CREATE ROLE admin_role    WITH PASSWORD = 'admin_pass_789'  AND LOGIN = true AND SUPERUSER = true;

-- Grant permissions (one permission per GRANT statement)
GRANT SELECT ON KEYSPACE my_app TO app_readonly;
GRANT SELECT ON KEYSPACE my_app TO app_readwrite;
GRANT MODIFY ON KEYSPACE my_app TO app_readwrite;
GRANT ALL PERMISSIONS ON ALL KEYSPACES TO admin_role;

-- Fine-grained permissions
GRANT SELECT ON TABLE my_app.users TO app_readonly;
GRANT MODIFY ON TABLE my_app.sessions TO app_readwrite;

-- Revoke permissions
REVOKE MODIFY ON KEYSPACE my_app FROM app_readonly;

-- List permissions
LIST ALL PERMISSIONS OF app_readwrite;
```
| Permission | Applies To | Description |
|---|---|---|
| SELECT | Keyspace, Table | Read data |
| MODIFY | Keyspace, Table | INSERT, UPDATE, DELETE |
| CREATE | Keyspace, Table | Create new keyspaces/tables |
| ALTER | Keyspace, Table | Modify schema |
| DROP | Keyspace, Table | Delete keyspaces/tables |
| AUTHORIZE | Any resource | Grant/revoke permissions to others |
| ALL PERMISSIONS | Any resource | All of the above |
Change Default Credentials First
The default superuser (cassandra/cassandra) is well-known. The first step after enabling authentication is changing this password and creating a new superuser role. Then disable or restrict the default cassandra user. Automated scanners actively probe for default Cassandra credentials.
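A minimal sketch of that rotation via cqlsh follows; the new role name (dba_admin) and the passwords are placeholders, not prescribed values.

```bash
# Log in once with the default credentials and create a replacement superuser.
# Role name and passwords below are placeholders.
cqlsh -u cassandra -p cassandra <<'CQL'
CREATE ROLE dba_admin WITH PASSWORD = 'use-a-long-random-secret'
    AND LOGIN = true AND SUPERUSER = true;
CQL

# Reconnect as the new superuser and neutralize the built-in account.
cqlsh -u dba_admin -p 'use-a-long-random-secret' <<'CQL'
ALTER ROLE cassandra WITH PASSWORD = 'another-long-random-secret'
    AND SUPERUSER = false AND LOGIN = false;
CQL
```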
Encryption (TLS & At Rest)
Cassandra supports two types of TLS encryption: client-to-node (protecting data in transit from applications) and node-to-node (protecting inter-cluster communication including gossip, streaming, and replication).
```yaml
# Client-to-node encryption (application -> Cassandra)
client_encryption_options:
  enabled: true
  optional: false                  # Require TLS (reject unencrypted connections)
  keystore: /etc/cassandra/keystore.jks
  keystore_password: changeit
  truststore: /etc/cassandra/truststore.jks
  truststore_password: changeit
  protocol: TLS                    # TLSv1.2 minimum recommended
  cipher_suites:
    - TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384
    - TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
  require_client_auth: false       # Set true for mutual TLS

# Node-to-node encryption (Cassandra <-> Cassandra)
server_encryption_options:
  internode_encryption: all        # Options: none, dc, rack, all
  keystore: /etc/cassandra/keystore.jks
  keystore_password: changeit
  truststore: /etc/cassandra/truststore.jks
  truststore_password: changeit
  require_client_auth: true        # Mutual TLS between nodes
```
| Encryption Type | Protects | Configuration |
|---|---|---|
| Client-to-node TLS | App to Cassandra communication | client_encryption_options |
| Node-to-node TLS | Inter-node gossip, streaming, replication | server_encryption_options |
| At-rest encryption | Data files on disk (SSTables, commit log) | Transparent Data Encryption (TDE) |
Encryption at Rest
Cassandra 5.0+ supports Transparent Data Encryption (TDE) for SSTables and commit logs. For earlier versions, use filesystem-level encryption (LUKS, dm-crypt) or cloud provider encryption (AWS EBS encryption, GCP disk encryption).
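For the filesystem route, a minimal LUKS sketch is shown below; the device name (/dev/nvme1n1), mount point, and filesystem are assumptions, and managing where the passphrase lives is the hard part in practice.

```bash
# Encrypt the data volume with LUKS, then mount it as Cassandra's data directory.
# Device, mapper name, and mount point are illustrative.
cryptsetup luksFormat /dev/nvme1n1                  # prompts for a passphrase
cryptsetup open /dev/nvme1n1 cassandra_data         # unlock -> /dev/mapper/cassandra_data
mkfs.xfs /dev/mapper/cassandra_data                 # XFS or ext4 both work
mount /dev/mapper/cassandra_data /var/lib/cassandra
chown -R cassandra:cassandra /var/lib/cassandra
```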
Encryption Best Practices
- ✅ Enable both client-to-node AND node-to-node TLS in production
- ✅ Use TLSv1.2 or higher; disable SSLv3 and TLSv1.0/1.1
- ✅ Rotate certificates before expiry; automate with cert-manager or similar
- ✅ Use mutual TLS (mTLS) for node-to-node to prevent rogue nodes joining (see the keystore sketch after this list)
- ✅ For at-rest: use filesystem encryption (LUKS) or cloud-native disk encryption
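As referenced above, here is a hedged sketch of building a per-node keystore and a shared truststore with keytool. In production you would sign node certificates with an internal CA rather than trusting individual self-signed certificates; aliases, hostnames, and passwords below are placeholders.

```bash
# 1. Generate a key pair for this node (self-signed, for illustration only).
keytool -genkeypair -keyalg RSA -keysize 2048 -validity 365 \
    -alias node1 -dname "CN=node1.cassandra.internal, OU=db, O=example" \
    -keystore /etc/cassandra/keystore.jks -storepass changeit -keypass changeit

# 2. Export the node's certificate.
keytool -exportcert -alias node1 \
    -keystore /etc/cassandra/keystore.jks -storepass changeit \
    -file node1.cer

# 3. Import the certificate into the truststore that is distributed to every node
#    (and to clients, for client-to-node TLS).
keytool -importcert -noprompt -alias node1 -file node1.cer \
    -keystore /etc/cassandra/truststore.jks -storepass changeit
```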
Network Security
Cassandra uses several network ports for different purposes. Proper firewall configuration is essential: only expose the minimum required ports to the minimum required networks.
| Port | Purpose | Access Scope |
|---|---|---|
| 9042 | CQL native transport (client connections) | Application servers only |
| 9142 | CQL native transport with TLS | Application servers only |
| 7000 | Inter-node communication (gossip, streaming) | Cassandra nodes only |
| 7001 | Inter-node communication with TLS | Cassandra nodes only |
| 7199 | JMX monitoring (nodetool) | Admin/monitoring hosts only |
| 9160 | Thrift (legacy; removed in Cassandra 4.0) | Disable entirely |
Network Security Architecture:

```text
+----------------------------------------------------------+
|  Application Layer (VPC/Subnet A)                         |
|                                                           |
|  [App Server 1]    [App Server 2]    [App Server 3]       |
|         |                 |                 |             |
|         +-----------------+-----------------+             |
|                           |  Port 9042/9142 only          |
+---------------------------+-------------------------------+
                            |
                 (Security Group / Firewall)
                  Allow:
                    9042 <- from app subnet
                    7000 <- from cassandra subnet
                    7199 <- from admin subnet
                            |
+---------------------------+-------------------------------+
|  Cassandra Layer (VPC/Subnet B)                            |
|                                                            |
|  [Cass Node 1] <---> [Cass Node 2] <---> [Cass Node 3]     |
|             Port 7000/7001 (inter-node)                    |
+------------------------------------------------------------+

Rules:
  ✅ 9042: Only from application subnet
  ✅ 7000/7001: Only between Cassandra nodes
  ✅ 7199: Only from admin/monitoring hosts
  ❌ Never expose 7000/7199 to the internet
  ❌ Never expose 9042 to the internet (use VPN/bastion)
```
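One way to express those rules on a single node with ufw is sketched below; the subnet CIDRs are assumptions, and cloud security groups achieve the same result without host firewalls.

```bash
# Default deny, then open each port only to the subnet that needs it.
# CIDRs are illustrative: 10.0.1.0/24 = app subnet, 10.0.2.0/24 = Cassandra subnet,
# 10.0.3.0/24 = admin/monitoring subnet.
ufw default deny incoming
ufw allow from 10.0.1.0/24 to any port 9042 proto tcp   # CQL clients
ufw allow from 10.0.2.0/24 to any port 7000 proto tcp   # inter-node
ufw allow from 10.0.2.0/24 to any port 7001 proto tcp   # inter-node TLS
ufw allow from 10.0.3.0/24 to any port 7199 proto tcp   # JMX / nodetool
ufw enable
```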
Bind Address Configuration
Configure listen_address (for inter-node) and rpc_address (for client connections) to bind to private IPs only. Never bind to 0.0.0.0 in production unless behind a firewall. For multi-DC, use broadcast_address for the public/cross-DC IP and listen_address for the private/intra-DC IP.
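A quick sanity check on one node is sketched below; the IPs shown in the expected output are assumptions for a node with private address 10.0.2.11 and cross-DC address 203.0.113.11.

```bash
# Inspect the address settings; the commented lines show example output only.
grep -E '^(listen_address|broadcast_address|rpc_address|broadcast_rpc_address):' \
    /etc/cassandra/cassandra.yaml
# listen_address: 10.0.2.11            # private IP, inter-node traffic within the DC
# broadcast_address: 203.0.113.11      # address advertised to nodes in other DCs
# rpc_address: 10.0.2.11               # client (CQL) connections
# broadcast_rpc_address: 203.0.113.11  # client address advertised across DCs
```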
Snapshots & Backup
Cassandra snapshots create instant, zero-copy backups using filesystem hardlinks. Because SSTables are immutable, a snapshot is just a set of hardlinks to existing files, so it completes in milliseconds regardless of data size.
Flush Memtables
nodetool snapshot first flushes all memtables to SSTables (ensures all data is on disk)
Create Hardlinks
Creates hardlinks to all SSTable files in a snapshot directory: instant, no data copied
Snapshot Complete
Snapshot directory contains hardlinks. Original SSTables can be compacted without affecting the snapshot.
Upload to Remote Storage
Copy snapshot files to S3/GCS for off-node backup (this is the slow part)
```bash
# Take a snapshot of all keyspaces
nodetool snapshot -t daily_backup_2024_01_15

# Take a snapshot of a specific keyspace
nodetool snapshot my_keyspace -t my_snapshot

# Snapshot location:
#   <data_dir>/<keyspace>/<table>/snapshots/<snapshot_name>/

# List snapshots
nodetool listsnapshots

# Clear a snapshot (free disk space)
nodetool clearsnapshot -t daily_backup_2024_01_15

# Restore from snapshot:
#   1. Stop Cassandra
#   2. Clear commitlog and data directories
#   3. Copy snapshot SSTable files to the table's data directory
#   4. Start Cassandra
#   5. Run nodetool repair to ensure consistency
```
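The upload step is not built in. A minimal sketch using the AWS CLI follows; the bucket name, data directory, and snapshot tag are assumptions, and tools like Medusa automate this (plus manifests and restore) properly.

```bash
SNAPSHOT=daily_backup_2024_01_15
DATA_DIR=/var/lib/cassandra/data
HOST=$(hostname)

# Copy every table's snapshot directory for this tag to S3, preserving layout.
find "$DATA_DIR" -type d -path "*/snapshots/$SNAPSHOT" | while read -r dir; do
    # dir looks like .../<keyspace>/<table>/snapshots/<tag>
    rel=${dir#"$DATA_DIR"/}
    aws s3 sync "$dir" "s3://example-cassandra-backups/$HOST/$rel"
done
```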
Backup Tools
| Tool | Type | Features |
|---|---|---|
| nodetool snapshot | Built-in | Local hardlinks, manual upload to S3 |
| Medusa (Spotify) | Open source | Automated S3/GCS backup, point-in-time restore, cluster-wide coordination |
| Priam (Netflix) | Open source | AWS-focused, automated backup/restore, token management |
| Instaclustr Shotover | Commercial | Continuous backup, minimal RPO |
| Cloud provider snapshots | Infrastructure | EBS/disk snapshots (not Cassandra-aware) |
Snapshots Are Per-Node
A snapshot only captures data on the local node. For a full cluster backup, you must snapshot every node. Tools like Medusa coordinate this across the cluster and upload to object storage. For a restore, you need snapshots covering every token range (at least one replica per range; in practice, snapshot all nodes).
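A rough illustration of that coordination without Medusa, assuming SSH access to every node (hostnames are placeholders):

```bash
TAG="cluster_backup_$(date +%Y_%m_%d)"

# Take the same-named snapshot on every node. Note this is per-node point-in-time,
# not a globally consistent cut; run repair after any restore.
for host in cass-node-1 cass-node-2 cass-node-3; do
    ssh "$host" "nodetool snapshot -t $TAG"
done
```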
Disaster Recovery
Cassandra's multi-DC replication is the primary disaster recovery mechanism. With data replicated across DCs, a full DC failure is survivable without any restore process: the remaining DCs continue serving traffic immediately.
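The replication side of this is just keyspace configuration. A sketch is shown below, assuming two data centers named dc1 and dc2 (the names must match what the snitch reports) and a keyspace called my_app.

```bash
# Replicate the keyspace into both DCs.
cqlsh -e "ALTER KEYSPACE my_app WITH replication = {
            'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3};"

# Then, on each node in the newly added DC, stream its data from the existing DC:
nodetool rebuild -- dc1
```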
| Scenario | Recovery Method | RPO | RTO |
|---|---|---|---|
| Single node failure | Automatic (replicas serve traffic) | 0 | 0 (instant) |
| Rack failure | Automatic (cross-rack replicas) | 0 | 0 (instant) |
| Full DC failure (multi-DC) | Automatic (other DCs serve traffic) | ~100ms | 0 (instant) |
| Full DC failure (single-DC) | Restore from backup | Minutes to hours | Hours |
| Data corruption (logical) | Point-in-time restore from backup | Depends on backup frequency | Hours |
| Accidental deletion | Restore specific table from snapshot | Last snapshot time | Minutes to hours |
```text
Disaster Recovery Strategy (Multi-DC)

Primary defense: Multi-DC replication
  - RF=3 in each of 2+ DCs
  - LOCAL_QUORUM ensures each DC is self-sufficient
  - DC failure = automatic failover (no action needed)

Secondary defense: Regular backups
  - Daily snapshots uploaded to S3/GCS (cross-region)
  - Medusa for automated, coordinated cluster backup
  - Retain 7-30 days of snapshots

Recovery procedures:
  1. Single node: replace with new node, run repair
  2. Multiple nodes: replace, rebuild from remaining replicas
  3. Full DC: add new DC, run nodetool rebuild
  4. Logical corruption: restore from last known-good snapshot
  5. Complete loss: restore all nodes from backup, repair

Testing:
  - Quarterly DR drills (restore from backup to test cluster)
  - Chaos engineering (kill nodes/racks, verify automatic recovery)
  - Validate backup integrity (restore and query sample data)
```
Multi-DC Is Your Best DR
If you can afford it, multi-DC replication is far superior to backup/restore for DR. Recovery is instant (zero RTO), data loss is minimal (sub-second RPO), and no manual intervention is needed. Backups are still necessary for logical corruption (bad data written to all replicas) but not for infrastructure failures.
ScyllaDB ā The C++ Alternative
ScyllaDB is a ground-up C++ rewrite of Cassandra built on the Seastar framework. It eliminates the JVM entirely, using a shard-per-core architecture where each CPU core operates independently with its own memory, I/O, and data. The result is 5-10x higher throughput with consistent, low tail latency.
ScyllaDB Key Advantages
- ✅ No GC pauses: C++ with manual memory management, no stop-the-world events
- ✅ Shard-per-core: each CPU core is independent, no locks, no shared state
- ✅ 5-10x throughput per node: fewer nodes needed for the same workload
- ✅ Consistent tail latency: p99 stays low without GC spikes
- ✅ Drop-in compatible: same CQL, same drivers, same tools (nodetool equivalent: scylla-tools)
- ✅ Automatic tuning: self-optimizing I/O scheduler and memory allocation
```text
Migrating from Cassandra to ScyllaDB

1. Schema:        Export with cqlsh, import directly (100% CQL compatible)
2. Data:          Use Spark migrator, sstableloader, or ScyllaDB Migrator
3. Drivers:       Same Cassandra drivers work unchanged
4. Operations:    scylla replaces the cassandra process;
                  nodetool works the same (or use scylla-tools)
5. Configuration: scylla.yaml is similar to cassandra.yaml

Key differences in operation:
  - No JVM tuning needed (no heap, no GC configuration)
  - CPU pinning is automatic (shard-per-core)
  - I/O scheduler is self-tuning
  - Compaction runs per-shard (parallel, no global lock)
  - Repair is faster (parallel, per-shard)

Typical migration result:
  Cassandra: 12 nodes, i3.2xlarge, 8 GB heap each
  ScyllaDB:   3 nodes, i3.2xlarge (same hardware, 4x fewer nodes)
  Same throughput, lower p99 latency
```
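A hedged sketch of the sstableloader path mentioned above, assuming a snapshot named migration_snap of the table my_app.users taken on a Cassandra node, and a ScyllaDB node reachable at 10.0.5.10:

```bash
# sstableloader streams SSTables to whichever cluster owns the token ranges.
# It expects a directory laid out as <keyspace>/<table> containing SSTable files.
mkdir -p /tmp/load/my_app/users
cp /var/lib/cassandra/data/my_app/users-*/snapshots/migration_snap/* /tmp/load/my_app/users/

# Point the loader at the target (ScyllaDB) cluster.
sstableloader -d 10.0.5.10 /tmp/load/my_app/users
```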
Choose ScyllaDB When
- Tail latency (p99) is critical
- Want fewer nodes (lower infrastructure cost)
- GC pauses are causing issues in Cassandra
- Starting a new project (no migration needed)
- Need higher throughput per node
Stay with Cassandra When
- Large existing Cassandra investment
- Team has deep Cassandra expertise
- Need Apache governance / community
- Using Cassandra-specific features (MVs, SASI)
- Prefer fully open-source (no enterprise tier)
Cassandra vs Alternatives
Choosing the right database depends on your access patterns, consistency requirements, operational capabilities, and scale. Here's how Cassandra compares to common alternatives.
| Aspect | Cassandra | DynamoDB | PostgreSQL | MongoDB |
|---|---|---|---|---|
| Architecture | Masterless ring | Managed (hidden) | Primary/Replica | Primary/Secondary |
| Consistency | Tunable (AP default) | Tunable (eventual/strong) | Strong (ACID) | Tunable (eventual/strong) |
| Scale model | Horizontal (add nodes) | Automatic (managed) | Vertical + read replicas | Horizontal (sharding) |
| Write throughput | Excellent (linear scale) | Excellent (managed) | Good (single primary) | Good (sharded) |
| Query flexibility | Low (CQL, no joins) | Low (key-value, no joins) | High (full SQL, joins) | Medium (rich queries, no joins) |
| Operations | Complex (self-managed) | Zero (fully managed) | Moderate | Moderate |
| Multi-DC | Native (active-active) | Global Tables | Logical replication | Atlas Global Clusters |
| Cost at scale | Low (open source + infra) | High (per-request pricing) | Moderate | Moderate to high |
| Best for | Write-heavy, known patterns | Serverless, AWS-native | Complex queries, ACID | Flexible schema, moderate scale |
When to Choose Cassandra
Cassandra Is the Right Choice When
- ✅ Write-heavy workloads (IoT, time-series, event logging, messaging)
- ✅ Multi-DC active-active is required (global presence, zero-downtime DR)
- ✅ Linear scalability needed (double nodes = double throughput)
- ✅ Access patterns are known and stable (query-first design is acceptable)
- ✅ High availability is more important than strong consistency
- ✅ You have operational expertise (or will invest in it)
Cassandra Is the Wrong Choice When
- ❌ Ad-hoc queries and analytics are the primary use case (use PostgreSQL + an analytics DB)
- ❌ Strong consistency across entities is required (use PostgreSQL)
- ❌ Team lacks distributed systems expertise (use a managed service like DynamoDB)
- ❌ Small scale where operational complexity isn't justified (use PostgreSQL)
- ❌ Need full-text search (use Elasticsearch alongside Cassandra)
- ❌ Frequent schema changes and evolving access patterns (use MongoDB)
Interview Questions
Q: How does Cassandra handle authentication and authorization?
A: Authentication via PasswordAuthenticator (username/password stored in system_auth keyspace, bcrypt hashed). Authorization via CassandraAuthorizer with role-based access control (RBAC). Roles can be granted specific permissions (SELECT, MODIFY, CREATE, etc.) on specific resources (keyspaces, tables). Default credentials (cassandra/cassandra) must be changed immediately in production.
Q: What backup strategies are available for Cassandra?
A: (1) nodetool snapshot: instant hardlink-based local backup (zero-copy, milliseconds). Must be done on every node. (2) Medusa: automated cluster-wide backup to S3/GCS with point-in-time restore. (3) Incremental backup: copies each new SSTable as it's flushed (continuous but complex to manage). (4) Multi-DC replication: the best DR strategy, with instant failover and zero RPO for infrastructure failures.
Q: Compare Cassandra and DynamoDB. When would you choose each?
A: Choose Cassandra when: you need multi-DC active-active, want to avoid vendor lock-in, have operational expertise, or need to control costs at massive scale (open source). Choose DynamoDB when: you want zero operations (fully managed), are AWS-native, need automatic scaling, or lack distributed systems expertise. DynamoDB is simpler but more expensive at scale and locked to AWS.
Q: How does ScyllaDB achieve better performance than Cassandra?
A: Three key architectural differences: (1) C++ instead of Java, so no garbage collection pauses. (2) Shard-per-core (Seastar framework): each CPU core owns its data independently, with no locks or shared state. (3) Userspace I/O scheduling bypasses the kernel I/O scheduler for predictable latency. Result: 5-10x throughput per node with consistent tail latency. Same CQL protocol, so existing drivers work unchanged.
Q: What network ports does Cassandra use and how should they be secured?
A: 9042: CQL client connections (expose only to app servers). 7000/7001: inter-node communication (Cassandra nodes only, never external). 7199: JMX/nodetool (admin hosts only). 9160: Thrift (deprecated, disable). Security: use private subnets, security groups restricting each port to minimum required sources. Enable TLS on both client-to-node (9042) and node-to-node (7000) in production.
Common Mistakes
Running production without authentication enabled
Cassandra ships with AllowAllAuthenticator, so anyone can connect without credentials. Automated scanners find exposed Cassandra ports within hours and can read or delete all data.
✅ Enable PasswordAuthenticator and CassandraAuthorizer in cassandra.yaml. Change default credentials immediately. Use network security (firewalls) as defense-in-depth, not the only protection.
Exposing JMX port (7199) to the network
JMX allows full administrative control: decommissioning nodes, dropping tables, reading all data. If exposed without authentication, anyone can destroy the cluster.
✅ Bind JMX to localhost only (or use JMX authentication). Access via SSH tunnel or bastion host. Never expose 7199 beyond the admin network.
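A quick check that the JMX port is actually local-only is sketched below; the LOCAL_JMX switch exists in the stock cassandra-env.sh of recent versions, but treat the exact mechanism as version-dependent.

```bash
# cassandra-env.sh binds JMX to 127.0.0.1 when LOCAL_JMX is left at "yes".
grep -n 'LOCAL_JMX' /etc/cassandra/cassandra-env.sh

# Verify nothing is listening on 7199 on a routable interface.
ss -ltn | grep 7199     # expect 127.0.0.1:7199, not 0.0.0.0:7199 or a public IP
```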
Relying only on multi-DC replication for backup
Multi-DC protects against infrastructure failures but not logical corruption. A bad application deploy that writes corrupt data replicates the corruption to all DCs instantly.
✅ Maintain regular snapshots (daily) uploaded to object storage in addition to multi-DC replication. Snapshots protect against logical corruption, accidental deletes, and bad deploys.
Not enabling node-to-node encryption
Without inter-node TLS, gossip, streaming, and replication traffic is plaintext. An attacker on the network can read all data in transit and potentially inject rogue nodes into the cluster.
✅ Enable server_encryption_options with internode_encryption: all and require_client_auth: true (mutual TLS). This prevents both eavesdropping and unauthorized nodes joining the cluster.
Not testing backup restore procedures
Teams take snapshots but never test the restore. When disaster strikes, they discover their backups are incomplete or corrupted, or that the restore takes far longer than expected.
✅ Quarterly DR drills: restore from backup to a test cluster, verify data integrity by querying sample data, measure actual RTO. Automate the restore process and document it, as sketched below.
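One way to keep that drill honest, sketched under assumptions: an isolated test node, a previously downloaded snapshot for my_app.users staged under /restore, and a known expected row count to compare against.

```bash
# On the test node: drop the snapshot files into the live table directory,
# tell Cassandra to pick them up, and spot-check the data.
cp /restore/my_app/users/* /var/lib/cassandra/data/my_app/users-*/

nodetool refresh my_app users                     # load the newly placed SSTables
# Full-table counts can time out on large tables; use a bounded check in real drills.
cqlsh -e "SELECT count(*) FROM my_app.users;"     # compare against the expected count
```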