Operations, Security & Limits
Running DynamoDB in production: IAM fine-grained access, DAX caching, monitoring, backup strategies, and the hard limits you must know.
IAM & Fine-Grained Access Control
DynamoDB has no database-level users or passwords. All authentication and authorization is through AWS IAM. This enables fine-grained access control down to individual items and attributes.
IAM Access Control Features
- No database credentials: all auth via AWS IAM policies
- IAM policies control which tables, which operations, which items
- dynamodb:LeadingKeys condition: restrict access to items where the partition key matches the caller's ID
- dynamodb:Attributes condition: restrict which attributes can be read/written
- Service roles: Lambda assumes a role to access DynamoDB, so no credentials live in code
```json
{
  "Effect": "Allow",
  "Action": ["dynamodb:GetItem", "dynamodb:Query", "dynamodb:UpdateItem"],
  "Resource": "arn:aws:dynamodb:us-east-1:123456:table/Users",
  "Condition": {
    "ForAllValues:StringEquals": {
      "dynamodb:LeadingKeys": ["${cognito-identity.amazonaws.com:sub}"],
      "dynamodb:Attributes": ["userId", "name", "email", "preferences"]
    },
    "StringEqualsIfExists": {
      "dynamodb:Select": "SPECIFIC_ATTRIBUTES"
    }
  }
}
```
Zero Trust by Default
DynamoDB denies all access unless explicitly granted. A Lambda function with no IAM policy cannot read or write any table. Grant least-privilege: specific tables, specific operations, specific items when possible.
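A policy statement like the one shown earlier can be generated per table and attribute list. This is a sketch, not an AWS API: the function name is ours, but the condition keys (dynamodb:LeadingKeys, dynamodb:Attributes) are real IAM condition keys.

```python
def item_scoped_policy(table_arn: str, attributes: list) -> dict:
    """Build an IAM policy statement granting item- and attribute-level
    access: the caller's Cognito identity must match the leading partition
    key, and only the listed attributes may be read or written."""
    return {
        "Effect": "Allow",
        "Action": ["dynamodb:GetItem", "dynamodb:Query", "dynamodb:UpdateItem"],
        "Resource": table_arn,
        "Condition": {
            "ForAllValues:StringEquals": {
                "dynamodb:LeadingKeys": ["${cognito-identity.amazonaws.com:sub}"],
                "dynamodb:Attributes": attributes,
            },
            # Force callers to request specific attributes, so the
            # dynamodb:Attributes restriction cannot be bypassed.
            "StringEqualsIfExists": {"dynamodb:Select": "SPECIFIC_ATTRIBUTES"},
        },
    }
```

Attach the generated statement to the role your Lambda function assumes; the function itself never sees credentials.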
Encryption & VPC Endpoints
Encryption
| Type | Key Management | Cost | Use Case |
|---|---|---|---|
| AWS owned key | AWS manages entirely | Free (default) | Most workloads |
| AWS managed key (aws/dynamodb) | AWS manages, CloudTrail visible | KMS charges | Audit trail needed |
| Customer managed key (CMK) | You control rotation, access | KMS charges | Compliance, full control |
Encryption Guarantees
- Encryption at rest: always on, cannot be disabled
- Encryption in transit: TLS always enforced (HTTPS only)
- Plain-HTTP access to DynamoDB is impossible: security by default
VPC Endpoints
VPC Endpoint Benefits
- Gateway endpoint: free, routes DynamoDB traffic within the AWS network
- Traffic never leaves the AWS network, so no internet gateway or NAT is needed
- Required for compliance regimes where data must not traverse the public internet
- No code changes needed, only route table configuration
DAX (DynamoDB Accelerator)
DAX is an in-memory cache cluster purpose-built for DynamoDB. It sits between your application and DynamoDB, providing sub-millisecond reads for cached items with minimal code changes.
| Feature | DAX | ElastiCache (Redis) |
|---|---|---|
| Purpose | DynamoDB-specific caching | General-purpose caching |
| API compatibility | Drop-in DynamoDB SDK replacement | Separate Redis client needed |
| Cache type | Write-through (item + query cache) | Application-managed |
| Consistency | Eventually consistent only | Application-controlled |
| Latency | Sub-millisecond (microseconds) | Sub-millisecond |
| Code changes | Minimal (swap client) | Significant (cache logic) |
| Use case | Read-heavy DynamoDB workloads | Any caching need |
When DAX is NOT Suitable
- Strongly consistent reads required (DAX is always eventually consistent)
- Write-heavy workloads with no read pattern
- Scan-heavy workloads (the query cache is less effective)
- Applications that need cache invalidation control
- Cost-sensitive workloads: DAX clusters have hourly charges regardless of usage
DAX request flow:
1. Read Request: the application calls the DAX client (same API as DynamoDB)
2. Cache Check: DAX checks the item cache (GetItem) or the query cache (Query)
3. Cache Hit: the cached result is returned in microseconds, with no DynamoDB call
4. Cache Miss: DAX reads from DynamoDB, caches the result, and returns it to the application
5. Write-Through: writes go to both DAX and DynamoDB simultaneously
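The cache-hit, cache-miss, and write-through behavior described above can be modeled in a few lines. This is not the real DAX client, just a toy stand-in with a plain dict playing the part of the DynamoDB table:

```python
class ToyDaxCache:
    """Toy model of DAX's item cache: reads check the cache first,
    writes go through to the backing 'table' and update the cache."""

    def __init__(self, table: dict):
        self.table = table            # stand-in for DynamoDB
        self.item_cache = {}
        self.hits = self.misses = 0

    def get_item(self, key):
        if key in self.item_cache:    # cache hit: no table call
            self.hits += 1
            return self.item_cache[key]
        self.misses += 1              # cache miss: read table, populate cache
        value = self.table.get(key)
        self.item_cache[key] = value
        return value

    def put_item(self, key, value):
        self.table[key] = value       # write-through: the table...
        self.item_cache[key] = value  # ...and the cache, together
```

Note what the model makes obvious: a write-heavy workload pays the write-through cost on every put but rarely collects a cache hit, which is why DAX is a poor fit there.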
Hot Partition Detection & Mitigation
Detecting Hot Partitions
| Tool | What It Shows | When to Use |
|---|---|---|
| CloudWatch ThrottledRequests | Requests rejected due to capacity | Alert on any throttling |
| Contributor Insights | Most accessed and throttled keys | Identify specific hot keys |
| CloudWatch ConsumedCapacity | Actual usage vs provisioned | Capacity planning |
| AWS X-Ray | Individual request traces | Diagnose specific throttled operations |
Mitigation Strategies
Hot Partition Solutions
- Write sharding: append a random suffix (1 to N) to the partition key, scatter-gather on reads
- Calculated sharding: suffix = hash(userId) % N; deterministic, no scatter-gather for known keys
- Caching: DAX or an application-level cache to absorb read hot spots
- Key redesign: choose higher-cardinality partition keys
- On-demand mode: removes capacity-based throttling (hot partition throttling is still possible)
- Request coalescing: batch reads to reduce the per-item request rate
Write Sharding Example

```
Problem:  PK = "LEADERBOARD" receives all writes
Solution: shard the key

Write: PK = "LEADERBOARD#" + random(1, 10)
Read:  query all 10 shards, merge results in the application

PK = LEADERBOARD#1  -> scores for shard 1
PK = LEADERBOARD#2  -> scores for shard 2
...
PK = LEADERBOARD#10 -> scores for shard 10

Trade-off: 10x read amplification for 10x write distribution
```
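Both sharding variants can be sketched in a few lines. The shard count, function names, and the choice of crc32 are ours (any stable hash works; Python's built-in `hash()` is salted per process, so avoid it for this):

```python
import zlib

N_SHARDS = 10  # illustrative shard count

def shard_key(base_key: str, shard_id: int) -> str:
    return f"{base_key}#{shard_id}"

def write_shard(base_key: str, user_id: str) -> str:
    """Calculated sharding: a deterministic suffix from a stable hash,
    so reads for a known user_id can target exactly one shard."""
    return shard_key(base_key, zlib.crc32(user_id.encode()) % N_SHARDS)

def scatter_gather(query_one_shard, base_key: str) -> list:
    """Random/write sharding read path: fan out over all N shards and
    merge in the application (N-fold read amplification)."""
    results = []
    for i in range(N_SHARDS):
        results.extend(query_one_shard(shard_key(base_key, i)))
    return results
```

In production, `query_one_shard` would be a DynamoDB Query against one sharded partition key; the merge (and any re-sorting) happens client-side.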
Monitoring & CloudWatch Metrics
Critical Metrics to Monitor
| Metric | What It Means | Alert Threshold |
|---|---|---|
| ThrottledRequests | Requests rejected (capacity exceeded) | Any non-zero value |
| ConsumedReadCapacityUnits | Actual RCU usage | > 80% of provisioned |
| ConsumedWriteCapacityUnits | Actual WCU usage | > 80% of provisioned |
| SystemErrors | 5xx errors from DynamoDB service | Any non-zero value |
| UserErrors | 4xx errors (bad requests, conditions) | Sudden spike |
| SuccessfulRequestLatency | p50, p90, p99 per operation | p99 > 50ms |
| ConditionalCheckFailedRequests | Optimistic locking conflicts | High rate = contention |
| ReplicationLatency | Global Tables lag between regions | > 5 seconds |
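The first alert in the table (any non-zero ThrottledRequests) can be scripted. This sketch builds the keyword arguments for boto3's `cloudwatch.put_metric_alarm()` call; the alarm name, period, and missing-data treatment are our choices:

```python
def throttle_alarm_kwargs(table_name: str, period_s: int = 60) -> dict:
    """Keyword arguments for cloudwatch.put_metric_alarm() that fire
    on ANY throttled request against the given table."""
    return {
        "AlarmName": f"{table_name}-throttled-requests",  # illustrative name
        "Namespace": "AWS/DynamoDB",
        "MetricName": "ThrottledRequests",
        "Dimensions": [{"Name": "TableName", "Value": table_name}],
        "Statistic": "Sum",
        "Period": period_s,
        "EvaluationPeriods": 1,
        "Threshold": 0,                                   # any non-zero value
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",               # no data = no throttling
    }
```

Repeat the same pattern per GSI (the dimensions gain a GlobalSecondaryIndexName entry), since GSI throttling is monitored separately from the base table.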
Contributor Insights
Contributor Insights Capabilities
- Identifies the most frequently accessed partition keys and sort keys
- Shows the most throttled keys, pinpointing hot partition problems
- Enables hot key detection without application instrumentation
- Costs extra; enable it for tables with suspected hot partition issues
- Essential for diagnosing ProvisionedThroughputExceededException
Backup & Recovery
| Feature | On-Demand Backup | Point-in-Time Recovery (PITR) |
|---|---|---|
| What it does | Full table snapshot at a point in time | Continuous backup of last 35 days |
| Granularity | Entire table at backup time | Any second within 35-day window |
| Performance impact | None (uses snapshots) | None |
| Restore | To new table (cannot restore in-place) | To new table (cannot restore in-place) |
| Retention | Indefinite (until you delete) | Rolling 35-day window |
| Cost | Per GB stored | Per GB stored + small per-table charge |
| Use case | Before migrations, compliance | Accidental deletes, bad writes, bugs |
Restore Creates a New Table
Both backup methods restore to a NEW table; you cannot restore in-place. After a restore, you must update your application to point to the new table (or rename). Plan for this in your disaster recovery runbook.
Export to S3
S3 Export Features
- Export the entire table to S3 in DynamoDB JSON or Amazon Ion format
- No capacity consumed; the export reads from PITR snapshots
- Use cases: analytics, data lake ingestion, long-term archival
- Incremental export: only changes since the last export
- Integrates with Athena for SQL queries on the exported data
Limits You Must Know
| Limit | Value | Impact |
|---|---|---|
| Maximum item size | 400 KB | Design for bounded items, large data in S3 |
| Maximum partition throughput | 3,000 RCU / 1,000 WCU | Hot partition ceiling |
| Maximum item collection size (LSI) | 10 GB | All items sharing a PK + LSI data |
| Maximum LSIs per table | 5 | Must be created at table creation |
| Maximum GSIs per table | 20 (soft limit) | Can request increase |
| Maximum tables per account/region | 2,500 (soft limit) | Can request increase |
| BatchWriteItem size | 25 items or 16 MB | Use for bulk operations |
| BatchGetItem size | 100 items or 16 MB | Use for multi-item fetches |
| Transaction size | 100 items or 4 MB | Atomic multi-item operations |
| Attribute name length | 64 KB | Keep names short (counts toward 400 KB) |
| Nested depth (Maps/Lists) | 32 levels | Rarely a practical issue |
| Query/Scan response size | 1 MB per call | Paginate with LastEvaluatedKey |
The Limits That Bite
The 400 KB item limit and 10 GB partition limit are the ones that cause production incidents. Design for them from day one. The 1 MB response limit means you must always handle pagination. The 3,000 RCU / 1,000 WCU per partition limit means hot keys have a hard ceiling regardless of table-level capacity.
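The 1 MB response limit means every Query and Scan caller must loop on LastEvaluatedKey. A minimal sketch of that loop (the page-draining function is ours; the ExclusiveStartKey/LastEvaluatedKey contract is boto3's actual pagination protocol):

```python
def query_all_pages(query_fn, **kwargs) -> list:
    """Drain a paginated Query/Scan: keep calling until the response
    no longer contains LastEvaluatedKey."""
    items, start_key = [], None
    while True:
        if start_key is not None:
            kwargs["ExclusiveStartKey"] = start_key
        page = query_fn(**kwargs)           # e.g. table.query in real code
        items.extend(page.get("Items", []))
        start_key = page.get("LastEvaluatedKey")
        if start_key is None:               # absent key = final page
            return items
```

Be careful with unbounded result sets: draining every page of a huge partition in one call path can itself become a latency and capacity problem.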
Migration Patterns
Migration Strategies
- From relational to DynamoDB: access pattern analysis first, then model
- Dual-write migration: write to both the old DB and DynamoDB during the transition
- Backfill: export from the old DB, bulk load via BatchWriteItem
- Cutover: switch reads to DynamoDB, then stop writes to the old DB
- Why migrations are hard: you must rethink the data model, not just move data
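The backfill step has to respect the 25-item BatchWriteItem limit. A sketch of the chunking (function name is ours; the RequestItems/PutRequest shape is the real BatchWriteItem payload format):

```python
BATCH_MAX = 25  # BatchWriteItem's hard per-call item limit

def batch_write_chunks(table_name: str, items: list) -> list:
    """Split a backfill into BatchWriteItem-sized RequestItems payloads.
    A real loop would pass each payload to client.batch_write_item()
    and retry anything returned in UnprocessedItems."""
    requests = []
    for i in range(0, len(items), BATCH_MAX):
        chunk = items[i:i + BATCH_MAX]
        requests.append({
            table_name: [{"PutRequest": {"Item": item}} for item in chunk]
        })
    return requests
```

Retrying UnprocessedItems with exponential backoff is not optional: under load, partial batch failure is the normal case, not the exception.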
Interview Questions
Q: How does DynamoDB handle security without database-level users?
A: All access is controlled through AWS IAM policies. Fine-grained access control uses IAM conditions: dynamodb:LeadingKeys restricts access to items where the partition key matches the caller's identity (e.g., Cognito user ID). dynamodb:Attributes restricts which attributes can be read/written. No credentials are stored in application code; Lambda assumes an IAM role.
Q: What is DAX and when would you NOT use it?
A: DAX is an in-memory cache for DynamoDB with sub-millisecond reads and drop-in SDK compatibility. Don't use it when: you need strongly consistent reads (DAX is always eventually consistent), write-heavy workloads with few reads, you need fine-grained cache invalidation control, or cost is a concern (DAX clusters charge hourly regardless of usage).
Q: How do you detect and fix a hot partition?
A: Detection: the CloudWatch ThrottledRequests metric (any non-zero value = problem) and Contributor Insights (shows the exact hot keys). Fix: (1) redesign the partition key for higher cardinality, (2) write sharding with a random suffix, (3) DAX for read hot spots, (4) on-demand mode to reduce capacity-based throttling. The root cause is always key design; operational fixes are band-aids.
Q: What is the difference between on-demand backup and PITR?
A: On-demand backup: manual snapshot at a specific moment, stored indefinitely, good for pre-migration safety. PITR: continuous backup of the last 35 days, restore to any second within that window, good for accidental deletes or bad writes. Both restore to a new table (not in-place). PITR is more flexible but has a 35-day rolling window.
Q: What are the most important DynamoDB limits to design around?
A: 400 KB item size (keep items lean, large data in S3), 10 GB item collection limit (time-bucket partition keys), 3,000 RCU / 1,000 WCU per partition (avoid hot keys), 1 MB response limit (always handle pagination), 100 items per transaction (batch complex operations). These limits are hard; hitting them causes failures, not degradation.
Common Mistakes
Not enabling PITR on production tables
Without PITR, an accidental DeleteItem or bad deployment that corrupts data is unrecoverable. PITR costs pennies per GB and provides 35-day recovery. Enable it on every production table ā no exceptions.
Fix: Enable PITR on all production tables immediately. The cost is negligible compared to the risk of data loss.
Using DAX for write-heavy workloads
DAX is a read cache. Writes go through DAX to DynamoDB (write-through) but don't benefit from caching. If your workload is 90% writes, DAX adds cost and latency without benefit.
Fix: Use DAX only when the read-to-write ratio is high. For write-heavy tables, focus on key design and capacity planning instead.
Not monitoring GSI capacity independently
GSIs have separate provisioned capacity. If a GSI is throttled, writes to the BASE TABLE fail. Many teams monitor only the base table metrics and miss GSI throttling as the root cause.
Fix: Set CloudWatch alarms on each GSI's ConsumedWriteCapacityUnits and ThrottledRequests independently.
Assuming DynamoDB limits are soft
The 400 KB item limit, 10 GB partition limit, and per-partition throughput limits are HARD limits. Hitting them causes immediate failures (not graceful degradation).
Fix: Design for these limits from day one. Add item size validation in application code. Monitor item collection sizes.
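The item size check can be approximated client-side before a write. This is a rough estimate, not DynamoDB's exact accounting (which covers attribute names plus the serialized size of every value type, including numbers, binaries, and nested maps/lists):

```python
MAX_ITEM_BYTES = 400 * 1024  # DynamoDB's hard per-item limit

def approx_item_size(item: dict) -> int:
    """Rough UTF-8 byte count of attribute names plus stringified
    values. Treat as a pre-flight estimate only."""
    size = 0
    for name, value in item.items():
        size += len(name.encode("utf-8"))       # names count toward the limit
        size += len(str(value).encode("utf-8"))
    return size

def validate_item(item: dict) -> None:
    if approx_item_size(item) > MAX_ITEM_BYTES:
        raise ValueError("item exceeds DynamoDB's 400 KB limit; move the payload to S3")
```

Rejecting an oversized item in your own code gives a clear application error instead of a ValidationException deep inside a batch or transaction.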
Over-permissive IAM policies for DynamoDB access
Granting dynamodb:* on Resource: * gives full access to all tables. This violates least-privilege and creates security risk.
Fix: Use specific actions (GetItem, Query), specific table ARNs, and condition keys (LeadingKeys) to restrict access to relevant items only.