Operations, Security & Limits
Running DynamoDB in production: IAM fine-grained access, DAX caching, monitoring, backup strategies, and the hard limits you must know.
IAM & Fine-Grained Access Control
DynamoDB has no database-level users or passwords. All authentication and authorization is through AWS IAM. This enables fine-grained access control down to individual items and attributes.
IAM Access Control Features
- No database credentials: all auth via AWS IAM policies
- IAM policies control which tables, which operations, which items
- dynamodb:LeadingKeys condition: restrict access to items where the partition key matches the caller's ID
- dynamodb:Attributes condition: restrict which attributes can be read/written
- Service roles: Lambda assumes a role to access DynamoDB, so no credentials live in code
```json
{
  "Effect": "Allow",
  "Action": ["dynamodb:GetItem", "dynamodb:Query", "dynamodb:UpdateItem"],
  "Resource": "arn:aws:dynamodb:us-east-1:123456:table/Users",
  "Condition": {
    "ForAllValues:StringEquals": {
      "dynamodb:LeadingKeys": ["${cognito-identity.amazonaws.com:sub}"],
      "dynamodb:Attributes": ["userId", "name", "email", "preferences"]
    },
    "StringEqualsIfExists": {
      "dynamodb:Select": "SPECIFIC_ATTRIBUTES"
    }
  }
}
```
Zero Trust by Default
DynamoDB denies all access unless explicitly granted. A Lambda function with no IAM policy cannot read or write any table. Grant least-privilege: specific tables, specific operations, specific items when possible.
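A policy statement like the one shown earlier can be generated per table and attribute list. This is a sketch, not an AWS API: the function name is ours, but the condition keys (dynamodb:LeadingKeys, dynamodb:Attributes) are real IAM condition keys.

```python
def item_scoped_policy(table_arn: str, attributes: list) -> dict:
    """Build an IAM policy statement granting item- and attribute-level
    access: the caller's Cognito identity must match the leading partition
    key, and only the listed attributes may be read or written."""
    return {
        "Effect": "Allow",
        "Action": ["dynamodb:GetItem", "dynamodb:Query", "dynamodb:UpdateItem"],
        "Resource": table_arn,
        "Condition": {
            "ForAllValues:StringEquals": {
                "dynamodb:LeadingKeys": ["${cognito-identity.amazonaws.com:sub}"],
                "dynamodb:Attributes": attributes,
            },
            # Force callers to request specific attributes, so the
            # dynamodb:Attributes restriction cannot be bypassed.
            "StringEqualsIfExists": {"dynamodb:Select": "SPECIFIC_ATTRIBUTES"},
        },
    }
```

Attach the generated statement to the role your Lambda function assumes; the function itself never sees credentials.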
Encryption & VPC Endpoints
Encryption
| Type | Key Management | Cost | Use Case |
|---|---|---|---|
| AWS owned key | AWS manages entirely | Free (default) | Most workloads |
| AWS managed key (aws/dynamodb) | AWS manages, CloudTrail visible | KMS charges | Audit trail needed |
| Customer managed key (CMK) | You control rotation, access | KMS charges | Compliance, full control |
Encryption Guarantees
- Encryption at rest: always on, cannot be disabled
- Encryption in transit: TLS always enforced (HTTPS only)
- Plain-HTTP access to DynamoDB is impossible: security by default
VPC Endpoints
VPC Endpoint Benefits
- Gateway endpoint: free, routes DynamoDB traffic within the AWS network
- Traffic never leaves the AWS network, so no internet gateway or NAT is needed
- Required for compliance regimes where data must not traverse the public internet
- No code changes needed, only route table configuration
DAX (DynamoDB Accelerator)
DAX is an in-memory cache cluster purpose-built for DynamoDB. It sits between your application and DynamoDB, providing sub-millisecond reads for cached items with minimal code changes.
| Feature | DAX | ElastiCache (Redis) |
|---|---|---|
| Purpose | DynamoDB-specific caching | General-purpose caching |
| API compatibility | Drop-in DynamoDB SDK replacement | Separate Redis client needed |
| Cache type | Write-through (item + query cache) | Application-managed |
| Consistency | Eventually consistent only | Application-controlled |
| Latency | Sub-millisecond (microseconds) | Sub-millisecond |
| Code changes | Minimal (swap client) | Significant (cache logic) |
| Use case | Read-heavy DynamoDB workloads | Any caching need |
When DAX is NOT Suitable
- Strongly consistent reads required (DAX is always eventually consistent)
- Write-heavy workloads with no read pattern
- Scan-heavy workloads (the query cache is less effective)
- Applications that need cache invalidation control
- Cost-sensitive workloads: DAX clusters have hourly charges regardless of usage
DAX request flow:
1. Read Request: the application calls the DAX client (same API as DynamoDB)
2. Cache Check: DAX checks the item cache (GetItem) or the query cache (Query)
3. Cache Hit: the cached result is returned in microseconds, with no DynamoDB call
4. Cache Miss: DAX reads from DynamoDB, caches the result, and returns it to the application
5. Write-Through: writes go to both DAX and DynamoDB simultaneously
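The cache-hit, cache-miss, and write-through behavior described above can be modeled in a few lines. This is not the real DAX client, just a toy stand-in with a plain dict playing the part of the DynamoDB table:

```python
class ToyDaxCache:
    """Toy model of DAX's item cache: reads check the cache first,
    writes go through to the backing 'table' and update the cache."""

    def __init__(self, table: dict):
        self.table = table            # stand-in for DynamoDB
        self.item_cache = {}
        self.hits = self.misses = 0

    def get_item(self, key):
        if key in self.item_cache:    # cache hit: no table call
            self.hits += 1
            return self.item_cache[key]
        self.misses += 1              # cache miss: read table, populate cache
        value = self.table.get(key)
        self.item_cache[key] = value
        return value

    def put_item(self, key, value):
        self.table[key] = value       # write-through: the table...
        self.item_cache[key] = value  # ...and the cache, together
```

Note what the model makes obvious: a write-heavy workload pays the write-through cost on every put but rarely collects a cache hit, which is why DAX is a poor fit there.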
Hot Partition Detection & Mitigation
Detecting Hot Partitions
| Tool | What It Shows | When to Use |
|---|---|---|
| CloudWatch ThrottledRequests | Requests rejected due to capacity | Alert on any throttling |
| Contributor Insights | Most accessed and throttled keys | Identify specific hot keys |
| CloudWatch ConsumedCapacity | Actual usage vs provisioned | Capacity planning |
| AWS X-Ray | Individual request traces | Diagnose specific throttled operations |
Mitigation Strategies
Hot Partition Solutions
- Write sharding: append a random suffix (1 to N) to the partition key, scatter-gather on reads
- Calculated sharding: suffix = hash(userId) % N; deterministic, no scatter-gather for known keys
- Caching: DAX or an application-level cache to absorb read hot spots
- Key redesign: choose higher-cardinality partition keys
- On-demand mode: removes capacity-based throttling (hot partition throttling is still possible)
- Request coalescing: batch reads to reduce the per-item request rate
Write Sharding Example

```
Problem:  PK = "LEADERBOARD" receives all writes
Solution: shard the key

Write: PK = "LEADERBOARD#" + random(1, 10)
Read:  query all 10 shards, merge results in the application

PK = LEADERBOARD#1  -> scores for shard 1
PK = LEADERBOARD#2  -> scores for shard 2
...
PK = LEADERBOARD#10 -> scores for shard 10

Trade-off: 10x read amplification for 10x write distribution
```
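Both sharding variants can be sketched in a few lines. The shard count, function names, and the choice of crc32 are ours (any stable hash works; Python's built-in `hash()` is salted per process, so avoid it for this):

```python
import zlib

N_SHARDS = 10  # illustrative shard count

def shard_key(base_key: str, shard_id: int) -> str:
    return f"{base_key}#{shard_id}"

def write_shard(base_key: str, user_id: str) -> str:
    """Calculated sharding: a deterministic suffix from a stable hash,
    so reads for a known user_id can target exactly one shard."""
    return shard_key(base_key, zlib.crc32(user_id.encode()) % N_SHARDS)

def scatter_gather(query_one_shard, base_key: str) -> list:
    """Random/write sharding read path: fan out over all N shards and
    merge in the application (N-fold read amplification)."""
    results = []
    for i in range(N_SHARDS):
        results.extend(query_one_shard(shard_key(base_key, i)))
    return results
```

In production, `query_one_shard` would be a DynamoDB Query against one sharded partition key; the merge (and any re-sorting) happens client-side.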
Monitoring & CloudWatch Metrics
Critical Metrics to Monitor
| Metric | What It Means | Alert Threshold |
|---|---|---|
| ThrottledRequests | Requests rejected (capacity exceeded) | Any non-zero value |
| ConsumedReadCapacityUnits | Actual RCU usage | > 80% of provisioned |
| ConsumedWriteCapacityUnits | Actual WCU usage | > 80% of provisioned |
| SystemErrors | 5xx errors from DynamoDB service | Any non-zero value |
| UserErrors | 4xx errors (bad requests, conditions) | Sudden spike |
| SuccessfulRequestLatency | p50, p90, p99 per operation | p99 > 50ms |
| ConditionalCheckFailedRequests | Optimistic locking conflicts | High rate = contention |
| ReplicationLatency | Global Tables lag between regions | > 5 seconds |
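The first alert in the table (any non-zero ThrottledRequests) can be scripted. This sketch builds the keyword arguments for boto3's `cloudwatch.put_metric_alarm()` call; the alarm name, period, and missing-data treatment are our choices:

```python
def throttle_alarm_kwargs(table_name: str, period_s: int = 60) -> dict:
    """Keyword arguments for cloudwatch.put_metric_alarm() that fire
    on ANY throttled request against the given table."""
    return {
        "AlarmName": f"{table_name}-throttled-requests",  # illustrative name
        "Namespace": "AWS/DynamoDB",
        "MetricName": "ThrottledRequests",
        "Dimensions": [{"Name": "TableName", "Value": table_name}],
        "Statistic": "Sum",
        "Period": period_s,
        "EvaluationPeriods": 1,
        "Threshold": 0,                                   # any non-zero value
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",               # no data = no throttling
    }
```

Repeat the same pattern per GSI (the dimensions gain a GlobalSecondaryIndexName entry), since GSI throttling is monitored separately from the base table.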
Contributor Insights
Contributor Insights Capabilities
- Identifies the most frequently accessed partition keys and sort keys
- Shows the most throttled keys, pinpointing hot partition problems
- Enables hot key detection without application instrumentation
- Costs extra; enable it for tables with suspected hot partition issues
- Essential for diagnosing ProvisionedThroughputExceededException
Backup & Recovery
| Feature | On-Demand Backup | Point-in-Time Recovery (PITR) |
|---|---|---|
| What it does | Full table snapshot at a point in time | Continuous backup of last 35 days |
| Granularity | Entire table at backup time | Any second within 35-day window |
| Performance impact | None (uses snapshots) | None |
| Restore | To new table (cannot restore in-place) | To new table (cannot restore in-place) |
| Retention | Indefinite (until you delete) | Rolling 35-day window |
| Cost | Per GB stored | Per GB stored + small per-table charge |
| Use case | Before migrations, compliance | Accidental deletes, bad writes, bugs |
Restore Creates a New Table
Both backup methods restore to a NEW table; you cannot restore in-place. After a restore, you must update your application to point to the new table (or rename). Plan for this in your disaster recovery runbook.
Export to S3
S3 Export Features
- Export the entire table to S3 in DynamoDB JSON or Amazon Ion format
- No capacity consumed; the export reads from PITR snapshots
- Use cases: analytics, data lake ingestion, long-term archival
- Incremental export: only changes since the last export
- Integrates with Athena for SQL queries on the exported data
Limits You Must Know
| Limit | Value | Impact |
|---|---|---|
| Maximum item size | 400 KB | Design for bounded items, large data in S3 |
| Maximum partition throughput | 3,000 RCU / 1,000 WCU | Hot partition ceiling |
| Maximum item collection size (LSI) | 10 GB | All items sharing a PK + LSI data |
| Maximum LSIs per table | 5 | Must be created at table creation |
| Maximum GSIs per table | 20 (soft limit) | Can request increase |
| Maximum tables per account/region | 2,500 (soft limit) | Can request increase |
| BatchWriteItem size | 25 items or 16 MB | Use for bulk operations |
| BatchGetItem size | 100 items or 16 MB | Use for multi-item fetches |
| Transaction size | 100 items or 4 MB | Atomic multi-item operations |
| Attribute name length | 64 KB | Keep names short (counts toward 400 KB) |
| Nested depth (Maps/Lists) | 32 levels | Rarely a practical issue |
| Query/Scan response size | 1 MB per call | Paginate with LastEvaluatedKey |
The Limits That Bite
The 400 KB item limit and 10 GB partition limit are the ones that cause production incidents. Design for them from day one. The 1 MB response limit means you must always handle pagination. The 3,000 RCU / 1,000 WCU per partition limit means hot keys have a hard ceiling regardless of table-level capacity.
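The 1 MB response limit means every Query and Scan caller must loop on LastEvaluatedKey. A minimal sketch of that loop (the page-draining function is ours; the ExclusiveStartKey/LastEvaluatedKey contract is boto3's actual pagination protocol):

```python
def query_all_pages(query_fn, **kwargs) -> list:
    """Drain a paginated Query/Scan: keep calling until the response
    no longer contains LastEvaluatedKey."""
    items, start_key = [], None
    while True:
        if start_key is not None:
            kwargs["ExclusiveStartKey"] = start_key
        page = query_fn(**kwargs)           # e.g. table.query in real code
        items.extend(page.get("Items", []))
        start_key = page.get("LastEvaluatedKey")
        if start_key is None:               # absent key = final page
            return items
```

Be careful with unbounded result sets: draining every page of a huge partition in one call path can itself become a latency and capacity problem.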
Migration Patterns
Migration Strategies
- From relational to DynamoDB: access pattern analysis first, then model
- Dual-write migration: write to both the old DB and DynamoDB during the transition
- Backfill: export from the old DB, bulk load via BatchWriteItem
- Cutover: switch reads to DynamoDB, then stop writes to the old DB
- Why migrations are hard: you must rethink the data model, not just move data
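The backfill step has to respect the 25-item BatchWriteItem limit. A sketch of the chunking (function name is ours; the RequestItems/PutRequest shape is the real BatchWriteItem payload format):

```python
BATCH_MAX = 25  # BatchWriteItem's hard per-call item limit

def batch_write_chunks(table_name: str, items: list) -> list:
    """Split a backfill into BatchWriteItem-sized RequestItems payloads.
    A real loop would pass each payload to client.batch_write_item()
    and retry anything returned in UnprocessedItems."""
    requests = []
    for i in range(0, len(items), BATCH_MAX):
        chunk = items[i:i + BATCH_MAX]
        requests.append({
            table_name: [{"PutRequest": {"Item": item}} for item in chunk]
        })
    return requests
```

Retrying UnprocessedItems with exponential backoff is not optional: under load, partial batch failure is the normal case, not the exception.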
Interview Questions
Q: How does DynamoDB handle security without database-level users?
A: All access is controlled through AWS IAM policies. Fine-grained access control uses IAM conditions: dynamodb:LeadingKeys restricts access to items where the partition key matches the caller's identity (e.g., Cognito user ID). dynamodb:Attributes restricts which attributes can be read/written. No credentials are stored in application code; Lambda assumes an IAM role.
Q: What is DAX and when would you NOT use it?
A: DAX is an in-memory cache for DynamoDB with sub-millisecond reads and drop-in SDK compatibility. Don't use it when: you need strongly consistent reads (DAX is always eventually consistent), write-heavy workloads with few reads, you need fine-grained cache invalidation control, or cost is a concern (DAX clusters charge hourly regardless of usage).
Q: How do you detect and fix a hot partition?
A: Detection: the CloudWatch ThrottledRequests metric (any non-zero value = problem) and Contributor Insights (shows the exact hot keys). Fix: (1) redesign the partition key for higher cardinality, (2) write sharding with a random suffix, (3) DAX for read hot spots, (4) on-demand mode to reduce capacity-based throttling. The root cause is always key design; operational fixes are band-aids.
Q: What is the difference between on-demand backup and PITR?
A: On-demand backup: manual snapshot at a specific moment, stored indefinitely, good for pre-migration safety. PITR: continuous backup of the last 35 days, restore to any second within that window, good for accidental deletes or bad writes. Both restore to a new table (not in-place). PITR is more flexible but has a 35-day rolling window.
Q: What are the most important DynamoDB limits to design around?
A: 400 KB item size (keep items lean, large data in S3), 10 GB item collection limit (time-bucket partition keys), 3,000 RCU / 1,000 WCU per partition (avoid hot keys), 1 MB response limit (always handle pagination), 100 items per transaction (batch complex operations). These limits are hard; hitting them causes failures, not degradation.
Common Mistakes
Not enabling PITR on production tables
Without PITR, an accidental DeleteItem or bad deployment that corrupts data is unrecoverable. PITR costs pennies per GB and provides 35-day recovery. Enable it on every production table ā no exceptions.
Fix: Enable PITR on all production tables immediately. The cost is negligible compared to the risk of data loss.
Using DAX for write-heavy workloads
DAX is a read cache. Writes go through DAX to DynamoDB (write-through) but don't benefit from caching. If your workload is 90% writes, DAX adds cost and latency without benefit.
Fix: Use DAX only when the read-to-write ratio is high. For write-heavy tables, focus on key design and capacity planning instead.
Not monitoring GSI capacity independently
GSIs have separate provisioned capacity. If a GSI is throttled, writes to the BASE TABLE fail. Many teams monitor only the base table metrics and miss GSI throttling as the root cause.
Fix: Set CloudWatch alarms on each GSI's ConsumedWriteCapacityUnits and ThrottledRequests independently.
Assuming DynamoDB limits are soft
The 400 KB item limit, 10 GB partition limit, and per-partition throughput limits are HARD limits. Hitting them causes immediate failures (not graceful degradation).
Fix: Design for these limits from day one. Add item size validation in application code. Monitor item collection sizes.
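The item size check can be approximated client-side before a write. This is a rough estimate, not DynamoDB's exact accounting (which covers attribute names plus the serialized size of every value type, including numbers, binaries, and nested maps/lists):

```python
MAX_ITEM_BYTES = 400 * 1024  # DynamoDB's hard per-item limit

def approx_item_size(item: dict) -> int:
    """Rough UTF-8 byte count of attribute names plus stringified
    values. Treat as a pre-flight estimate only."""
    size = 0
    for name, value in item.items():
        size += len(name.encode("utf-8"))       # names count toward the limit
        size += len(str(value).encode("utf-8"))
    return size

def validate_item(item: dict) -> None:
    if approx_item_size(item) > MAX_ITEM_BYTES:
        raise ValueError("item exceeds DynamoDB's 400 KB limit; move the payload to S3")
```

Rejecting an oversized item in your own code gives a clear application error instead of a ValidationException deep inside a batch or transaction.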
Over-permissive IAM policies for DynamoDB access
Granting dynamodb:* on Resource: * gives full access to all tables. This violates least-privilege and creates security risk.
Fix: Use specific actions (GetItem, Query), specific table ARNs, and condition keys (LeadingKeys) to restrict access to relevant items only.