
Operations, Security & Limits

Running DynamoDB in production — IAM fine-grained access, DAX caching, monitoring, backup strategies, and the hard limits you must know.

40 min read · 9 sections
01

IAM & Fine-Grained Access Control

DynamoDB has no database-level users or passwords. All authentication and authorization is through AWS IAM. This enables fine-grained access control down to individual items and attributes.

IAM Access Control Features

  • āœ… No database credentials — all auth via AWS IAM policies
  • āœ… IAM policies control which tables, which operations, and which items
  • āœ… dynamodb:LeadingKeys condition: restrict access to items whose partition key matches the caller's identity
  • āœ… dynamodb:Attributes condition: restrict which attributes can be read or written
  • āœ… Service roles: Lambda assumes a role to access DynamoDB — no credentials in code
iam-policy.json
{
  "Effect": "Allow",
  "Action": ["dynamodb:GetItem", "dynamodb:Query", "dynamodb:UpdateItem"],
  "Resource": "arn:aws:dynamodb:us-east-1:123456:table/Users",
  "Condition": {
    "ForAllValues:StringEquals": {
      "dynamodb:LeadingKeys": ["${cognito-identity.amazonaws.com:sub}"],
      "dynamodb:Attributes": ["userId", "name", "email", "preferences"]
    },
    "StringEqualsIfExists": {
      "dynamodb:Select": "SPECIFIC_ATTRIBUTES"
    }
  }
}

Zero Trust by Default

DynamoDB denies all access unless explicitly granted. A Lambda function with no IAM policy cannot read or write any table. Grant least-privilege: specific tables, specific operations, specific items when possible.

02

Encryption & VPC Endpoints

Encryption

Type | Key Management | Cost | Use Case
AWS owned key | AWS manages entirely | Free (default) | Most workloads
AWS managed key (aws/dynamodb) | AWS manages, CloudTrail visible | KMS charges | Audit trail needed
Customer managed key (CMK) | You control rotation and access | KMS charges | Compliance, full control

Encryption Guarantees

  • āœ… Encryption at rest: always on — cannot be disabled
  • āœ… Encryption in transit: TLS always enforced (HTTPS only)
  • āœ… Cannot access DynamoDB over plain HTTP — security by default

VPC Endpoints

VPC Endpoint Benefits

  • āœ… Gateway endpoint: free, routes DynamoDB traffic within the AWS network
  • āœ… Traffic never leaves the AWS network — no internet gateway or NAT needed
  • āœ… Required for compliance regimes where data must not traverse the public internet
  • āœ… No code changes needed — just route table configuration
03

DAX (DynamoDB Accelerator)

DAX is an in-memory cache cluster purpose-built for DynamoDB. It sits between your application and DynamoDB, providing sub-millisecond reads for cached items with minimal code changes.

Feature | DAX | ElastiCache (Redis)
Purpose | DynamoDB-specific caching | General-purpose caching
API compatibility | Drop-in DynamoDB SDK replacement | Separate Redis client needed
Cache type | Write-through (item + query cache) | Application-managed
Consistency | Eventually consistent only | Application-controlled
Latency | Sub-millisecond (microseconds) | Sub-millisecond
Code changes | Minimal (swap client) | Significant (cache logic)
Use case | Read-heavy DynamoDB workloads | Any caching need

When DAX is NOT Suitable

  • āŒStrongly consistent reads required (DAX is always eventually consistent)
  • āŒWrite-heavy workloads with no read pattern
  • āŒScan-heavy workloads (query cache less effective)
  • āŒApplications that need cache invalidation control
  • āŒCost-sensitive: DAX clusters have hourly charges regardless of usage
  1. Read Request — Application calls the DAX client (same API as DynamoDB)
  2. Cache Check — DAX checks the item cache (GetItem) or the query cache (Query)
  3. Cache Hit — Return the cached result in microseconds — no DynamoDB call
  4. Cache Miss — DAX reads from DynamoDB, caches the result, and returns it to the application
  5. Write-Through — Writes go to both DAX and DynamoDB simultaneously
04

Hot Partition Detection & Mitigation

Detecting Hot Partitions

Tool | What It Shows | When to Use
CloudWatch ThrottledRequests | Requests rejected due to capacity | Alert on any throttling
Contributor Insights | Most accessed and most throttled keys | Identify specific hot keys
CloudWatch ConsumedCapacity | Actual usage vs provisioned | Capacity planning
AWS X-Ray | Individual request traces | Diagnose specific throttled operations

Mitigation Strategies

Hot Partition Solutions

  • āœ… Write sharding: append a random suffix (1-N) to the partition key, scatter-gather on reads
  • āœ… Calculated sharding: suffix = hash(userId) % N — deterministic, no scatter-gather for known keys
  • āœ… Caching: DAX or an application-level cache to absorb read hot spots
  • āœ… Key redesign: choose higher-cardinality partition keys
  • āœ… On-demand mode: removes capacity-based throttling (hot partition throttling is still possible)
  • āœ… Request coalescing: batch reads to reduce the per-item request rate
write-sharding.txt
Write Sharding Example:
Problem: "LEADERBOARD" partition key gets all writes

Solution: Shard the key
  Write: PK = "LEADERBOARD#" + random(1, 10)
  Read:  Query all 10 shards, merge results in application

  PK = LEADERBOARD#1  → scores for shard 1
  PK = LEADERBOARD#2  → scores for shard 2
  ...
  PK = LEADERBOARD#10 → scores for shard 10

Trade-off: 10Ɨ read amplification for 10Ɨ write distribution
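The sharding pattern above can be sketched in a few lines. This is an illustrative sketch — the shard count, key format, and `query_shard` callback are assumptions, with the actual Query call left abstract:

```python
import random
import zlib

NUM_SHARDS = 10

def write_key() -> str:
    """Random-suffix sharding: spread writes across N partition keys."""
    return f"LEADERBOARD#{random.randint(1, NUM_SHARDS)}"

def calculated_key(user_id: str) -> str:
    """Calculated sharding: deterministic, so a known user's shard can be
    read directly without scatter-gather. crc32 is used because Python's
    built-in hash() is salted per process and would not be stable."""
    return f"LEADERBOARD#{zlib.crc32(user_id.encode()) % NUM_SHARDS + 1}"

def read_all_scores(query_shard):
    """Scatter-gather: query every shard and merge in the application.
    `query_shard` stands in for a Query against one partition key."""
    scores = []
    for n in range(1, NUM_SHARDS + 1):
        scores.extend(query_shard(f"LEADERBOARD#{n}"))
    return sorted(scores, key=lambda s: s["score"], reverse=True)
```

The merge step is where the 10Ɨ read amplification shows up: one logical read fans out into NUM_SHARDS queries.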
05

Monitoring & CloudWatch Metrics

Critical Metrics to Monitor

Metric | What It Means | Alert Threshold
ThrottledRequests | Requests rejected (capacity exceeded) | Any non-zero value
ConsumedReadCapacityUnits | Actual RCU usage | > 80% of provisioned
ConsumedWriteCapacityUnits | Actual WCU usage | > 80% of provisioned
SystemErrors | 5xx errors from the DynamoDB service | Any non-zero value
UserErrors | 4xx errors (bad requests, failed conditions) | Sudden spike
SuccessfulRequestLatency | p50, p90, p99 per operation | p99 > 50 ms
ConditionalCheckFailedRequests | Optimistic locking conflicts | High rate = contention
ReplicationLatency | Global Tables lag between regions | > 5 seconds
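A "any non-zero ThrottledRequests" alarm can be expressed as a parameter set for the CloudWatch PutMetricAlarm API. This is a sketch: the table name and SNS topic are placeholders, and in practice you would pass the dict to `boto3.client("cloudwatch").put_metric_alarm(**params)`:

```python
def throttle_alarm(table_name: str, sns_topic_arn: str) -> dict:
    """Alarm parameters: fire whenever any request is throttled in a
    one-minute window. Metric lives in the AWS/DynamoDB namespace,
    dimensioned by TableName."""
    return {
        "AlarmName": f"{table_name}-throttled-requests",
        "Namespace": "AWS/DynamoDB",
        "MetricName": "ThrottledRequests",
        "Dimensions": [{"Name": "TableName", "Value": table_name}],
        "Statistic": "Sum",
        "Period": 60,                    # evaluate per minute
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",  # any non-zero value
        "AlarmActions": [sns_topic_arn],
    }
```

Remember to create the same alarm per GSI (using the GlobalSecondaryIndexName dimension), since index throttling is reported separately.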

Contributor Insights

Contributor Insights Capabilities

  • āœ… Identifies the most frequently accessed partition keys and sort keys
  • āœ… Shows the most throttled keys — pinpoints hot partition problems
  • āœ… Enables hot key detection without application instrumentation
  • āœ… Additional cost — enable it for tables with suspected hot partition issues
  • āœ… Essential for diagnosing ProvisionedThroughputExceededException
06

Backup & Recovery

Feature | On-Demand Backup | Point-in-Time Recovery (PITR)
What it does | Full table snapshot at a point in time | Continuous backup of the last 35 days
Granularity | Entire table at backup time | Any second within the 35-day window
Performance impact | None (uses snapshots) | None
Restore | To a new table (cannot restore in place) | To a new table (cannot restore in place)
Retention | Indefinite (until you delete) | Rolling 35-day window
Cost | Per GB stored | Per GB stored + small per-table charge
Use case | Before migrations, compliance | Accidental deletes, bad writes, bugs

Restore Creates a New Table

Both backup methods restore to a NEW table — you cannot restore in place, and DynamoDB tables cannot be renamed. After a restore, you must update your application to point to the new table. Plan for this in your disaster recovery runbook.
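The two operations involved — enabling PITR and restoring to a new table — take parameter shapes like the following. This is a sketch of the boto3 UpdateContinuousBackups and RestoreTableToPointInTime parameter structures; the table names are examples, and the dicts would be passed to the corresponding `boto3.client("dynamodb")` methods:

```python
from datetime import datetime, timezone

# Parameters for UpdateContinuousBackups: turn PITR on for a table.
enable_pitr = {
    "TableName": "Users",
    "PointInTimeRecoverySpecification": {"PointInTimeRecoveryEnabled": True},
}

def restore_params(source: str, target: str, when: datetime) -> dict:
    """Parameters for RestoreTableToPointInTime. The target table must
    not exist yet — DynamoDB always restores to a NEW table, so the
    application has to be repointed afterwards."""
    return {
        "SourceTableName": source,
        "TargetTableName": target,
        "RestoreDateTime": when,    # any second inside the 35-day window
    }
```

Because the restored table starts with default settings, a runbook should also re-apply auto scaling, tags, and alarms after the restore completes.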

Export to S3

S3 Export Features

  • āœ… Export the entire table to S3 in DynamoDB JSON or Amazon Ion format
  • āœ… No capacity consumed — uses PITR snapshots
  • āœ… Use cases: analytics, data lake ingestion, long-term archival
  • āœ… Incremental export: only changes since the last export
  • āœ… Integrates with Athena for SQL queries on the exported data
07

Limits You Must Know

Limit | Value | Impact
Maximum item size | 400 KB | Design for bounded items; keep large data in S3
Maximum partition throughput | 3,000 RCU / 1,000 WCU | Hot partition ceiling
Maximum item collection size (LSI) | 10 GB | All items sharing a PK + LSI data
Maximum LSIs per table | 5 | Must be created at table creation
Maximum GSIs per table | 20 (soft limit) | Can request an increase
Maximum tables per account/region | 2,500 (soft limit) | Can request an increase
BatchWriteItem size | 25 items or 16 MB | Use for bulk operations
BatchGetItem size | 100 items or 16 MB | Use for multi-item fetches
Transaction size | 25 items or 4 MB | Atomic multi-item operations
Attribute name length | 64 KB | Keep names short (they count toward the 400 KB item size)
Nested depth (Maps/Lists) | 32 levels | Rarely a practical issue
Query/Scan response size | 1 MB per call | Paginate with LastEvaluatedKey

The Limits That Bite

The 400 KB item limit and 10 GB partition limit are the ones that cause production incidents. Design for them from day one. The 1 MB response limit means you must always handle pagination. The 3,000 RCU / 1,000 WCU per partition limit means hot keys have a hard ceiling regardless of table-level capacity.
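Handling the 1 MB response limit means every Query and Scan loop must follow the LastEvaluatedKey cursor. A minimal sketch, with the SDK call abstracted behind a `query_page` callback so the pagination logic stands alone (in boto3 this would be `table.query(**params)`):

```python
def query_all(query_page, params: dict) -> list:
    """Drain a paginated Query: each call returns at most 1 MB of data,
    and a LastEvaluatedKey is present whenever more pages remain."""
    items = []
    while True:
        page = query_page(**params)
        items.extend(page.get("Items", []))
        last_key = page.get("LastEvaluatedKey")
        if last_key is None:
            return items            # no cursor: this was the final page
        # Resume the next call exactly where the previous page stopped.
        params = {**params, "ExclusiveStartKey": last_key}
```

For unbounded result sets, prefer yielding pages to the caller instead of accumulating everything in memory — the same cursor logic applies.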

Migration Patterns

Migration Strategies

  • āœ… From relational to DynamoDB: access pattern analysis first, then model
  • āœ… Dual-write migration: write to both the old database and DynamoDB during the transition
  • āœ… Backfill: export from the old database, bulk load via BatchWriteItem
  • āœ… Cutover: switch reads to DynamoDB, then stop writes to the old database
  • āœ… Why migrations are hard: you must rethink the data model, not just move data
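The dual-write phase can be sketched as follows. The `old_db` and `dynamo` client objects and their `put` method are assumptions standing in for real database clients; the key idea is the asymmetric error handling:

```python
import logging

def dual_write(old_db, dynamo, key, item):
    """Dual-write phase of a migration: the old database remains the
    source of truth, so its write must succeed. A DynamoDB failure is
    logged and later repaired by the backfill rather than failing the
    user's request."""
    old_db.put(key, item)            # source of truth during transition
    try:
        dynamo.put(key, item)        # best-effort shadow write
    except Exception:
        logging.exception("shadow write failed; backfill repairs %s", key)
```

At cutover, the roles flip: DynamoDB becomes authoritative and the old-database write becomes the optional one, until it is removed entirely.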
08

Interview Questions

Q: How does DynamoDB handle security without database-level users?

A: All access is controlled through AWS IAM policies. Fine-grained access control uses IAM conditions: dynamodb:LeadingKeys restricts access to items where the partition key matches the caller's identity (e.g., Cognito user ID). dynamodb:Attributes restricts which attributes can be read/written. No credentials are stored in application code — Lambda assumes an IAM role.

Q: What is DAX and when would you NOT use it?

A: DAX is an in-memory cache for DynamoDB with sub-millisecond reads and drop-in SDK compatibility. Don't use it when: you need strongly consistent reads (DAX is always eventually consistent), write-heavy workloads with few reads, you need fine-grained cache invalidation control, or cost is a concern (DAX clusters charge hourly regardless of usage).

Q: How do you detect and fix a hot partition?

A: Detection: CloudWatch ThrottledRequests metric (any non-zero = problem), Contributor Insights (shows exact hot keys). Fix: (1) Redesign partition key for higher cardinality, (2) Write sharding with random suffix, (3) DAX for read hot spots, (4) On-demand mode to reduce capacity-based throttling. The root cause is always key design — operational fixes are band-aids.

Q: What is the difference between on-demand backup and PITR?

A: On-demand backup: manual snapshot at a specific moment, stored indefinitely, good for pre-migration safety. PITR: continuous backup of the last 35 days, restore to any second within that window, good for accidental deletes or bad writes. Both restore to a new table (not in-place). PITR is more flexible but has a 35-day rolling window.

Q: What are the most important DynamoDB limits to design around?

A: 400 KB item size (keep items lean, large data in S3), 10 GB partition collection limit (time-bucket partition keys), 3000 RCU / 1000 WCU per partition (avoid hot keys), 1 MB response limit (always handle pagination), 25 items per transaction (batch complex operations). These limits are hard — hitting them causes failures, not degradation.

09

Common Mistakes

šŸ’¾

Not enabling PITR on production tables

Without PITR, an accidental DeleteItem or bad deployment that corrupts data is unrecoverable. PITR costs pennies per GB and provides 35-day recovery. Enable it on every production table — no exceptions.

āœ… Enable PITR on all production tables immediately. The cost is negligible compared to the risk of data loss.

šŸ“

Using DAX for write-heavy workloads

DAX is a read cache. Writes go through DAX to DynamoDB (write-through) but don't benefit from caching. If your workload is 90% writes, DAX adds cost and latency without benefit.

āœ… Use DAX only when the read-to-write ratio is high. For write-heavy tables, focus on key design and capacity planning instead.

šŸ“Š

Not monitoring GSI capacity independently

GSIs have separate provisioned capacity. If a GSI is throttled, writes to the BASE TABLE fail. Many teams monitor only the base table metrics and miss GSI throttling as the root cause.

āœ… Set CloudWatch alarms on each GSI's ConsumedWriteCapacityUnits and ThrottledRequests independently.

🚧

Assuming DynamoDB limits are soft

The 400 KB item limit, 10 GB partition limit, and per-partition throughput limits are HARD limits. Hitting them causes immediate failures (not graceful degradation).

āœ… Design for these limits from day one. Add item size validation in application code. Monitor item collection sizes.
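An application-side guardrail for the 400 KB limit can be as simple as the sketch below. The size calculation is a conservative approximation — DynamoDB's exact sizing rules for numbers, sets, and nested types differ slightly — but it is enough to reject oversized items before the SDK does:

```python
MAX_ITEM_BYTES = 400 * 1024   # hard DynamoDB item size limit

def approx_item_size(item: dict) -> int:
    """Rough item size: UTF-8 length of each attribute name plus an
    estimate of its value. Approximate, not DynamoDB's exact formula."""
    size = 0
    for name, value in item.items():
        size += len(name.encode("utf-8"))       # names count toward the limit
        if isinstance(value, str):
            size += len(value.encode("utf-8"))
        elif isinstance(value, bool):           # check bool before int
            size += 1
        elif isinstance(value, (int, float)):
            size += 21                          # worst-case number size
        elif isinstance(value, bytes):
            size += len(value)
        elif value is None:
            size += 1
        elif isinstance(value, dict):
            size += 3 + approx_item_size(value)
        elif isinstance(value, list):
            size += 3 + sum(approx_item_size({"": v}) for v in value)
    return size

def validate_item(item: dict) -> None:
    """Raise before the write ever reaches DynamoDB."""
    if approx_item_size(item) > MAX_ITEM_BYTES:
        raise ValueError("item exceeds the 400 KB DynamoDB limit")
```

Running this check at the write path (or in a shared data-access layer) turns a hard service failure into an explicit, testable application error.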

šŸ”“

Over-permissive IAM policies for DynamoDB access

Granting dynamodb:* on Resource: * gives full access to all tables. This violates least-privilege and creates security risk.

āœ… Use specific actions (GetItem, Query), specific table ARNs, and condition keys (dynamodb:LeadingKeys) to restrict access to only the relevant items.