Core Architecture
Shards, nodes, clusters, and how Elasticsearch distributes data for scale, resilience, and near-real-time search.
Documents & Indices
Everything in Elasticsearch starts with a document: a JSON object that represents a single unit of data. Documents are the atomic unit of indexing and search. Unlike rows in a relational database, documents are schema-flexible and self-contained.
```
# Every document has metadata fields:
{
  "_index": "products",       ← which index this doc belongs to
  "_id": "abc123",            ← unique identifier (auto-generated or explicit)
  "_version": 3,              ← incremented on every update
  "_source": {                ← the actual JSON you indexed
    "name": "Running Shoes",
    "price": 129.99,
    "category": "footwear",
    "description": "Lightweight trail running shoes"
  }
}
```

Key properties:

- Documents are JSON: nested objects and arrays are all valid
- Documents are immutable: you cannot modify a document in place
- Updates = delete old version + reindex new version (internally)
- _id is unique within an index (not globally)
- _source stores the original JSON (retrievable, but not searched directly)
📌 Immutability is Fundamental
When you "update" a document, Elasticsearch internally marks the old version as deleted and indexes a completely new document. This is because Lucene segments are immutable: once written, they never change. This design enables lock-free concurrent reads and makes crash recovery straightforward.
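To see this in practice, the sketch below (using a hypothetical products index and document id) indexes a document, applies a partial update, and reads back the incremented _version; the superseded copy only counts as a deleted doc until a merge reclaims it.

```
# Index a document, then "update" it (hypothetical index and _id)
PUT /products/_doc/abc123
{ "name": "Running Shoes", "price": 129.99 }

# Partial update: internally the whole document is reindexed
POST /products/_update/abc123
{ "doc": { "price": 119.99 } }

# The read-back shows "_version": 2; the old copy is now a tombstone
GET /products/_doc/abc123

# Superseded docs linger until a segment merge reclaims them
GET /products/_stats/docs?filter_path=indices.products.primaries.docs
```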
Indices: Collections of Documents
An index is a collection of documents that share similar characteristics. It's the top-level container, analogous to a database table, but more flexible. Each index has its own mapping (schema), settings (shard count, analyzers), and data.
```
Index: "products"
├── Mapping (schema): defines field types and analyzers
├── Settings: number_of_shards=5, number_of_replicas=1
├── Aliases: "products-live" → "products-v2"
└── Documents: millions of product JSON objects
```

Common patterns:

- One index per entity type: products, users, orders
- Time-based indices: logs-2024-01-15, logs-2024-01-16
- Versioned indices: products-v1, products-v2 (swap via alias)

Index naming rules:

- Lowercase only
- No special characters (except hyphens and underscores)
- Cannot start with - or _
- Max 255 characters
Index Aliases: Zero-Downtime Reindexing
An alias is a pointer to one or more indices. Clients query the alias, not the index directly. This lets you swap the underlying index without any client changes, which is essential for zero-downtime reindexing when you need to change mappings.
```
# Step 1: Create new index with updated mapping
PUT /products-v2
{
  "mappings": { ... },
  "settings": { ... }
}

# Step 2: Reindex all documents from old to new
POST /_reindex
{
  "source": { "index": "products-v1" },
  "dest":   { "index": "products-v2" }
}

# Step 3: Atomic alias swap (clients never notice)
POST /_aliases
{
  "actions": [
    { "remove": { "index": "products-v1", "alias": "products" } },
    { "add":    { "index": "products-v2", "alias": "products" } }
  ]
}

# Clients always query the "products" alias, so there is zero downtime
# Old index can be deleted after verification
```
When to Use Aliases
- ✓ Zero-downtime reindexing when mappings change
- ✓ Blue-green deployments for index upgrades
- ✓ Filtering aliases: route queries to a subset of data (see the sketch after this list)
- ✓ Time-based rollover: 'logs-current' always points to the newest index
- ✓ Multi-tenant isolation: each tenant alias points to their data
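As a sketch of the filtering-alias pattern mentioned above, the alias below restricts queries to one slice of an index. The orders index, the orders-emea alias name, and the region field are illustrative, not part of any standard setup.

```
POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "orders",
        "alias": "orders-emea",
        "filter": { "term": { "region": "emea" } }
      }
    }
  ]
}

# Searches through the alias only ever see documents matching the filter
GET /orders-emea/_search
```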
Shards
A shard is the fundamental unit of storage and parallelism in Elasticsearch. Every index is divided into one or more primary shards, and each shard is a self-contained Lucene index: a fully functional search engine that can index and serve queries independently.
The number of primary shards is fixed at index creation. You cannot add or remove primary shards later without reindexing (the _split and _shrink APIs can change the count, but they too produce a new index). This is one of the most important capacity planning decisions you make.
```
Index: "products"  →  5 primary shards, 1 replica

   Node 1 (data)        Node 2 (data)        Node 3 (data)
  ┌───────────────┐    ┌───────────────┐    ┌───────────────┐
  │  P0    P3     │    │  P1    P4     │    │  P2           │
  │  R1    R4     │    │  R2    R0     │    │  R3           │
  └───────────────┘    └───────────────┘    └───────────────┘

  P = Primary shard    R = Replica shard

  • Each primary lives on exactly one node
  • Replicas are placed on DIFFERENT nodes than their primary
  • Queries can be served by either primary or replica
  • Writes always go to the primary first, then replicate

Document routing:

  shard = hash(_routing) % number_of_primary_shards
  default _routing = document _id
```

This is WHY you can't change shard count after creation: the routing formula would assign documents to the wrong shards.
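Because the primary shard count is locked in at creation, it has to be declared up front. A minimal sketch (the shard and replica counts are illustrative):

```
PUT /products
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}
# number_of_shards is fixed for the life of the index;
# number_of_replicas can be changed at any time.
```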
Shards as Filing Cabinet Drawers
An index is a filing cabinet. Each shard is a drawer. When you add a document, it goes into a specific drawer based on a hash of its ID. You can't add more drawers later because the filing system (hash formula) would break: documents would end up in the wrong drawer. Each drawer is independent: it has its own internal organization (Lucene index) and can be searched in parallel with other drawers.
Why Sharding Exists
Benefits of Sharding
- ✓ Horizontal scaling: distribute data across multiple nodes when a single node can't hold it all
- ✓ Parallelism: search queries execute on all shards simultaneously, then results are merged
- ✓ Throughput: more shards = more nodes can serve reads in parallel
- ✓ Data locality: related documents can be routed to the same shard via custom routing
Shard Sizing Guidelines
Shard Size Rules of Thumb
- ✓ Target 20-40 GB per shard for most use cases (search-heavy workloads)
- ✓ Up to 50 GB per shard for logging/time-series (write-heavy, less search)
- ✓ Minimum: avoid shards smaller than a few GB; the overhead isn't worth it
- ✓ Formula: number_of_shards = ceil(expected_data_size / 30 GB)
- ✓ Each shard consumes memory for segment metadata, field data, and caches
⚠️ The Over-Sharding Problem
Each shard has fixed overhead: file descriptors, memory for segment metadata, thread pool slots, and cluster state. A cluster with thousands of tiny shards (under 1 GB each) wastes resources on overhead instead of actual data. The rule: fewer, larger shards are better than many small ones. Aim for the fewest shards that keep each under 40 GB.
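To audit whether a cluster is drifting toward over-sharding, the _cat APIs give a quick per-shard and per-node view (the column selection here is one reasonable choice, not the only one):

```
# One row per shard copy, sorted by on-disk size (largest first)
GET /_cat/shards?v&h=index,shard,prirep,store,node&s=store:desc

# Shard count and disk usage per node
GET /_cat/allocation?v
```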
Replica Shards
A replica shard is an exact copy of a primary shard, stored on a different node. Replicas serve two purposes: high availability (if a node dies, the replica is promoted to primary) and read throughput (replicas can serve search queries in parallel with the primary).
```
Index: "orders"  →  3 primaries, 2 replicas (number_of_replicas=2)
Each primary shard has 2 copies → 3 primaries + 6 replicas = 9 total shards

     Node 1             Node 2             Node 3
  ┌────────────┐     ┌────────────┐     ┌────────────┐
  │  P0        │     │  P1        │     │  P2        │
  │  R1-copy1  │     │  R0-copy1  │     │  R0-copy2  │
  │  R2-copy1  │     │  R2-copy2  │     │  R1-copy2  │
  └────────────┘     └────────────┘     └────────────┘

Write path:
  1. Client sends index request → coordinating node
  2. Routed to primary shard (P0 on Node 1)
  3. Primary indexes document, writes to translog
  4. Primary forwards to ALL replicas (R0-copy1, R0-copy2)
  5. Replicas acknowledge → primary acknowledges client

Read path:
  • Any copy (primary OR replica) can serve a read
  • Coordinating node round-robins across available copies
  • More replicas = more read throughput (linear scaling)
```
| Aspect | Primary Shard | Replica Shard |
|---|---|---|
| Writes | Handles all writes first | Receives writes from primary only |
| Reads | Can serve reads | Can serve reads (load balanced) |
| Count change | Fixed at index creation | Can change anytime (dynamic) |
| Node placement | Any data node | Must be on a DIFFERENT node than its primary |
| Promotion | N/A | Promoted to primary if original primary's node fails |
💡 Disable Replicas During Bulk Load
When bulk-loading large amounts of data (initial index population, reindexing), set number_of_replicas: 0 temporarily. This avoids duplicating every write to replicas during the load. After the bulk load completes, set replicas back to your desired count; Elasticsearch will create them from the primaries. This can cut bulk indexing time in half.
Replica Configuration Guidelines
- ✓ Minimum 1 replica for production (tolerates single node failure)
- ✓ 2 replicas for critical data (tolerates 2 simultaneous node failures)
- ✓ 0 replicas only for dev/test or during bulk loads
- ✓ More replicas = more read throughput, but more disk usage and write latency
- ✓ Single-node clusters: replicas stay unassigned (yellow health); this is expected
Nodes & Roles
A node is a single Elasticsearch instance (one JVM process). Multiple nodes form a cluster. Each node can have one or more roles that determine what work it performs. In small clusters, nodes wear multiple hats. In large clusters, dedicated role assignment prevents resource contention.
| Role | Responsibility | When to Dedicate |
|---|---|---|
| master-eligible | Manages cluster state: index creation, shard allocation, node tracking. Only one active master at a time (elected). | 3+ dedicated master nodes in production so a quorum survives node failure (prevents split-brain) |
| data | Stores shards, executes CRUD operations, search queries, and aggregations. The workhorse. | Always β most nodes are data nodes. Scale horizontally for capacity. |
| ingest | Runs ingest pipelines (transform documents before indexing): grok, date parsing, enrichment. | When pipelines are CPU-heavy and you don't want to impact search latency. |
| coordinating (no role) | Routes requests, scatters queries to shards, gathers and merges results. Every node does this, but dedicated ones only do this. | High query volume with heavy aggregations β offloads merge work from data nodes. |
| ml | Runs machine learning jobs (anomaly detection, inference). | When ML jobs compete with search for CPU/memory. |
```
# Dedicated master node (no data, no ingest)
node.roles: [master]
# Lightweight: needs minimal CPU/disk, moderate RAM for cluster state

# Dedicated data node (the workhorse)
node.roles: [data]
# Heavy: needs lots of disk, RAM, and CPU for indexing/search

# Dedicated ingest node
node.roles: [ingest]
# Moderate: CPU for pipeline processing

# Coordinating-only node (no roles = coordinating only)
node.roles: []
# Acts as a smart load balancer: scatters/gathers queries

# Small cluster (mixed roles, all nodes do everything)
node.roles: [master, data, ingest]
# Fine for < 5 nodes, but doesn't scale well

Production topology (typical):
  3 dedicated master nodes     (m5.large: small, stable)
  N dedicated data nodes       (i3.2xlarge: big disk, lots of RAM)
  2 coordinating-only nodes    (c5.xlarge: CPU for merging results)
  Optional: 1-2 ingest nodes   (c5.xlarge: CPU for pipelines)
```
📌 Master Election & Split-Brain
Elasticsearch uses a quorum-based master election. With 3 master-eligible nodes, a majority (2) must agree on the active master. This prevents split-brain, where two halves of a network partition each elect their own master and diverge. Always use an odd number of master-eligible nodes (3 or 5). Never use 2: if one fails or the two are partitioned, no majority exists and the cluster cannot elect a master.
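A minimal elasticsearch.yml sketch for one of three dedicated master-eligible nodes; the node names and hostnames are placeholders, and cluster.initial_master_nodes is only honored the first time the cluster bootstraps.

```
cluster.name: production
node.name: master-1
node.roles: [master]

# Nodes contacted during discovery
discovery.seed_hosts: ["master-1", "master-2", "master-3"]

# Used only when the cluster forms for the first time
cluster.initial_master_nodes: ["master-1", "master-2", "master-3"]
```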
Cluster Health
Cluster health is a traffic-light indicator of shard allocation status. It tells you whether all shards are properly assigned and replicated. This is the first thing you check when troubleshooting any Elasticsearch issue.
| Status | Meaning | Action Required |
|---|---|---|
| 🟢 Green | All primary shards AND all replica shards are allocated. Full redundancy. | None: everything is healthy. |
| 🟡 Yellow | All primary shards are allocated, but some replicas are NOT. Data is accessible but not fully redundant. | Investigate: usually means not enough nodes to place replicas (e.g., single-node cluster with replicas=1). |
| 🔴 Red | Some primary shards are NOT allocated. Data loss or unavailability for those shards. | Urgent: identify unassigned primaries. Could be disk full, node crash, or corruption. |
```
GET /_cluster/health

{
  "cluster_name": "production",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 5,
  "number_of_data_nodes": 3,
  "active_primary_shards": 42,
  "active_shards": 78,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 6,                  ← these are the problem
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 92.86
}

# Diagnosis: 6 unassigned shards (replicas, since status is yellow not red)
# Likely cause: only 3 data nodes but some indices need replicas on 4+ nodes

# To find which shards are unassigned:
GET /_cat/shards?v&h=index,shard,prirep,state,unassigned.reason&s=state

# To see why a shard can't be allocated:
GET /_cluster/allocation/explain
```
💡 Yellow is Not Always a Problem
A single-node development cluster will always be yellow because replicas cannot be placed on the same node as their primary. This is expected and safe for dev. In production, yellow means you have reduced redundancy; one more node failure could cause data loss. Treat yellow as a warning that needs investigation, not an emergency.
Common Causes of Unassigned Shards
- ✗ Not enough nodes to satisfy replica placement rules (a replica can't be on the same node as its primary)
- ✗ Disk watermark exceeded: the node is too full (default low watermark 85%, high watermark 90%)
- ✗ Shard allocation filtering: node attributes exclude available nodes
- ✗ Node left the cluster: shards waiting for the delayed allocation timeout
- ✗ Index created with more replicas than available nodes minus one
Segments & Near-Real-Time
Elasticsearch is built on Apache Lucene, and Lucene stores data in segments: immutable files that contain a portion of the inverted index. Understanding segments explains why ES is "near-real-time" rather than real-time, and why updates are expensive.
```
Document indexed → In-memory buffer → [refresh] → New segment (searchable)
                 → Translog (durability) → [flush] → Segment on disk, translog cleared

Timeline of a write:
  t=0ms      Document arrives, written to in-memory buffer + translog
  t=0ms      Document is NOT yet searchable
  t=1000ms   Refresh happens (every 1 second by default)
             → In-memory buffer becomes a new immutable segment
             → Document is NOW searchable
  t=30min    Flush happens (or translog size threshold)
             → Segment fsync'd to disk
             → Translog cleared

This 1-second gap is why ES is "near-real-time":
  • You index a document at t=0
  • It becomes searchable at t=1s (after the next refresh)
  • NOT instantly searchable like a database INSERT
```
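Refresh behavior can be tuned or triggered explicitly; a quick sketch against the hypothetical products index from earlier:

```
# Lengthen the refresh interval on a write-heavy index (dynamic setting)
PUT /products/_settings
{ "index": { "refresh_interval": "30s" } }

# Force an immediate refresh (useful in tests, expensive in production)
POST /products/_refresh

# Or make one write wait until it is searchable before returning
PUT /products/_doc/abc123?refresh=wait_for
{ "name": "Running Shoes", "price": 129.99 }
```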
The Whiteboard and the Notebook
Imagine you're taking notes. You first write on a whiteboard (in-memory buffer): fast but temporary. Every minute, you photograph the whiteboard and print the photo (refresh → new segment). The printed photo is permanent and everyone can read it. But there's a delay between writing on the whiteboard and the photo being available. That delay is the refresh interval. The translog is like a voice recorder running the whole time: if the whiteboard gets erased (crash), you can replay the recording to recover.
Segment Merging
Each refresh creates a new segment. Over time, you accumulate many small segments. Lucene periodically merges smaller segments into larger ones in the background. This reduces the number of files to search and reclaims space from deleted documents (which are only marked as deleted until a merge removes them).
```
Before merge:
  Shard P0: [seg-1: 50MB] [seg-2: 12MB] [seg-3: 8MB] [seg-4: 3MB] [seg-5: 1MB]
  5 segments, some with deleted docs marked

After merge (background process):
  Shard P0: [seg-1: 50MB] [seg-6: 22MB]
  2 segments, deleted docs physically removed

Merge behavior:
  • Automatic: Lucene's TieredMergePolicy handles it
  • Merges small segments into larger ones
  • Physically removes deleted documents (reclaims disk)
  • CPU and I/O intensive: runs in background threads
  • max_num_segments: 1 via _forcemerge (only for read-only indices!)

Key insight: "deleting" a document just marks it in a bitset.
The bytes aren't reclaimed until a merge happens.
```
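To watch merging at work, the _cat/segments API lists every live segment in a shard along with how many deleted docs it still carries (the index name is illustrative):

```
GET /_cat/segments/products?v&h=index,shard,segment,size,docs.count,docs.deleted
```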
Segment Key Facts
- ✓ Segments are immutable: once written, never modified (enables lock-free reads)
- ✓ Each segment has its own inverted index, stored fields, and doc values
- ✓ More segments = slower searches (must check each segment and merge results)
- ✓ Refresh creates a new segment, so frequent refreshes = many small segments
- ✓ Merges are automatic but can be forced with _forcemerge on read-only indices
- ✓ Translog provides durability between flushes; it is replayed on crash recovery
Indexing Internals
Understanding how documents become searchable helps you tune Elasticsearch for write-heavy workloads. The path from API call to searchable document involves several stages, each with tuning knobs.
```
Client → Coordinating Node → Primary Shard → Replica Shards

Detailed flow on the primary shard:
  1. Document arrives via Bulk API or single index request
  2. Ingest pipeline runs (if configured): enrichment, parsing
  3. Document is analyzed: text fields → tokens via analyzer
  4. Written to in-memory buffer (Lucene IndexWriter)
  5. Written to translog (append-only, sequential write → fast)
  6. Acknowledgment sent to client (document is durable but NOT searchable)
  7. [After refresh_interval] Buffer → new Lucene segment (now searchable)
  8. [After flush] Segment fsync'd, translog cleared

Performance levers:
  • refresh_interval: "30s"       → less frequent refresh = fewer segments = faster indexing
  • refresh_interval: "-1"        → disable refresh entirely during bulk load
  • translog.durability: "async"  → don't fsync translog on every request (faster, less safe)
  • number_of_replicas: 0         → no replica writes during bulk load
```
Bulk API: The Right Way to Index
```
POST /_bulk
{"index": {"_index": "products", "_id": "1"}}
{"name": "Running Shoes", "price": 129.99}
{"index": {"_index": "products", "_id": "2"}}
{"name": "Trail Boots", "price": 189.99}
{"index": {"_index": "products", "_id": "3"}}
{"name": "Sandals", "price": 49.99}
```

Why bulk over single requests:

- Single: 1000 documents = 1000 HTTP round trips
- Bulk: 1000 documents = 1 HTTP request
- 10-100x faster for batch operations

Bulk sizing guidelines:

- Start with 5-15 MB per bulk request
- Too small: overhead per request dominates
- Too large: memory pressure, long GC pauses
- Measure and adjust: watch for rejected requests (thread pool full)
- Optimal batch size varies by document size and cluster capacity
Force Merge: For Read-Only Indices
```
# Merge all segments in an index down to 1 segment
POST /logs-2024-01-15/_forcemerge?max_num_segments=1

# When to use:
#   ✓ Time-based indices that are no longer receiving writes (yesterday's logs)
#   ✓ After bulk load + setting index to read-only
#   ✗ NEVER on actively-written indices (competes with indexing for I/O; new writes fragment it again)

# Benefits of force merge on read-only indices:
#   • Fewer segments = faster searches (less merging at query time)
#   • Deleted documents physically removed (reclaims disk)
#   • Reduced file descriptor usage
#   • Better OS page cache utilization
```
💡 Bulk Load Recipe
For maximum indexing throughput during initial data load:

1. Set number_of_replicas: 0 (no replica overhead)
2. Set refresh_interval: "-1" (no refresh overhead)
3. Use the Bulk API with 5-15 MB batches
4. After the load: restore replicas and set refresh_interval back to "1s"
5. Run _forcemerge?max_num_segments=1 if the index is now read-only

This can improve bulk indexing speed by 3-5x.
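Expressed as API calls, the recipe looks roughly like this (the index name and the restored values are illustrative; use the replica count and refresh interval your workload actually needs):

```
# Before the load: drop replicas and disable refresh (both are dynamic settings)
PUT /products/_settings
{ "index": { "number_of_replicas": 0, "refresh_interval": "-1" } }

# ... run the bulk load ...

# After the load: restore replicas and refresh
PUT /products/_settings
{ "index": { "number_of_replicas": 1, "refresh_interval": "1s" } }

# Optional, only if the index is now read-only
POST /products/_forcemerge?max_num_segments=1
```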
Interview Questions
Q: How do you decide the number of primary shards for an index?
A: Estimate total data size and divide by target shard size (20-40 GB). For example, 200 GB of data → 5-10 primary shards. Also consider query parallelism: more shards = more parallel search threads. But avoid over-sharding: each shard has fixed overhead (memory, file descriptors, cluster state). You cannot change primary shard count after creation without reindexing, so slightly over-provisioning is safer. For time-based indices (logs), use ILM rollover based on size rather than guessing upfront.
Q: Explain the difference between green, yellow, and red cluster health.
A: Green: all primary and replica shards are allocated, full redundancy. Yellow: all primaries are allocated but some replicas are not; data is accessible but not fully redundant (one more failure could cause loss). Red: some primary shards are unallocated, so those shards' data is unavailable. Common causes: yellow = not enough nodes for replica placement or disk watermark exceeded. Red = node crash with no replica available, or disk corruption. Always investigate yellow promptly; red is an emergency.
Q: Why are updates expensive in Elasticsearch?
A: Documents are stored in immutable Lucene segments. You cannot modify a segment in place. An 'update' internally: (1) retrieves the current document, (2) marks the old version as deleted in a bitset, (3) indexes the entire new version as a fresh document in a new segment. The old document's disk space isn't reclaimed until a segment merge happens. This means updates have the cost of a delete + a full reindex of the document. Partial updates (_update API) still reindex the full document internally; they just save you a network round trip for the read.
Q: How does Elasticsearch achieve near-real-time search?
A: When a document is indexed, it's written to an in-memory buffer and translog. It's NOT immediately searchable. Every 1 second (default refresh_interval), the buffer is flushed to a new Lucene segment, an operation called 'refresh'. Only after refresh is the document visible to search queries. This 1-second delay is why ES is 'near-real-time' not real-time. You can call _refresh manually for immediate visibility, but frequent manual refreshes hurt performance. The translog ensures durability even before the segment is fsync'd to disk.
Q: What happens when a data node leaves the cluster?
A: The master node detects the departure via its periodic fault-detection checks. It waits for a configurable delay (index.unassigned.node_left.delayed_timeout, default 1 minute) in case the node returns quickly. After the delay, the master promotes replica shards on other nodes to primary status for any primaries that were on the lost node. It then allocates new replicas on remaining nodes to restore the configured replication factor. During this process, cluster health may be yellow (replicas missing) or red (if no replica existed for a lost primary).
Common Mistakes
Over-sharding: too many small shards
Creating indices with 20+ primary shards when the data is only a few GB. Each shard consumes memory for segment metadata, file descriptors, and thread pool slots. A cluster with thousands of tiny shards spends more resources on overhead than actual search.
✓ Target 20-40 GB per shard. A 10 GB index needs 1 primary shard, not 5. Use the _cat/shards API to audit shard sizes. For time-based indices, use ILM rollover policies based on shard size rather than fixed shard counts.
Too many small indices without rollover
Creating a new index per day (logs-2024-01-01, logs-2024-01-02) each with 5 shards. After a year you have 1,825 shards from 365 indices, most holding under 1 GB. Cluster state bloats, master node struggles.
✓ Use Index Lifecycle Management (ILM) with rollover based on size (e.g., roll when the index reaches 30 GB). This creates fewer, larger indices. Combine with aliases so clients always write to 'logs-current' regardless of the underlying index name.
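A sketch of what that can look like with ILM; the policy and template names, the logs-current alias, and the size/age thresholds are illustrative, and max_primary_shard_size requires a reasonably recent Elasticsearch version (older releases use max_size instead):

```
PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_primary_shard_size": "30gb", "max_age": "7d" }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}

# Template applies the policy and the write alias to every new logs-* index
PUT _index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "logs-policy",
      "index.lifecycle.rollover_alias": "logs-current"
    }
  }
}
```

The first index in the series still has to be created manually with the alias marked as its write index; rollover takes over from there.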
Not using index aliases
Clients query index names directly (products-v1). When you need to reindex (mapping change, shard count change), you must update every client to point to products-v2. Downtime is inevitable.
✓ Always use aliases. Clients query 'products' (the alias). When you reindex to products-v2, atomically swap the alias. Zero downtime, zero client changes. Set this up from day one; retrofitting aliases is painful.
Ignoring yellow cluster health
Treating yellow as 'good enough' in production. Yellow means replicas are unassigned: you have zero redundancy for those shards. One more node failure means data loss. Teams often ignore yellow for weeks until a node actually dies.
✓ Investigate yellow immediately. Common fixes: add more data nodes, reduce replica count to match available nodes, or increase disk space (watermark issues). Set up alerts on cluster health status changes. Yellow in production should be treated as a P2 incident.
Running _forcemerge on active indices
Calling _forcemerge on indices that are still receiving writes. This triggers expensive merge operations that compete with indexing, causes high I/O, and the merged segments immediately get fragmented again by new writes.
✓ Only _forcemerge indices that are read-only (no more writes). For time-based indices, forcemerge yesterday's index after rollover. Set the index to read-only first (index.blocks.write: true) to prevent accidental writes during and after the merge.
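A sketch of that sequence for a hypothetical daily index:

```
# 1. Block writes so the merged index stays merged
PUT /logs-2024-01-15/_settings
{ "index.blocks.write": true }

# 2. Merge down to a single segment
POST /logs-2024-01-15/_forcemerge?max_num_segments=1
```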