
Operations & Tuning

Index Lifecycle Management, vector search, performance tuning, snapshots, monitoring, and knowing when Elasticsearch is NOT the right choice.

01

Index Lifecycle Management (ILM)

Time-series data (logs, metrics, events) grows indefinitely. You can't keep everything on fast SSDs forever. ILM automates the lifecycle of indices — moving data through phases based on age or size, optimizing cost and performance at each stage.

ILM Phases
Phase Flow:

  HOT → WARM → COLD → FROZEN → DELETE

  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
  │   HOT   │───▶│  WARM   │───▶│  COLD   │───▶│ FROZEN  │───▶│ DELETE  │
  │         │    │         │    │         │    │         │    │         │
  │ Active  │    │ No new  │    │ Rarely  │    │ Cheapest│    │ Remove  │
  │ writes  │    │ writes  │    │ queried │    │ storage │    │ entirely│
  │ Fast SSD│    │ Warm SSD│    │ HDD     │    │ Snapshot│    │         │
  │ Full    │    │ Shrink  │    │ Freeze  │    │ mount   │    │         │
  │ replicas│    │ replicas│    │ replicas│    │         │    │         │
  └─────────┘    └─────────┘    └─────────┘    └─────────┘    └─────────┘

  Typical timeline for logs:
    HOT:    0-3 days   (actively written, frequently searched)
    WARM:   3-30 days  (read-only, occasionally searched)
    COLD:   30-90 days (rarely accessed, compliance retention)
    FROZEN: 90-365 days (searchable snapshots, near-zero cost)
    DELETE: >365 days  (removed entirely)
ILM Policy Definition
PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "1d"
          },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "3d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 },
          "set_priority": { "priority": 50 },
          "allocate": {
            "require": { "data": "warm" }
          }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "allocate": {
            "require": { "data": "cold" }
          },
          "set_priority": { "priority": 0 }
        }
      },
      "frozen": {
        "min_age": "90d",
        "actions": {
          "searchable_snapshot": {
            "snapshot_repository": "my-s3-repo"
          }
        }
      },
      "delete": {
        "min_age": "365d",
        "actions": { "delete": {} }
      }
    }
  }
}

Rollover & Data Streams

Rollover automatically creates a new index when the current one hits a size or age threshold. Data streams are the modern abstraction — they manage a series of backing indices behind a single name, with automatic rollover built in.

Data Streams
# Data stream = append-only time-series abstraction
# Behind the scenes: a series of backing indices

Data Stream: "logs-nginx"
  ├── .ds-logs-nginx-2024.01.01-000001  (oldest, cold)
  ├── .ds-logs-nginx-2024.01.15-000002  (warm)
  ├── .ds-logs-nginx-2024.01.28-000003  (warm)
  └── .ds-logs-nginx-2024.02.10-000004  (current write index, hot)

# Writes always go to the latest backing index
# Reads span all backing indices transparently
# Rollover creates a new backing index automatically

# Create an index template for the data stream:
PUT _index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "data_stream": {},
  "template": {
    "settings": {
      "index.lifecycle.name": "logs-policy",
      "number_of_shards": 3,
      "number_of_replicas": 1
    },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "message": { "type": "text" },
        "level": { "type": "keyword" },
        "service": { "type": "keyword" }
      }
    }
  }
}

🎯 Data Streams Are the Modern Way

For any time-series data (logs, metrics, traces), use data streams instead of manually managing index aliases and rollover. Data streams enforce append-only semantics, handle rollover automatically, and integrate cleanly with ILM policies.
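With an index template like logs-template above in place, working with the data stream itself is only a couple of calls. A minimal sketch (the name logs-nginx is illustrative; any name matching logs-* works):

Data Stream Lifecycle
# Create the data stream explicitly (indexing a document into a name
# matching the template's pattern would also auto-create it)
PUT /_data_stream/logs-nginx

# Writes target the stream name; ES routes them to the current write index
POST /logs-nginx/_doc
{
  "@timestamp": "2024-02-10T12:00:00Z",
  "message": "GET /health 200",
  "level": "info",
  "service": "nginx"
}

# ILM triggers rollover automatically; you can also force it manually
POST /logs-nginx/_rollover

# Inspect the backing indices behind the stream
GET /_data_stream/logs-nginx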

02

Vector & Semantic Search

Traditional search matches keywords — if the user searches "affordable laptop" but the document says "budget notebook," BM25 won't find it. Vector search encodes meaning into dense embeddings, enabling semantic similarity matching. Elasticsearch supports both and can combine them.

dense_vector Field & kNN Search
# Step 1: Define a dense_vector field in your mapping
PUT /products
{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "description": { "type": "text" },
      "embedding": {
        "type": "dense_vector",
        "dims": 768,
        "index": true,
        "similarity": "cosine"
      }
    }
  }
}

# Step 2: Index documents with embeddings
# (embeddings generated by a model like sentence-transformers)
POST /products/_doc/1
{
  "title": "Budget Notebook Computer",
  "description": "Lightweight laptop for everyday use",
  "embedding": [0.12, -0.34, 0.56, ...]  // 768-dimensional vector
}

# Step 3: kNN query — find semantically similar documents
POST /products/_search
{
  "knn": {
    "field": "embedding",
    "query_vector": [0.11, -0.33, 0.55, ...],  // query embedding
    "k": 10,
    "num_candidates": 100
  }
}

# How HNSW works (the index structure):
#   - Hierarchical Navigable Small World graph
#   - Multi-layer graph: top layers = long-range links, bottom = fine-grained
#   - Search starts at top layer, greedily descends to nearest neighbors
#   - Approximate: trades perfect recall for speed (configurable)
#   - Parameters: m (connections per node), ef_construction (build quality)
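Those HNSW parameters are tunable per field via index_options at mapping time. A sketch of a recall-oriented configuration (the index name products-tuned and the values shown are illustrative; recent 8.x defaults are m: 16 and ef_construction: 100):

HNSW Tuning
PUT /products-tuned
{
  "mappings": {
    "properties": {
      "embedding": {
        "type": "dense_vector",
        "dims": 768,
        "index": true,
        "similarity": "cosine",
        "index_options": {
          "type": "hnsw",
          "m": 32,                // more links per node: better recall, larger graph
          "ef_construction": 200  // more candidates at build time: better graph, slower indexing
        }
      }
    }
  }
}

# At query time, num_candidates plays the equivalent role for search:
# raising it improves recall at the cost of latency.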

Hybrid Search — BM25 + Vector with RRF

Pure keyword search misses semantic matches. Pure vector search misses exact keyword matches (e.g., product SKUs, error codes). Hybrid search combines both using Reciprocal Rank Fusion (RRF) to merge ranked results from each approach.

Hybrid Search with RRF
POST /products/_search
{
  "query": {
    "match": {
      "description": "affordable lightweight laptop"
    }
  },
  "knn": {
    "field": "embedding",
    "query_vector": [0.11, -0.33, 0.55, ...],
    "k": 10,
    "num_candidates": 100
  },
  "rank": {
    "rrf": {
      "window_size": 100,
      "rank_constant": 60
    }
  }
}

# How RRF works:
#   For each document, compute: score = Σ 1/(rank_constant + rank_i)
#   where rank_i is the document's rank in each result set
#
#   Example:
#     Doc A: rank 1 in BM25, rank 5 in kNN
#     Doc B: rank 3 in BM25, rank 1 in kNN
#
#     Score A = 1/(60+1) + 1/(60+5) = 0.0164 + 0.0154 = 0.0318
#     Score B = 1/(60+3) + 1/(60+1) = 0.0159 + 0.0164 = 0.0323
#     Doc B wins (strong in both)

# When hybrid beats pure approaches:
#   - "error code NX-4012 memory leak" → keyword matches the code, vector
#     matches the concept
#   - "cheap macbook alternative" → vector understands "cheap" = "affordable",
#     keyword catches "macbook" exactly
🧠

Two Librarians Working Together

Keyword search is like a librarian who only matches exact words in the catalog. Vector search is like a librarian who understands what you mean even if you use different words. Hybrid search asks both librarians, then picks books that both recommend highly — giving you the best of exact matching and semantic understanding.

💡 Embedding Model Choice Matters

The quality of vector search depends entirely on the embedding model. General-purpose models (sentence-transformers, OpenAI embeddings) work for broad content. Domain-specific fine-tuned models dramatically outperform them for specialized content (medical, legal, e-commerce). Always evaluate retrieval quality with your actual data.

03

Index Design Patterns

How you structure indices determines performance, scalability, and operational complexity. Unlike relational databases, Elasticsearch has no joins — your index design must account for query patterns upfront.

Index Design Strategies

  • βœ…Time-based indices (logs-2024.01.15): natural ILM alignment, easy to delete old data, queries target specific time ranges
  • βœ…Index per tenant: strong isolation, independent scaling, simple access control β€” but operational overhead grows with tenant count
  • βœ…Single index with tenant field: simpler ops, use filtered aliases for tenant isolation β€” but noisy neighbor risk
  • βœ…Denormalize aggressively: ES has no joins β€” embed related data into documents at index time (e.g., order contains full product info)
  • βœ…Nested objects: use when array elements must be queried independently (e.g., 'find orders where item.color=red AND item.size=large' on the same item)
  • βœ…Flattened fields: use for high-cardinality dynamic keys (e.g., user-defined labels) to prevent mapping explosion
Zero-Downtime Reindex with Aliases
# Problem: You need to change a field mapping, but mappings are immutable.
# Solution: Create a new index, reindex data, swap the alias atomically.

# Step 1: Your app always queries the alias, never the real index name
#   App → "products" (alias) → products-v1 (real index)

# Step 2: Create new index with updated mapping
PUT /products-v2
{
  "mappings": { ... updated mapping ... }
}

# Step 3: Reindex data from old to new
POST /_reindex
{
  "source": { "index": "products-v1" },
  "dest": { "index": "products-v2" }
}

# Step 4: Atomic alias swap (zero downtime)
POST /_aliases
{
  "actions": [
    { "remove": { "index": "products-v1", "alias": "products" } },
    { "add": { "index": "products-v2", "alias": "products" } }
  ]
}

# Step 5: Delete old index when confident
DELETE /products-v1

# The app never knew anything changed — it always queries "products"

🎯 Always Use Aliases

Never let applications query index names directly. Always use aliases. This gives you the freedom to reindex, split, shrink, or restructure indices without any application changes. It's the ES equivalent of a database view.
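Aliases can also carry a filter, which is how the single-index-with-tenant-field pattern from the list above gets its isolation. A sketch (the events index and tenant value are illustrative):

Filtered Alias
POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "events",
        "alias": "events-acme",
        "filter": { "term": { "tenant_id": "acme" } }
      }
    }
  ]
}

# Searches against "events-acme" only see Acme's documents.
# This is a convenience, not a hard boundary; for enforced isolation,
# use document-level security (see the Security section).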

04

Performance Tuning

Elasticsearch performance tuning splits into two domains: indexing throughput (how fast you can write) and search latency (how fast you can read). They often trade off against each other.

Indexing Performance

| Technique | What It Does | Impact |
|---|---|---|
| Bulk API | Batch multiple index/update/delete operations in one request | 10-100x faster than individual requests — always use bulk for batch loads |
| refresh_interval: 30s | Increase from the default 1s during bulk loads | Fewer segment creates = faster indexing (data visible with delay) |
| number_of_replicas: 0 | Disable replicas during the initial bulk load | Halves write work — re-enable after the load completes |
| Disable swapping | bootstrap.memory_lock: true | Prevents JVM heap from being swapped to disk (catastrophic for latency) |
| Translog flush threshold | Increase index.translog.flush_threshold_size | Fewer fsyncs during heavy indexing — slight durability trade-off |
| Mapping: index false | Set index: false on fields you never search | Saves CPU and disk — the field is stored but not indexed |
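A sketch combining the first three rows of the table into a bulk-load recipe (the products index and documents are illustrative):

Bulk Load Recipe
# Before the load: relax refresh, drop replicas
PUT /products/_settings
{ "index": { "refresh_interval": "30s", "number_of_replicas": 0 } }

# Bulk request: alternating action and source lines (NDJSON)
POST /_bulk
{ "index": { "_index": "products", "_id": "1" } }
{ "title": "Budget Notebook Computer" }
{ "index": { "_index": "products", "_id": "2" } }
{ "title": "Mechanical Keyboard" }

# After the load: restore defaults so replicas rebuild and new data
# becomes searchable at the normal refresh cadence
PUT /products/_settings
{ "index": { "refresh_interval": "1s", "number_of_replicas": 1 } }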

Search Performance

| Technique | What It Does | When to Use |
|---|---|---|
| Filter context | Filters are cached as bitsets, not scored | Always use filter for yes/no conditions (status, date ranges, tenant ID); sketch below |
| Avoid leading wildcards | query: '*phone' forces a scan of all terms | Use edge n-grams or a reversed field instead of leading wildcard queries |
| search_after | Cursor-based pagination using sort values | Deep pagination (page 1000+) — from/size degrades linearly |
| Profile API | _search with profile: true shows time per query phase | Diagnosing slow queries — identifies which clause is expensive |
| Routing | Direct queries to specific shards | Multi-tenant: route by tenant_id so queries hit 1 shard instead of all |
| Shard sizing | Target 10-50 GB per shard | Too many small shards = overhead; too few large shards = slow queries |
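A sketch of filter context from the first row of the table (the orders index and fields are illustrative). Only the match clause is scored; the filter clauses are cached as bitsets and reused across queries:

Filter Context
POST /orders/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "notes": "delayed shipment" } }
      ],
      "filter": [
        { "term": { "status": "open" } },
        { "range": { "created_at": { "gte": "now-7d" } } }
      ]
    }
  }
}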
search_after — Deep Pagination
# Problem: from: 10000, size: 10 requires coordinating node to
# fetch 10010 docs from EACH shard, then discard 10000. Extremely wasteful.

# Solution: search_after uses the sort values of the last result as cursor

# First page:
POST /logs/_search
{
  "size": 100,
  "sort": [
    { "@timestamp": "desc" },
    { "_id": "asc" }
  ],
  "query": { "match": { "level": "error" } }
}

# Next page β€” pass sort values from last hit:
POST /logs/_search
{
  "size": 100,
  "sort": [
    { "@timestamp": "desc" },
    { "_id": "asc" }
  ],
  "query": { "match": { "level": "error" } },
  "search_after": [1707523200000, "doc_abc123"]
}

# Each shard only returns docs AFTER the cursor — no wasted work
# Consistent performance regardless of page depth
05

Memory & Resource Management

Elasticsearch runs on the JVM but relies heavily on the OS page cache for Lucene segment reads. Getting the memory split wrong is one of the most common causes of cluster instability.

Memory Architecture
Total Server RAM: 64 GB (example)

  ┌──────────────────────────────────────────────────────────┐
  │                     64 GB Total RAM                      │
  ├──────────────────────┬───────────────────────────────────┤
  │   JVM Heap: 31 GB    │   OS Page Cache: ~31 GB           │
  │   (ES internals)     │   (Lucene segment files)          │
  │                      │                                   │
  │   • Field data       │   • Segment reads (search)        │
  │   • Node query       │   • Merges                        │
  │     cache            │   • Stored fields                 │
  │   • Indexing buffer  │   • Doc values                    │
  │   • Cluster state    │   • Term dictionaries             │
  │   • Aggregations     │                                   │
  └──────────────────────┴───────────────────────────────────┘

  Rules:
    1. JVM heap ≤ 50% of RAM (leave the rest for page cache)
    2. JVM heap ≤ 31 GB (compressed oops threshold)
    3. Set Xms and Xmx equal (avoid resize pauses)
    4. Prevent swapping: disable swap at the OS level, or lock the heap
       with bootstrap.memory_lock

💡 The 32 GB Compressed Oops Boundary

The JVM uses "compressed ordinary object pointers" (oops) when the heap is below ~32 GB. This lets 4-byte pointers address 32 GB of memory. Above 32 GB, pointers expand to 8 bytes — effectively wasting ~30% of heap on pointer overhead. A 31 GB heap often outperforms a 40 GB heap. Never set heap between 32 and 40 GB.

| Resource | Purpose | Tuning Guidance |
|---|---|---|
| JVM Heap | ES internal data structures, caches, aggregation buffers | Set to min(50% RAM, 31 GB). Equal Xms/Xmx. G1GC for heaps > 8 GB. (sketch below) |
| OS Page Cache | Caches Lucene segment files for fast reads | Leave at least 50% of RAM for this. More = faster searches. |
| Field Data Cache | In-memory uninverted index for text field aggregations | AVOID — use keyword fields or doc_values instead. Set an indices.fielddata.cache.size limit. |
| Node Query Cache | Caches filter clause results as bitsets | Default 10% of heap. Effective for repeated filters (tenant_id, status). |
| Indexing Buffer | Buffers new documents before creating segments | Default 10% of heap. Increase for heavy indexing workloads. |
| Circuit Breakers | Prevent OOM by rejecting requests that would exceed limits | Parent breaker: 95% of heap. Don't disable — they protect cluster stability. |
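A sketch of the corresponding node configuration for the 64 GB example above (paths follow the standard Elasticsearch layout; values are illustrative):

Heap & Memory Lock Configuration
# config/jvm.options.d/heap.options
# Equal Xms/Xmx, below the compressed oops threshold
-Xms31g
-Xmx31g

# config/elasticsearch.yml
# Lock the heap in RAM so it can never be swapped out
bootstrap.memory_lock: true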
06

Monitoring & Reliability

Elasticsearch clusters degrade silently before they fail loudly. Monitoring the right metrics gives you early warning. Snapshots give you recovery when things go wrong.

| Metric | What It Means | Alert Threshold | Action |
|---|---|---|---|
| Cluster Health | green/yellow/red — shard allocation status | Yellow > 5 min, Red immediately | Yellow = unassigned replicas. Red = unassigned primaries (data loss risk). |
| JVM Heap Usage | Percentage of heap in use | > 85% sustained | Frequent GC, risk of OOM. Reduce caches, add nodes, or increase heap (up to 31 GB). |
| GC Time | Time spent in garbage collection | > 500ms per collection, or > 5% of time in GC | Long GC pauses cause node timeouts. Check for field data, large aggregations. |
| Search Latency (p99) | 99th percentile search response time | Depends on SLA (e.g., > 500ms) | Check the slow log, profile queries, verify shard sizing. |
| Indexing Rate | Documents indexed per second | Sudden drop or spike | Drop = upstream issue or rejections. Spike = bulk load affecting search. |
| Thread Pool Rejections | Requests rejected due to full queue | > 0 (search or write rejections) | Cluster is overloaded. Scale out or reduce request rate. |
| Disk Watermarks | Low (85%), High (90%), Flood (95%) | Approaching the low watermark | At low: no new shards allocated to the node. At high: shards relocated away. At flood: indices forced read-only. |
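A few built-in endpoints cover most of the metrics above; a sketch:

Monitoring Endpoints
# Cluster health: status, unassigned shards, pending tasks
GET /_cluster/health

# Per-node JVM heap usage and GC counts/times
GET /_nodes/stats/jvm

# Thread pool queues and rejections for search and write
GET /_cat/thread_pool/search,write?v&h=node_name,name,active,queue,rejected

# Disk usage per node (compare against the watermarks)
GET /_cat/allocation?v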

Snapshot & Restore

Snapshot Configuration
# Register a snapshot repository (S3 example)
PUT /_snapshot/my-s3-repo
{
  "type": "s3",
  "settings": {
    "bucket": "my-es-backups",
    "region": "us-east-1",
    "base_path": "elasticsearch/snapshots"
  }
}

# Create a snapshot (incremental — only new/changed segments)
PUT /_snapshot/my-s3-repo/snapshot-2024-02-10
{
  "indices": "logs-*,products",
  "ignore_unavailable": true,
  "include_global_state": false
}

# Automate with SLM (Snapshot Lifecycle Management)
PUT /_slm/policy/nightly-snapshots
{
  "schedule": "0 30 2 * * ?",
  "name": "<nightly-snap-{now/d}>",
  "repository": "my-s3-repo",
  "config": {
    "indices": ["*"],
    "ignore_unavailable": true
  },
  "retention": {
    "expire_after": "30d",
    "min_count": 5,
    "max_count": 50
  }
}

# Restore a specific index from snapshot
POST /_snapshot/my-s3-repo/snapshot-2024-02-10/_restore
{
  "indices": "products",
  "rename_pattern": "(.+)",
  "rename_replacement": "restored-$1"
}

Slow Log

Slow Log Configuration
# Enable slow log for an index
PUT /products/_settings
{
  "index.search.slowlog.threshold.query.warn": "5s",
  "index.search.slowlog.threshold.query.info": "2s",
  "index.search.slowlog.threshold.fetch.warn": "1s",
  "index.indexing.slowlog.threshold.index.warn": "10s",
  "index.indexing.slowlog.threshold.index.info": "5s"
}

# Slow log output includes:
#   - The full query that was slow
#   - Which shard it ran on
#   - Total time breakdown (query phase, fetch phase)
#   - Number of hits

# Use slow log to:
#   1. Identify expensive queries in production
#   2. Find queries that need optimization (wildcards, deep aggs)
#   3. Detect shard hotspots (one shard consistently slow)

🎯 Shard Allocation Awareness

Use shard allocation awareness to spread replicas across failure domains (availability zones, racks). This ensures that losing one zone doesn't lose both primary and replica of the same shard. Configure with cluster.routing.allocation.awareness.attributes.
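A sketch of that configuration (the attribute name zone and its values are illustrative):

Allocation Awareness
# elasticsearch.yml on each node: tag the node with its failure domain
node.attr.zone: us-east-1a

# Cluster-wide: tell the allocator to spread shard copies across zones
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "zone"
  }
}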

07

Security

Elasticsearch clusters exposed without security have been the source of numerous data breaches. Since version 8.0, security is enabled by default — but understanding the layers is essential.

| Layer | Mechanism | Scope |
|---|---|---|
| Encryption in transit | TLS for HTTP (client→node) and transport (node→node) | Prevents eavesdropping and MITM attacks on cluster traffic |
| Authentication | Native realm, LDAP, Active Directory, SAML, OIDC, PKI (mTLS) | Verifies the identity of users and services connecting to the cluster |
| Authorization (RBAC) | Roles with index, cluster, and field-level privileges | Controls what authenticated users can do |
| Field-level security | Roles can restrict which fields a user can see | e.g., the support team can see order status but not payment details |
| Document-level security | Roles include a query filter — users only see matching docs | e.g., a tenant_id filter ensures each tenant sees only their data |
| API Keys | Scoped, time-limited keys for service-to-service auth | Preferred over username/password for applications |
RBAC Example
# Create a role with index-level, field-level, and document-level security
POST /_security/role/support_agent
{
  "indices": [
    {
      "names": ["orders-*"],
      "privileges": ["read"],
      "field_security": {
        // ungranted fields (payment_card, billing_address) are invisible;
        // note: "except" patterns must be a subset of the "grant" patterns
        "grant": ["order_id", "status", "customer_name", "created_at"]
      },
      "query": {
        "term": { "region": "us-east" }
      }
    }
  ]
}

# Create an API key for a microservice
POST /_security/api_key
{
  "name": "order-service-key",
  "expiration": "30d",
  "role_descriptors": {
    "order_writer": {
      "indices": [
        {
          "names": ["orders-*"],
          "privileges": ["write", "create_index"]
        }
      ]
    }
  }
}
08

Elasticsearch vs Alternatives

Elasticsearch is powerful but not always the right tool. Understanding when alternatives are better prevents over-engineering and reduces operational burden.

| Comparison | ES Strength | Alternative Strength | Choose Alternative When |
|---|---|---|---|
| ES vs PostgreSQL FTS | Distributed, custom analyzers, relevance tuning, fuzzy matching, aggregations | No extra infra, ACID transactions, simpler ops, good enough for basic search | < 1M docs, simple search needs, already using Postgres, can't justify another system |
| ES vs Solr | Better REST API, easier clustering, faster innovation, richer ecosystem (ELK) | Mature, battle-tested, better for static collections, strong XML/faceting | Existing Solr investment, static document collections, Hadoop integration needed |
| ES vs Pinecone/Weaviate | Hybrid search (keyword + vector), full-text capabilities, existing ecosystem | Purpose-built for vectors, simpler API, managed scaling, better recall at scale | Pure semantic search, no keyword needs, want a managed service, billions of vectors |
| ES vs ClickHouse | Full-text search, fuzzy matching, complex query DSL | 10-100x faster for analytical queries, columnar storage, SQL interface | Analytics/OLAP workload, aggregations over structured data, no full-text needs |
| ES vs Loki (logs) | Rich query language, full-text search across logs, aggregations | 10x cheaper storage, label-based indexing, native Grafana integration, simpler ops | Cost-sensitive log storage, label-based filtering is sufficient, Grafana stack |

When ES Is Overkill

Skip Elasticsearch When

  • ❌Simple LIKE queries on < 1M rows β€” PostgreSQL full-text search or trigram index handles this fine
  • ❌Pure analytics/dashboards on structured data β€” ClickHouse or BigQuery are 10-100x faster and cheaper
  • ❌You only need log grep with labels β€” Loki + Grafana costs a fraction of ELK
  • ❌Pure vector/embedding search β€” Pinecone, Weaviate, or pgvector are simpler and purpose-built
  • ❌You need ACID transactions β€” ES is eventually consistent, not a primary database
  • ❌Team lacks ES operational expertise β€” the learning curve and ops burden are significant

Elasticsearch Is Ideal When

  • βœ…Full-text search with relevance ranking, fuzzy matching, synonyms, and custom analyzers
  • βœ…Hybrid search combining keyword (BM25) and vector (kNN) approaches
  • βœ…Real-time log analytics with complex queries across terabytes of data
  • βœ…Autocomplete/typeahead with edge n-grams at scale
  • βœ…Geo-spatial search combined with text and filters
  • βœ…Multi-tenant search with per-tenant relevance tuning
09

Common Mistakes

🗄️

Using Elasticsearch as a primary database

Treating ES as the source of truth — writing data only to ES, relying on it for transactions and consistency. When a node fails or a reindex is needed, data is lost or inconsistent.

✅ ES is a secondary index, not a primary store. Write to your primary database (PostgreSQL, DynamoDB) first, then sync to ES. The primary DB is the source of truth. ES can always be rebuilt from it.

📅

Not setting up ILM for time-series data

Indexing months of logs into a single growing index. The index becomes enormous, searches slow down, and you can't delete old data without reindexing everything.

✅ Use data streams with ILM policies from day one. Define rollover conditions (50 GB or 1 day), warm/cold/frozen phases, and a delete phase. This keeps hot indices small and fast while aging data to cheaper storage automatically.

💾

JVM heap greater than 32 GB

Setting -Xmx to 48 GB or 64 GB thinking more heap = better performance. Above ~32 GB, the JVM loses compressed oops — pointers double in size, wasting 30%+ of heap on overhead. A 48 GB heap may perform worse than 31 GB.

✅ Never exceed 31 GB heap (stay below the compressed oops threshold). If you need more capacity, add nodes rather than increasing heap. The remaining RAM serves the OS page cache, which Lucene depends on for fast segment reads.

📸

No snapshot/backup policy

Running a production cluster without automated snapshots. A bad mapping change, accidental delete, or cluster failure means permanent data loss with no recovery path.

✅ Configure SLM (Snapshot Lifecycle Management) to take daily incremental snapshots to S3/GCS. Snapshots are incremental — only new segments are uploaded. Test restore procedures regularly. A snapshot that's never been tested is not a backup.

🐌

Ignoring the slow log

Not configuring slow log thresholds. Expensive queries silently degrade cluster performance for all users. By the time you notice, the cluster is already under pressure.

✅ Enable slow log with reasonable thresholds (warn at 5s, info at 2s). Review the slow log weekly. Common culprits: leading wildcards, deep aggregations, script queries, and unbounded from/size pagination. Use the Profile API to diagnose specific slow queries (sketch below).
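A sketch of the Profile API mentioned above (the index and query are illustrative):

Profile API
POST /products/_search
{
  "profile": true,
  "query": { "match": { "description": "lightweight laptop" } }
}

# The "profile" section of the response shows, per shard:
#   - each query component and its time in nanoseconds
#   - rewrite time (e.g., wildcard expansion)
#   - collector time (gathering and scoring matching docs)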