Operations & Tuning
Index Lifecycle Management, vector search, performance tuning, snapshots, monitoring, and knowing when Elasticsearch is NOT the right choice.
Index Lifecycle Management (ILM)
Time-series data (logs, metrics, events) grows indefinitely. You can't keep everything on fast SSDs forever. ILM automates the lifecycle of indices, moving data through phases based on age or size and optimizing cost and performance at each stage.
Phase Flow: HOT → WARM → COLD → FROZEN → DELETE

```
┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
│   HOT    │──▶│   WARM   │──▶│   COLD   │──▶│  FROZEN  │──▶│  DELETE  │
│          │   │          │   │          │   │          │   │          │
│ Active   │   │ No new   │   │ Rarely   │   │ Cheapest │   │ Remove   │
│ writes   │   │ writes   │   │ queried  │   │ storage  │   │ entirely │
│ Fast SSD │   │ Warm SSD │   │ HDD      │   │ Snapshot │   │          │
│ Full     │   │ Shrink   │   │ Freeze   │   │ mount    │   │          │
│ replicas │   │ replicas │   │ replicas │   │          │   │          │
└──────────┘   └──────────┘   └──────────┘   └──────────┘   └──────────┘

Typical timeline for logs:
  HOT:    0-3 days      (actively written, frequently searched)
  WARM:   3-30 days     (read-only, occasionally searched)
  COLD:   30-90 days    (rarely accessed, compliance retention)
  FROZEN: 90-365 days   (searchable snapshots, near-zero cost)
  DELETE: >365 days     (removed entirely)
```
```
PUT _ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "1d"
          },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "3d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 },
          "set_priority": { "priority": 50 },
          "allocate": { "require": { "data": "warm" } }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "allocate": { "require": { "data": "cold" } },
          "set_priority": { "priority": 0 }
        }
      },
      "frozen": {
        "min_age": "90d",
        "actions": {
          "searchable_snapshot": { "snapshot_repository": "my-s3-repo" }
        }
      },
      "delete": {
        "min_age": "365d",
        "actions": { "delete": {} }
      }
    }
  }
}
```
Rollover & Data Streams
Rollover automatically creates a new index when the current one hits a size or age threshold. Data streams are the modern abstraction: they manage a series of backing indices behind a single name, with automatic rollover built in.
```
# Data stream = append-only time-series abstraction
# Behind the scenes: a series of backing indices

Data Stream: "logs-nginx"
├── .ds-logs-nginx-2024.01.01-000001  (oldest, cold)
├── .ds-logs-nginx-2024.01.15-000002  (warm)
├── .ds-logs-nginx-2024.01.28-000003  (warm)
└── .ds-logs-nginx-2024.02.10-000004  (current write index, hot)

# Writes always go to the latest backing index
# Reads span all backing indices transparently
# Rollover creates a new backing index automatically

# Create an index template for the data stream:
PUT _index_template/logs-template
{
  "index_patterns": ["logs-*"],
  "data_stream": {},
  "template": {
    "settings": {
      "index.lifecycle.name": "logs-policy",
      "number_of_shards": 3,
      "number_of_replicas": 1
    },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "message": { "type": "text" },
        "level": { "type": "keyword" },
        "service": { "type": "keyword" }
      }
    }
  }
}
```
🎯 Data Streams Are the Modern Way
For any time-series data (logs, metrics, traces), use data streams instead of manually managing index aliases and rollover. Data streams enforce append-only semantics, handle rollover automatically, and integrate cleanly with ILM policies.
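With the template above in place, the data stream is created implicitly on the first write. A quick sketch (log fields are illustrative):

```
# First document auto-creates the data stream "logs-nginx"
# (@timestamp is required for data streams)
POST /logs-nginx/_doc
{
  "@timestamp": "2024-02-10T12:00:00Z",
  "message": "GET /index.html 200",
  "level": "info",
  "service": "nginx"
}

# Force a rollover manually (ILM normally triggers this for you)
POST /logs-nginx/_rollover
```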
Vector & Semantic Search
Traditional search matches keywords: if the user searches "affordable laptop" but the document says "budget notebook," BM25 won't find it. Vector search encodes meaning into dense embeddings, enabling semantic similarity matching. Elasticsearch supports both and can combine them.
```
# Step 1: Define a dense_vector field in your mapping
PUT /products
{
  "mappings": {
    "properties": {
      "title": { "type": "text" },
      "description": { "type": "text" },
      "embedding": {
        "type": "dense_vector",
        "dims": 768,
        "index": true,
        "similarity": "cosine"
      }
    }
  }
}

# Step 2: Index documents with embeddings
# (embeddings generated by a model like sentence-transformers)
POST /products/_doc/1
{
  "title": "Budget Notebook Computer",
  "description": "Lightweight laptop for everyday use",
  "embedding": [0.12, -0.34, 0.56, ...]  // 768-dimensional vector
}

# Step 3: kNN query, to find semantically similar documents
POST /products/_search
{
  "knn": {
    "field": "embedding",
    "query_vector": [0.11, -0.33, 0.55, ...],  // query embedding
    "k": 10,
    "num_candidates": 100
  }
}

# How HNSW works (the index structure):
# - Hierarchical Navigable Small World graph
# - Multi-layer graph: top layers = long-range links, bottom = fine-grained
# - Search starts at the top layer, greedily descends to nearest neighbors
# - Approximate: trades perfect recall for speed (configurable)
# - Parameters: m (connections per node), ef_construction (build quality)
```
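Those HNSW parameters are set per field via index_options in the mapping. A sketch with the documented defaults spelled out explicitly (index name is illustrative); raising m and ef_construction improves recall at the cost of memory and index-build time, while num_candidates at query time trades latency for recall:

```
PUT /products-tuned
{
  "mappings": {
    "properties": {
      "embedding": {
        "type": "dense_vector",
        "dims": 768,
        "index": true,
        "similarity": "cosine",
        "index_options": {
          "type": "hnsw",
          "m": 16,                // connections per node (default 16)
          "ef_construction": 100  // candidates tracked during graph build (default 100)
        }
      }
    }
  }
}
```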
Hybrid Search: BM25 + Vector with RRF
Pure keyword search misses semantic matches. Pure vector search misses exact keyword matches (e.g., product SKUs, error codes). Hybrid search combines both using Reciprocal Rank Fusion (RRF) to merge ranked results from each approach.
```
POST /products/_search
{
  "query": {
    "match": { "description": "affordable lightweight laptop" }
  },
  "knn": {
    "field": "embedding",
    "query_vector": [0.11, -0.33, 0.55, ...],
    "k": 10,
    "num_candidates": 100
  },
  "rank": {
    "rrf": {
      "window_size": 100,
      "rank_constant": 60
    }
  }
}

# How RRF works:
# For each document, compute: score = Σ 1/(rank_constant + rank_i)
# where rank_i is the document's rank in each result set
#
# Example:
# Doc A: rank 1 in BM25, rank 5 in kNN
# Doc B: rank 3 in BM25, rank 1 in kNN
#
# Score A = 1/(60+1) + 1/(60+5) = 0.0164 + 0.0154 = 0.0318
# Score B = 1/(60+3) + 1/(60+1) = 0.0159 + 0.0164 = 0.0323
# Doc B wins (strong in both)

# When hybrid beats pure approaches:
# - "error code NX-4012 memory leak": keyword matches the code, vector
#   matches the concept
# - "cheap macbook alternative": vector understands "cheap" = "affordable",
#   keyword catches "macbook" exactly
```
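The fusion formula is simple enough to sanity-check offline. A minimal Python sketch (doc IDs are illustrative) that reproduces the worked example above:

```
def rrf_fuse(rankings, rank_constant=60):
    # rankings: list of ranked doc-ID lists, best first
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    # Highest fused score first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

bm25 = ["A", "X", "B", "Y"]        # A at rank 1, B at rank 3
knn = ["B", "Z", "W", "Q", "A"]    # B at rank 1, A at rank 5
print(rrf_fuse([bm25, knn]))       # B (0.0323) edges out A (0.0318)
```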
Two Librarians Working Together
Keyword search is like a librarian who only matches exact words in the catalog. Vector search is like a librarian who understands what you mean even if you use different words. Hybrid search asks both librarians, then picks books that both recommend highly β giving you the best of exact matching and semantic understanding.
💡 Embedding Model Choice Matters
The quality of vector search depends entirely on the embedding model. General-purpose models (sentence-transformers, OpenAI embeddings) work for broad content. Domain-specific fine-tuned models dramatically outperform them for specialized content (medical, legal, e-commerce). Always evaluate retrieval quality with your actual data.
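A minimal indexing sketch using the sentence-transformers and elasticsearch Python clients (index name, model choice, and connection URL are all illustrative). One trap worth flagging: all-MiniLM-L6-v2 emits 384-dimensional vectors, so the dims in your mapping must match whichever model you actually deploy:

```
from elasticsearch import Elasticsearch
from sentence_transformers import SentenceTransformer

# Model choice drives retrieval quality; this is a small general-purpose
# model that outputs 384-dimensional vectors
model = SentenceTransformer("all-MiniLM-L6-v2")
es = Elasticsearch("http://localhost:9200")

docs = [
    {"title": "Budget Notebook Computer",
     "description": "Lightweight laptop for everyday use"},
]
for i, doc in enumerate(docs):
    # Encode the text and store the vector alongside the document
    doc["embedding"] = model.encode(doc["description"]).tolist()
    es.index(index="products", id=str(i), document=doc)
```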
Index Design Patterns
How you structure indices determines performance, scalability, and operational complexity. Unlike relational databases, Elasticsearch has no joins, so your index design must account for query patterns upfront.
Index Design Strategies
- ✓ Time-based indices (logs-2024.01.15): natural ILM alignment, easy to delete old data, queries target specific time ranges
- ✓ Index per tenant: strong isolation, independent scaling, simple access control; but operational overhead grows with tenant count
- ✓ Single index with tenant field: simpler ops, use filtered aliases for tenant isolation (see the sketch after this list); but noisy neighbor risk
- ✓ Denormalize aggressively: ES has no joins; embed related data into documents at index time (e.g., an order contains full product info)
- ✓ Nested objects: use when array elements must be queried independently (e.g., 'find orders where item.color=red AND item.size=large' on the same item)
- ✓ Flattened fields: use for high-cardinality dynamic keys (e.g., user-defined labels) to prevent mapping explosion
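For the single-index multi-tenant strategy, a filtered-alias sketch (tenant and index names are illustrative):

```
POST /_aliases
{
  "actions": [
    {
      "add": {
        "index": "products",
        "alias": "products-tenant-acme",
        "filter": { "term": { "tenant_id": "acme" } }
      }
    }
  ]
}

# The app for tenant "acme" queries products-tenant-acme and can only
# ever see documents matching the alias filter
```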
```
# Problem: You need to change a field mapping, but mappings are immutable.
# Solution: Create a new index, reindex data, swap the alias atomically.

# Step 1: Your app always queries the alias, never the real index name
#   App → "products" (alias) → products-v1 (real index)

# Step 2: Create a new index with the updated mapping
PUT /products-v2
{
  "mappings": { ... updated mapping ... }
}

# Step 3: Reindex data from old to new
POST /_reindex
{
  "source": { "index": "products-v1" },
  "dest": { "index": "products-v2" }
}

# Step 4: Atomic alias swap (zero downtime)
POST /_aliases
{
  "actions": [
    { "remove": { "index": "products-v1", "alias": "products" } },
    { "add": { "index": "products-v2", "alias": "products" } }
  ]
}

# Step 5: Delete the old index when confident
DELETE /products-v1

# The app never knew anything changed: it always queries "products"
```
🎯 Always Use Aliases
Never let applications query index names directly. Always use aliases. This gives you the freedom to reindex, split, shrink, or restructure indices without any application changes. It's the ES equivalent of a database view.
Performance Tuning
Elasticsearch performance tuning splits into two domains: indexing throughput (how fast you can write) and search latency (how fast you can read). They often trade off against each other.
Indexing Performance
| Technique | What It Does | Impact |
|---|---|---|
| Bulk API | Batch multiple index/update/delete operations in one request | 10-100x faster than individual requests; always use bulk for batch loads |
| refresh_interval: 30s | Increase from the default 1s during bulk loads | Fewer refreshes = fewer small segments = faster indexing (data visible with a delay) |
| number_of_replicas: 0 | Disable replicas during the initial bulk load | Halves write work; re-enable after the load completes |
| bootstrap.memory_lock: true | Locks the JVM heap in RAM so the OS cannot swap it | Prevents heap pages from being swapped to disk (catastrophic for latency) |
| Translog flush threshold | Increase index.translog.flush_threshold_size | Fewer fsyncs during heavy indexing; slight durability trade-off |
| Mapping: index: false | Set index: false on fields you never search | Saves CPU and disk; the field is stored but not indexed |
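Putting the first three rows together, a typical bulk-load sequence might look like this (index name and documents are illustrative):

```
# Before the load: relax refresh, drop replicas
PUT /products/_settings
{ "index": { "refresh_interval": "30s", "number_of_replicas": 0 } }

# Load with the Bulk API (one action line + one source line per document)
POST /_bulk
{ "index": { "_index": "products", "_id": "1" } }
{ "title": "Budget Notebook Computer" }
{ "index": { "_index": "products", "_id": "2" } }
{ "title": "Gaming Laptop" }

# After the load: restore defaults and make the data searchable now
PUT /products/_settings
{ "index": { "refresh_interval": "1s", "number_of_replicas": 1 } }
POST /products/_refresh
```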
Search Performance
| Technique | What It Does | When to Use |
|---|---|---|
| Filter context | Filters are cached as bitsets, not scored | Always use filter for yes/no conditions (status, date ranges, tenant ID) |
| Avoid leading wildcards | query: '*phone' forces scan of all terms | Use edge n-grams or reverse field instead of leading wildcard queries |
| search_after | Cursor-based pagination using sort values | Deep pagination (page 1000+) β from/size degrades linearly |
| Profile API | _search with profile: true shows time per query phase | Diagnosing slow queries β identifies which clause is expensive |
| Routing | Direct queries to specific shards | Multi-tenant: route by tenant_id so queries hit 1 shard instead of all (see the routing sketch below) |
| Shard sizing | Target 10-50 GB per shard | Too many small shards = overhead; too few large shards = slow queries |
```
# Problem: from: 10000, size: 10 requires the coordinating node to
# fetch 10010 docs from EACH shard, then discard 10000. Extremely wasteful.
# Solution: search_after uses the sort values of the last result as a cursor

# First page:
POST /logs/_search
{
  "size": 100,
  "sort": [
    { "@timestamp": "desc" },
    { "_id": "asc" }
  ],
  "query": { "match": { "level": "error" } }
}

# Next page: pass the sort values from the last hit:
POST /logs/_search
{
  "size": 100,
  "sort": [
    { "@timestamp": "desc" },
    { "_id": "asc" }
  ],
  "query": { "match": { "level": "error" } },
  "search_after": [1707523200000, "doc_abc123"]
}

# Each shard only returns docs AFTER the cursor: no wasted work
# Consistent performance regardless of page depth
```
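The routing row from the table above in practice (tenant IDs are illustrative). Note the filter is still required: routing narrows the query to one shard, but other tenants can hash to that same shard:

```
# Index with a routing key; the document lands on the shard derived
# from the routing value instead of from _id
POST /orders/_doc?routing=tenant_42
{ "tenant_id": "tenant_42", "total": 99.5 }

# Search with the same routing key; only that one shard is queried
POST /orders/_search?routing=tenant_42
{
  "query": {
    "bool": {
      "filter": [ { "term": { "tenant_id": "tenant_42" } } ]
    }
  }
}
```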
Memory & Resource Management
Elasticsearch runs on the JVM but relies heavily on the OS page cache for Lucene segment reads. Getting the memory split wrong is one of the most common causes of cluster instability.
```
Total Server RAM: 64 GB (example)

┌─────────────────────────────────────────────────────────┐
│                    64 GB Total RAM                      │
├──────────────────────┬──────────────────────────────────┤
│  JVM Heap: 31 GB     │  OS Page Cache: ~31 GB           │
│  (ES internals)      │  (Lucene segment files)          │
│                      │                                  │
│  • Field data        │  • Segment reads (search)        │
│  • Node query cache  │  • Merges                        │
│  • Indexing buffer   │  • Stored fields                 │
│  • Cluster state     │  • Doc values                    │
│  • Aggregations      │  • Term dictionaries             │
└──────────────────────┴──────────────────────────────────┘

Rules:
  1. JVM heap ≤ 50% of RAM (leave the rest for the page cache)
  2. JVM heap ≤ 31 GB (compressed oops threshold)
  3. Set Xms and Xmx to the same value (avoids resize pauses)
  4. Prevent swapping: disable swap at the OS level or set
     bootstrap.memory_lock: true
```
💡 The 32 GB Compressed Oops Boundary
The JVM uses "compressed ordinary object pointers" (oops) when the heap is below ~32 GB. This lets 4-byte pointers address 32 GB of memory. Above 32 GB, pointers expand to 8 bytes, effectively wasting ~30% of heap on pointer overhead. A 31 GB heap often outperforms a 40 GB heap. Never set the heap between 32 and 40 GB.
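In practice the heap rules are pinned in a jvm.options override. A minimal sketch (the file name under config/jvm.options.d/ is arbitrary):

```
# config/jvm.options.d/heap.options
# Equal Xms/Xmx, below the compressed-oops threshold
-Xms31g
-Xmx31g
```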
| Resource | Purpose | Tuning Guidance |
|---|---|---|
| JVM Heap | ES internal data structures, caches, aggregation buffers | Set to min(50% RAM, 31 GB). Equal Xms/Xmx. G1GC for heaps > 8 GB. |
| OS Page Cache | Caches Lucene segment files for fast reads | Leave at least 50% RAM for this. More = faster searches. |
| Field Data Cache | In-memory uninverted index for text field aggregations | AVOID: use keyword fields or doc_values instead. Set an indices.fielddata.cache.size limit. |
| Node Query Cache | Caches filter clause results as bitsets | Default 10% heap. Effective for repeated filters (tenant_id, status). |
| Indexing Buffer | Buffers new documents before creating segments | Default 10% heap. Increase for heavy indexing workloads. |
| Circuit Breakers | Prevent OOM by rejecting requests that would exceed limits | Parent breaker: 95% heap. Don't disable β they protect cluster stability. |
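A sketch of the corresponding elasticsearch.yml settings; the fielddata and indexing-buffer values here are illustrative choices, not defaults:

```
# elasticsearch.yml
indices.fielddata.cache.size: 20%        # cap field data (unbounded by default)
indices.queries.cache.size: 10%          # node query cache (10% is the default)
indices.memory.index_buffer_size: 20%    # raise from the 10% default on write-heavy nodes
```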
Monitoring & Reliability
Elasticsearch clusters degrade silently before they fail loudly. Monitoring the right metrics gives you early warning. Snapshots give you recovery when things go wrong.
| Metric | What It Means | Alert Threshold | Action |
|---|---|---|---|
| Cluster Health | green/yellow/red: shard allocation status | Yellow > 5 min, Red immediately | Yellow = unassigned replicas. Red = unassigned primaries (data loss risk). |
| JVM Heap Usage | Percentage of heap in use | > 85% sustained | Frequent GC, risk of OOM. Reduce caches, add nodes, or increase heap (up to 31 GB). |
| GC Time | Time spent in garbage collection | > 500ms per collection, or > 5% of time in GC | Long GC pauses cause node timeouts. Check for field data, large aggregations. |
| Search Latency (p99) | 99th percentile search response time | Depends on SLA (e.g., > 500ms) | Check slow log, profile queries, verify shard sizing. |
| Indexing Rate | Documents indexed per second | Sudden drop or spike | Drop = upstream issue or rejections. Spike = bulk load affecting search. |
| Thread Pool Rejections | Requests rejected due to full queue | > 0 (search or write rejections) | Cluster is overloaded. Scale out or reduce request rate. |
| Disk Watermarks | Low (85%), High (90%), Flood (95%) | Approaching the low watermark | At low: no new shards allocated to the node. At high: shards relocated away. At flood: indices set to read-only. |
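Most of these metrics come from a handful of standard endpoints, for example:

```
GET _cluster/health                      # green / yellow / red, unassigned shard counts
GET _cat/nodes?v&h=name,heap.percent,cpu,disk.used_percent
GET _cat/thread_pool/search,write?v&h=name,node_name,active,queue,rejected
GET _nodes/stats/jvm?filter_path=nodes.*.jvm.gc   # GC collection counts and times
```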
Snapshot & Restore
```
# Register a snapshot repository (S3 example)
PUT /_snapshot/my-s3-repo
{
  "type": "s3",
  "settings": {
    "bucket": "my-es-backups",
    "region": "us-east-1",
    "base_path": "elasticsearch/snapshots"
  }
}

# Create a snapshot (incremental: only new/changed segments)
PUT /_snapshot/my-s3-repo/snapshot-2024-02-10
{
  "indices": "logs-*,products",
  "ignore_unavailable": true,
  "include_global_state": false
}

# Automate with SLM (Snapshot Lifecycle Management)
PUT /_slm/policy/nightly-snapshots
{
  "schedule": "0 30 2 * * ?",
  "name": "<nightly-snap-{now/d}>",
  "repository": "my-s3-repo",
  "config": {
    "indices": ["*"],
    "ignore_unavailable": true
  },
  "retention": {
    "expire_after": "30d",
    "min_count": 5,
    "max_count": 50
  }
}

# Restore a specific index from a snapshot
POST /_snapshot/my-s3-repo/snapshot-2024-02-10/_restore
{
  "indices": "products",
  "rename_pattern": "(.+)",
  "rename_replacement": "restored-$1"
}
```
Slow Log
```
# Enable slow logs for an index
PUT /products/_settings
{
  "index.search.slowlog.threshold.query.warn": "5s",
  "index.search.slowlog.threshold.query.info": "2s",
  "index.search.slowlog.threshold.fetch.warn": "1s",
  "index.indexing.slowlog.threshold.index.warn": "10s",
  "index.indexing.slowlog.threshold.index.info": "5s"
}

# Slow log output includes:
# - The full query that was slow
# - Which shard it ran on
# - Total time breakdown (query phase, fetch phase)
# - Number of hits

# Use the slow log to:
# 1. Identify expensive queries in production
# 2. Find queries that need optimization (wildcards, deep aggs)
# 3. Detect shard hotspots (one shard consistently slow)
```
🎯 Shard Allocation Awareness
Use shard allocation awareness to spread replicas across failure domains (availability zones, racks). This ensures that losing one zone doesn't lose both primary and replica of the same shard. Configure with cluster.routing.allocation.awareness.attributes.
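A minimal sketch, assuming an attribute named zone (any attribute name works):

```
# elasticsearch.yml on each node: tag the node with its failure domain
node.attr.zone: us-east-1a

# Tell the allocator to spread shard copies across that attribute
# (can also be set dynamically, as here)
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "zone"
  }
}
```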
Security
Elasticsearch clusters exposed without security have been the source of numerous data breaches. Since version 8.0, security is enabled by default β but understanding the layers is essential.
| Layer | Mechanism | Scope |
|---|---|---|
| Encryption in transit | TLS for HTTP (client↔node) and transport (node↔node) | Prevents eavesdropping and MITM attacks on cluster traffic |
| Authentication | Native realm, LDAP, Active Directory, SAML, OIDC, PKI (mTLS) | Verifies identity of users and services connecting to the cluster |
| Authorization (RBAC) | Roles with index, cluster, and field-level privileges | Controls what authenticated users can do |
| Field-level security | Roles can restrict which fields a user can see | e.g., support team can see order status but not payment details |
| Document-level security | Roles include a query filter β users only see matching docs | e.g., tenant_id filter ensures each tenant sees only their data |
| API Keys | Scoped, time-limited keys for service-to-service auth | Preferred over username/password for applications |
```
# Create a role with index-level and field-level security
# (the "except" list must be a subset of the granted fields, so grant a
#  wildcard and carve out the sensitive fields)
POST /_security/role/support_agent
{
  "indices": [
    {
      "names": ["orders-*"],
      "privileges": ["read"],
      "field_security": {
        "grant": ["*"],
        "except": ["payment_card", "billing_address"]
      },
      "query": { "term": { "region": "us-east" } }
    }
  ]
}

# Create an API key for a microservice
POST /_security/api_key
{
  "name": "order-service-key",
  "expiration": "30d",
  "role_descriptors": {
    "order_writer": {
      "indices": [
        {
          "names": ["orders-*"],
          "privileges": ["write", "create_index"]
        }
      ]
    }
  }
}
```
Elasticsearch vs Alternatives
Elasticsearch is powerful but not always the right tool. Understanding when alternatives are better prevents over-engineering and reduces operational burden.
| Comparison | ES Strength | Alternative Strength | Choose Alternative When |
|---|---|---|---|
| ES vs PostgreSQL FTS | Distributed, custom analyzers, relevance tuning, fuzzy matching, aggregations | No extra infra, ACID transactions, simpler ops, good enough for basic search | < 1M docs, simple search needs, already using Postgres, can't justify another system |
| ES vs Solr | Better REST API, easier clustering, faster innovation, richer ecosystem (ELK) | Mature, battle-tested, better for static collections, strong XML/faceting | Existing Solr investment, static document collections, Hadoop integration needed |
| ES vs Pinecone/Weaviate | Hybrid search (keyword + vector), full-text capabilities, existing ecosystem | Purpose-built for vectors, simpler API, managed scaling, better recall at scale | Pure semantic search, no keyword needs, want managed service, billions of vectors |
| ES vs ClickHouse | Full-text search, fuzzy matching, complex query DSL | 10-100x faster for analytical queries, columnar storage, SQL interface | Analytics/OLAP workload, aggregations over structured data, no full-text needs |
| ES vs Loki (logs) | Rich query language, full-text search across logs, aggregations | 10x cheaper storage, label-based indexing, native Grafana integration, simpler ops | Cost-sensitive log storage, label-based filtering is sufficient, Grafana stack |
When ES Is Overkill
Skip Elasticsearch When
- ✗ Simple LIKE queries on < 1M rows: PostgreSQL full-text search or a trigram index handles this fine
- ✗ Pure analytics/dashboards on structured data: ClickHouse or BigQuery are 10-100x faster and cheaper
- ✗ You only need log grep with labels: Loki + Grafana costs a fraction of ELK
- ✗ Pure vector/embedding search: Pinecone, Weaviate, or pgvector are simpler and purpose-built
- ✗ You need ACID transactions: ES is eventually consistent, not a primary database
- ✗ Team lacks ES operational expertise: the learning curve and ops burden are significant
Elasticsearch Is Ideal When
- ✓ Full-text search with relevance ranking, fuzzy matching, synonyms, and custom analyzers
- ✓ Hybrid search combining keyword (BM25) and vector (kNN) approaches
- ✓ Real-time log analytics with complex queries across terabytes of data
- ✓ Autocomplete/typeahead with edge n-grams at scale
- ✓ Geo-spatial search combined with text and filters
- ✓ Multi-tenant search with per-tenant relevance tuning
Common Mistakes
Using Elasticsearch as a primary database
Treating ES as the source of truth: writing data only to ES and relying on it for transactions and consistency. When a node fails or a reindex is needed, data is lost or inconsistent.
✓ ES is a secondary index, not a primary store. Write to your primary database (PostgreSQL, DynamoDB) first, then sync to ES (see the sketch below). The primary DB is the source of truth. ES can always be rebuilt from it.
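One common shape for that sync, sketched in Python with the psycopg2 and elasticsearch clients; the table, index, and connection details are hypothetical:

```
import psycopg2
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
pg = psycopg2.connect("dbname=shop user=app")

def save_product(product_id: str, title: str, price: float) -> None:
    # 1. Commit to the primary store first (the source of truth);
    #    the connection context manager commits the transaction on exit
    with pg, pg.cursor() as cur:
        cur.execute(
            "INSERT INTO products (id, title, price) VALUES (%s, %s, %s) "
            "ON CONFLICT (id) DO UPDATE SET title = EXCLUDED.title, "
            "price = EXCLUDED.price",
            (product_id, title, price),
        )
    # 2. Then sync to ES; a failure here can be retried or repaired by a
    #    full reindex from Postgres, since ES holds no unique data
    es.index(index="products", id=product_id,
             document={"title": title, "price": price})
```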
Not setting up ILM for time-series data
Indexing months of logs into a single growing index. The index becomes enormous, searches slow down, and you can't delete old data without reindexing everything.
✓ Use data streams with ILM policies from day one. Define rollover conditions (50 GB or 1 day), warm/cold/frozen phases, and a delete phase. This keeps hot indices small and fast while aging data to cheaper storage automatically.
JVM heap greater than 32 GB
Setting -Xmx to 48 GB or 64 GB thinking more heap = better performance. Above ~32 GB, the JVM loses compressed oops: pointers double in size, wasting 30%+ of heap on overhead. A 48 GB heap may perform worse than 31 GB.
✓ Never exceed a 31 GB heap (stay below the compressed oops threshold). If you need more capacity, add nodes rather than increasing heap. The remaining RAM serves the OS page cache, which Lucene depends on for fast segment reads.
No snapshot/backup policy
Running a production cluster without automated snapshots. A bad mapping change, accidental delete, or cluster failure means permanent data loss with no recovery path.
✓ Configure SLM (Snapshot Lifecycle Management) to take daily incremental snapshots to S3/GCS. Snapshots are incremental: only new segments are uploaded. Test restore procedures regularly. A snapshot that's never been tested is not a backup.
Ignoring the slow log
Not configuring slow log thresholds. Expensive queries silently degrade cluster performance for all users. By the time you notice, the cluster is already under pressure.
✓ Enable the slow log with reasonable thresholds (warn at 5s, info at 2s). Review it weekly. Common culprits: leading wildcards, deep aggregations, script queries, and unbounded from/size pagination. Use the Profile API to diagnose specific slow queries.