Inverted Index & Text Analysis
The foundation of everything in Elasticsearch — how text is broken into tokens, stored in an inverted index, and matched at query time.
Why Traditional DBs Fail at Search
When you run SELECT * FROM products WHERE name LIKE '%running shoes%', the database performs a full table scan, checking every row sequentially — the leading wildcard defeats any B-tree index on the column. It can't rank results by relevance. It won't match "run shoe" or "shoes for running." This is fundamentally the wrong data structure for search.
| Aspect | SQL LIKE | Elasticsearch |
|---|---|---|
| Performance | Full table scan — O(N) | Inverted index lookup — O(1) to O(log N) |
| Relevance ranking | None — results are unordered | BM25 scoring — best matches first |
| Fuzzy matching | Not supported | Edit distance, phonetic, stemming |
| Partial word match | Only with leading % (kills index) | N-grams, edge n-grams, prefix queries |
| Synonyms | Not supported | Built-in synonym token filter |
| Scalability | Single node, single table | Distributed across shards and nodes |
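For contrast, the same search as an Elasticsearch query; a minimal sketch, assuming a products index whose name field is mapped as analyzed text:

```
GET /products/_search
{
  "query": {
    "match": { "name": "running shoes" }
  }
}

# The query string is analyzed into tokens ("running", "shoes"),
# each token is looked up in the inverted index, and matching
# documents come back ranked by BM25; no table scan involved.
```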
The Library Card Catalog
SQL LIKE is like walking through every shelf in a library checking each book's title. An inverted index is like the card catalog — you look up 'running' and instantly get a list of every book containing that word, sorted by relevance. The catalog is built once (at index time) so lookups are instant.
🔑 ES is a Secondary Index
Elasticsearch is NOT a primary database. It has no transactions, no referential integrity, and is eventually consistent. The canonical pattern: write to your primary DB (PostgreSQL), then sync to ES for search. ES is a read-optimized secondary index.
The Inverted Index
An inverted index maps every unique term to the list of documents containing it. It's "inverted" because instead of document → terms (a forward index), it stores term → documents. This is what makes search a near-constant-time lookup — O(1) to O(log N) — instead of an O(N) scan.
```
Forward Index (what a database stores):

  Doc 1: "The quick brown fox"
  Doc 2: "The quick blue car"
  Doc 3: "A brown dog"

Inverted Index (what Elasticsearch builds):

  Term    → Document IDs (postings list)
  ─────────────────────────────────────────
  "a"     → [3]
  "blue"  → [2]
  "brown" → [1, 3]
  "car"   → [2]
  "dog"   → [3]
  "fox"   → [1]
  "quick" → [1, 2]
  "the"   → [1, 2]

Search for "brown":
  → Look up "brown" in the term dictionary → [1, 3]
  → Return Doc 1 and Doc 3
  → No scanning needed — direct lookup
```

Each postings list also stores:
- Term frequency (TF): how many times the term appears in each doc
- Positions: where in the document the term appears (for phrase queries)
- Offsets: character positions (for highlighting)
The Book Index
The index at the back of a textbook is an inverted index. 'Photosynthesis → pages 42, 87, 156'. You don't read the whole book to find where photosynthesis is discussed — you look it up in the index and jump directly to those pages. Elasticsearch does the same thing, but for millions of documents and thousands of terms.
💡 Built at Write Time, Fast at Read Time
The inverted index is built when documents are indexed (written). This makes writes slightly slower but reads extremely fast. This is the fundamental trade-off: ES optimizes for read performance at the cost of write complexity. This is why ES is great for search but not for frequent updates.
Text Analysis Pipeline
Before text enters the inverted index, it goes through an analysis pipeline. This pipeline transforms raw text into normalized tokens. The same pipeline runs at query time so that search terms match indexed terms.
```
Input text: "<p>The Quick BROWN Fox's running!</p>"

Step 1: Character Filters (transform characters)
  → html_strip: "The Quick BROWN Fox's running!"
  → mapping:    (custom replacements if configured)

Step 2: Tokenizer (split into tokens)
  → standard tokenizer: ["The", "Quick", "BROWN", "Fox's", "running"]

Step 3: Token Filters (transform tokens)
  → lowercase:  ["the", "quick", "brown", "fox's", "running"]
  → apostrophe: ["the", "quick", "brown", "fox", "running"]
  → stop words: ["quick", "brown", "fox", "running"]
  → stemming:   ["quick", "brown", "fox", "run"]

Final tokens stored in inverted index: ["quick", "brown", "fox", "run"]

Now when a user searches "runs":
  → Same pipeline: "runs" → lowercase → stem → "run"
  → "run" matches the indexed token "run" ✓
  → Document is returned
```
Why Analysis Must Match
- ✅ Index-time analysis produces tokens stored in the inverted index
- ✅ Query-time analysis produces tokens used for lookup
- ✅ If they use different analyzers, terms won't match (e.g., stemmed vs unstemmed)
- ✅ The _analyze API lets you test what tokens an analyzer produces (see the example below)
- ✅ Always verify your analyzer produces expected tokens before deploying
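A minimal sketch of that check using the built-in english analyzer:

```
GET /_analyze
{
  "analyzer": "english",
  "text": "The Quick BROWN Fox's running!"
}

# Returns tokens along the lines of ["quick", "brown", "fox", "run"].
# Run the same call with your query string to confirm that index-time
# and query-time tokens actually line up.
```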
Analyzers
| Analyzer | What It Does | Use Case |
|---|---|---|
| standard (default) | Standard tokenizer + lowercase (stop-word removal available but off by default) | General-purpose full-text search |
| whitespace | Splits on whitespace only, no lowercasing | When you need exact token boundaries |
| keyword | No tokenization — entire value is one token | Exact match fields (email, URL, ID) |
| english | Standard + English stemming + stop words | English-language content |
| custom | Your own char filters + tokenizer + token filters | Domain-specific analysis needs |
```
PUT /products
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "autocomplete_filter"]
        },
        "autocomplete_search": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      },
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 15
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "autocomplete_analyzer",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}

# "iPhone" indexed as: ["ip", "iph", "ipho", "iphon", "iphone"]
# Search "iph" → matches because "iph" is in the index
# search_analyzer doesn't edge_ngram the query — prevents over-matching
```
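With that mapping in place, a typeahead lookup is just a match query on name; a sketch:

```
GET /products/_search
{
  "query": {
    "match": { "name": "iph" }
  }
}

# The query passes through autocomplete_search (lowercase only, no
# edge n-grams), so "iph" is looked up directly and matches the
# indexed edge n-gram "iph" produced from "iPhone".
```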
Stemming & Synonyms
| Feature | How It Works | Trade-off |
|---|---|---|
| Stemming | 'running' → 'run', 'runs' → 'run' | Over-stemming: 'university' → 'univers' matches 'universal' |
| Synonyms (index-time) | Expand at index: 'car' stored as ['car', 'automobile'] | Larger index, can't update synonyms without reindex |
| Synonyms (query-time) | Expand at search: query 'car' also searches 'automobile' | Slower queries, but synonyms updatable without reindex |
| Edge N-grams | 'phone' → ['ph', 'pho', 'phon', 'phone'] | Much larger index, but enables prefix/autocomplete matching |
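As a sketch of the query-time variant (the filter name and synonym list here are illustrative), the synonym filter sits only in the search analyzer, so documents are indexed plain and the synonym list can be updated without reindexing them:

```
PUT /products
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym_graph",
          "synonyms": ["car, automobile", "tv, television"]
        }
      },
      "analyzer": {
        "synonym_search": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_synonyms"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "search_analyzer": "synonym_search"
      }
    }
  }
}

# A query for "car" is expanded to also look up "automobile";
# the indexed documents are untouched.
```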
Mappings & Field Types
A mapping is the schema definition for an index — it defines field names, types, and how each field is analyzed and stored. Getting mappings wrong is expensive because you cannot change a field's type after creation — you must reindex.
```
Text types:
  text           — analyzed, broken into tokens, for full-text search
  keyword        — NOT analyzed, stored as-is, for exact match/sort/agg

Numeric:
  integer, long, float, double, scaled_float

Date:
  date           — ISO 8601 or epoch millis, supports range queries

Boolean:
  boolean        — true/false

Geo:
  geo_point      — lat/lon coordinate
  geo_shape      — polygons, lines, complex shapes

Specialized:
  ip             — IPv4/IPv6 addresses
  completion     — FST-based autocomplete (fastest prefix matching)
  dense_vector   — embedding vectors for kNN search
  nested         — arrays of objects with independent field queries
  join           — parent-child relationships within an index

Object (default for JSON objects):
  object         — flattened, inner fields lose independence
  nested         — preserves object boundaries (more expensive)
  flattened      — entire JSON as single opaque field
```
Critical Mapping Decisions
- ✅ text vs keyword — the most common mistake; use text for search, keyword for exact match/sort/agg
- ✅ index: false — store a field but don't index it (saves disk, can't search on it)
- ✅ doc_values: false — disable for fields never used in aggregations/sorting (saves disk)
- ✅ dynamic: strict — reject documents with unmapped fields (prevents mapping explosion)
- ✅ copy_to — copy multiple fields into a catch-all field for simple cross-field search (see the sketch after this list)
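A minimal sketch combining several of these options (index and field names are illustrative):

```
PUT /articles
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "title":    { "type": "text", "copy_to": "all_text" },
      "body":     { "type": "text", "copy_to": "all_text" },
      "all_text": { "type": "text" },
      "raw_html": { "type": "keyword", "index": false },
      "status":   { "type": "keyword", "doc_values": false }
    }
  }
}

# all_text → one catch-all field for simple cross-field search
# raw_html → kept in _source but not searchable (saves disk)
# status   → searchable with term queries, but no sorting/aggregations
```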
text vs keyword
This is the single most important mapping decision and the most common source of bugs. text fields are analyzed (broken into tokens for full-text search). keyword fields are stored as-is (for exact match, sorting, and aggregations).
| Aspect | text | keyword |
|---|---|---|
| Analysis | Yes — tokenized, lowercased, stemmed | No — stored exactly as provided |
| Search type | Full-text (match query) | Exact match (term query) |
| Sorting | Cannot sort (multiple tokens per field) | Can sort alphabetically |
| Aggregations | Cannot aggregate efficiently | Can aggregate (terms agg) |
| Example field | Product description, article body | Email, status, country code, URL |
| Storage | Inverted index (tokens) | Doc values (columnar) + inverted index |
```
# Document indexed:
PUT /users/_doc/1
{ "email": "Alice@Example.COM" }

# If email is mapped as "text" (WRONG for email):
#   Analyzed: ["alice", "example.com"] or ["alice", "example", "com"]
#   term query for "Alice@Example.COM" → NO MATCH (it's been lowercased/tokenized)
#   match query for "alice" → MATCHES (but also matches "alice in wonderland")

# If email is mapped as "keyword" (CORRECT for email):
#   Stored as-is: "Alice@Example.COM"
#   term query for "Alice@Example.COM" → EXACT MATCH ✓
#   Can sort, aggregate, and filter on exact values

# The multi-field pattern (have both):
"title": {
  "type": "text",                  ← full-text search on title
  "fields": {
    "raw": { "type": "keyword" }   ← exact match, sort, aggregate on title.raw
  }
}
```
Multi-fields & Mapping Design
Multi-fields let you index the same source field in multiple ways. The most common pattern: a text sub-field for search and a keyword sub-field for sorting/aggregation.
```
PUT /products
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "english",
        "fields": {
          "raw": { "type": "keyword" },
          "autocomplete": {
            "type": "text",
            "analyzer": "autocomplete_analyzer",
            "search_analyzer": "autocomplete_search"
          }
        }
      },
      "description": { "type": "text", "analyzer": "english" },
      "price": { "type": "scaled_float", "scaling_factor": 100 },
      "category": { "type": "keyword" },
      "tags": { "type": "keyword" },
      "created_at": { "type": "date" },
      "location": { "type": "geo_point" },
      "metadata": { "type": "object", "enabled": false }
    }
  }
}

# name              → full-text search with English stemming
# name.raw          → exact match, sorting, aggregations
# name.autocomplete → typeahead/prefix matching
# metadata          → stored but not indexed (enabled: false)
```
💡 You Cannot Change Field Types
Once a field is mapped, you cannot change its type. Adding new fields is fine, but changing "text" to "keyword" requires creating a new index with the correct mapping and reindexing all documents. Use the alias swap pattern for zero-downtime reindex.
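A sketch of that pattern (index and alias names are illustrative):

```
# 1. Create the new index with the corrected mapping
PUT /products_v2
{
  "mappings": {
    "properties": {
      "sku": { "type": "keyword" }
    }
  }
}

# 2. Copy documents into it
POST /_reindex
{
  "source": { "index": "products_v1" },
  "dest":   { "index": "products_v2" }
}

# 3. Atomically repoint the alias the application queries
POST /_aliases
{
  "actions": [
    { "remove": { "index": "products_v1", "alias": "products" } },
    { "add":    { "index": "products_v2", "alias": "products" } }
  ]
}

# Searches against the "products" alias switch to the new index
# with no downtime; the old index can be deleted afterwards.
```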
Interview Questions
Q: What is an inverted index and why is it fast?
A: An inverted index maps each unique term to the list of documents containing it (term → doc IDs). Search is fast because it's a direct lookup — find the term in the sorted dictionary, get the postings list. No scanning needed. It's built at write time so reads are O(1) to O(log N) regardless of collection size. Traditional databases scan every row (O(N)) because they use forward indices (doc → terms).
Q: Why must index-time and query-time analysis match?
A: If you stem 'running' to 'run' at index time but don't stem the query 'running', the query looks for 'running' in the inverted index but only 'run' exists — no match. Both sides must produce the same tokens. This is why ES applies the same analyzer at both times by default. Using different analyzers (search_analyzer) is only safe when you understand the asymmetry (e.g., edge n-grams at index time, standard at query time).
Q: Explain the difference between text and keyword field types.
A: text fields are analyzed — broken into tokens by an analyzer for full-text search. You can't sort or aggregate on them efficiently. keyword fields are stored as-is — no analysis. They support exact match, sorting, and aggregations but not full-text search. The multi-field pattern gives you both: 'title' as text for search, 'title.raw' as keyword for sort/agg.
Q: Why is Elasticsearch not suitable as a primary database?
A: ES lacks ACID transactions, has no referential integrity, is eventually consistent (refresh interval), and updates are expensive (delete + reindex internally). It's optimized for read-heavy search workloads, not transactional writes. The canonical pattern: PostgreSQL as primary DB (source of truth) → sync to ES for search capabilities.
Common Mistakes
Using text type for exact-match fields
Mapping email, status, or ID fields as 'text'. Queries for 'ACTIVE' don't match because the field was lowercased to 'active'. Aggregations return individual tokens instead of full values.
✅ Use 'keyword' for any field that needs exact match, sorting, or aggregation. Use the multi-field pattern (text + keyword sub-field) when you need both full-text search AND exact operations on the same field.
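For example, with status mapped as keyword, a terms aggregation buckets the full stored values; a sketch:

```
GET /users/_search
{
  "size": 0,
  "aggs": {
    "by_status": { "terms": { "field": "status" } }
  }
}

# Buckets contain whole values such as "ACTIVE", exactly as stored.
# With status mapped as text, the same aggregation either fails
# (fielddata is disabled by default) or buckets individual tokens.
```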
Dynamic mapping in production
Leaving dynamic mapping enabled. A single malformed document with unexpected fields creates hundreds of new field mappings, bloating cluster state and causing performance issues.
✅ Always set dynamic: 'strict' in production mappings. This rejects documents with unmapped fields. Define all fields explicitly. Use dynamic: false if you want to store but not index unknown fields.
Using term query on text fields
Running a term query (exact match) on a text field. The field was analyzed to lowercase tokens, but the term query doesn't analyze the input — 'iPhone' doesn't match the indexed token 'iphone'.
✅ Use match query for text fields (it analyzes the input). Use term query only for keyword fields. If you need exact match on a text field, add a keyword sub-field and query that instead.
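A sketch of the difference, assuming name is a text field with a name.raw keyword sub-field:

```
# WRONG: term query on an analyzed text field
GET /products/_search
{ "query": { "term": { "name": "iPhone" } } }
# No match: the index holds the token "iphone" and term does not
# analyze its input.

# RIGHT: match query analyzes the input the same way as the field
GET /products/_search
{ "query": { "match": { "name": "iPhone" } } }

# RIGHT: term query on the keyword sub-field for true exact match
GET /products/_search
{ "query": { "term": { "name.raw": "iPhone" } } }
```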
Not using the multi-field pattern
Creating separate fields for search and aggregation (name_search as text, name_exact as keyword), requiring double the storage and complex indexing logic.
✅ Use multi-fields: map the field as text with a .raw keyword sub-field. One source field, multiple indexing strategies. Search on 'name', sort/aggregate on 'name.raw'.
Mismatched analyzers between index and query time
Using a custom analyzer at index time but forgetting to set search_analyzer. Or using edge n-grams at both index and query time, causing over-matching.
✅ For autocomplete: use an edge_ngram analyzer at index time and a standard analyzer at query time (search_analyzer). Always test with the _analyze API to verify tokens match expectations.