Inverted Index & Text Analysis
The foundation of everything in Elasticsearch — how text is broken into tokens, stored in an inverted index, and matched at query time.
Why Traditional DBs Fail at Search
When you run SELECT * FROM products WHERE name LIKE '%running shoes%', the database performs a full table scan, checking every row sequentially — the leading wildcard defeats any B-tree index on the column. It can't rank results by relevance. It won't match "run shoe" or "shoes for running." This is fundamentally the wrong data structure for search.
| Aspect | SQL LIKE | Elasticsearch |
|---|---|---|
| Performance | Full table scan — O(N) | Inverted index lookup — O(1) to O(log N) |
| Relevance ranking | None — results are unordered | BM25 scoring — best matches first |
| Fuzzy matching | Not supported | Edit distance, phonetic, stemming |
| Partial word match | Only with leading % (kills index) | N-grams, edge n-grams, prefix queries |
| Synonyms | Not supported | Built-in synonym token filter |
| Scalability | Single node, single table | Distributed across shards and nodes |
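For contrast, the same search as an Elasticsearch query; a minimal sketch, assuming a products index whose name field is mapped as analyzed text:

```
GET /products/_search
{
  "query": {
    "match": { "name": "running shoes" }
  }
}

# The query string is analyzed into tokens ("running", "shoes"),
# each token is looked up in the inverted index, and matching
# documents come back ranked by BM25; no table scan involved.
```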
The Library Card Catalog
SQL LIKE is like walking through every shelf in a library checking each book's title. An inverted index is like the card catalog — you look up 'running' and instantly get a list of every book containing that word, sorted by relevance. The catalog is built once (at index time) so lookups are instant.
🔑 ES is a Secondary Index
Elasticsearch is NOT a primary database. It has no transactions, no referential integrity, and is eventually consistent. The canonical pattern: write to your primary DB (PostgreSQL), then sync to ES for search. ES is a read-optimized secondary index.
The Inverted Index
An inverted index maps every unique term to the list of documents containing it. It's "inverted" because instead of document → terms (a forward index), it stores term → documents. This is what makes search a near-constant-time lookup — O(1) to O(log N) — instead of an O(N) scan.
```
Forward Index (what a database stores):

  Doc 1: "The quick brown fox"
  Doc 2: "The quick blue car"
  Doc 3: "A brown dog"

Inverted Index (what Elasticsearch builds):

  Term    → Document IDs (postings list)
  ─────────────────────────────────────────
  "a"     → [3]
  "blue"  → [2]
  "brown" → [1, 3]
  "car"   → [2]
  "dog"   → [3]
  "fox"   → [1]
  "quick" → [1, 2]
  "the"   → [1, 2]

Search for "brown":
  → Look up "brown" in the term dictionary → [1, 3]
  → Return Doc 1 and Doc 3
  → No scanning needed — direct lookup
```

Each postings list also stores:
- Term frequency (TF): how many times the term appears in each doc
- Positions: where in the document the term appears (for phrase queries)
- Offsets: character positions (for highlighting)
The Book Index
The index at the back of a textbook is an inverted index. 'Photosynthesis → pages 42, 87, 156'. You don't read the whole book to find where photosynthesis is discussed — you look it up in the index and jump directly to those pages. Elasticsearch does the same thing, but for millions of documents and thousands of terms.
💡 Built at Write Time, Fast at Read Time
The inverted index is built when documents are indexed (written). This makes writes slightly slower but reads extremely fast. This is the fundamental trade-off: ES optimizes for read performance at the cost of write complexity. This is why ES is great for search but not for frequent updates.
Text Analysis Pipeline
Before text enters the inverted index, it goes through an analysis pipeline. This pipeline transforms raw text into normalized tokens. The same pipeline runs at query time so that search terms match indexed terms.
```
Input text: "<p>The Quick BROWN Fox's running!</p>"

Step 1: Character Filters (transform characters)
  → html_strip: "The Quick BROWN Fox's running!"
  → mapping:    (custom replacements if configured)

Step 2: Tokenizer (split into tokens)
  → standard tokenizer: ["The", "Quick", "BROWN", "Fox's", "running"]

Step 3: Token Filters (transform tokens)
  → lowercase:  ["the", "quick", "brown", "fox's", "running"]
  → apostrophe: ["the", "quick", "brown", "fox", "running"]
  → stop words: ["quick", "brown", "fox", "running"]
  → stemming:   ["quick", "brown", "fox", "run"]

Final tokens stored in inverted index: ["quick", "brown", "fox", "run"]

Now when a user searches "runs":
  → Same pipeline: "runs" → lowercase → stem → "run"
  → "run" matches the indexed token "run" ✓
  → Document is returned
```
Why Analysis Must Match
- ✅ Index-time analysis produces tokens stored in the inverted index
- ✅ Query-time analysis produces tokens used for lookup
- ✅ If they use different analyzers, terms won't match (e.g., stemmed vs unstemmed)
- ✅ The _analyze API lets you test what tokens an analyzer produces (see the example below)
- ✅ Always verify your analyzer produces expected tokens before deploying
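A minimal sketch of that check using the built-in english analyzer:

```
GET /_analyze
{
  "analyzer": "english",
  "text": "The Quick BROWN Fox's running!"
}

# Returns tokens along the lines of ["quick", "brown", "fox", "run"].
# Run the same call with your query string to confirm that index-time
# and query-time tokens actually line up.
```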
Analyzers
| Analyzer | What It Does | Use Case |
|---|---|---|
| standard (default) | Standard tokenizer + lowercase (stop-word removal available but off by default) | General-purpose full-text search |
| whitespace | Splits on whitespace only, no lowercasing | When you need exact token boundaries |
| keyword | No tokenization — entire value is one token | Exact match fields (email, URL, ID) |
| english | Standard + English stemming + stop words | English-language content |
| custom | Your own char filters + tokenizer + token filters | Domain-specific analysis needs |
```
PUT /products
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "autocomplete_filter"]
        },
        "autocomplete_search": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      },
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 15
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "autocomplete_analyzer",
        "search_analyzer": "autocomplete_search"
      }
    }
  }
}

# "iPhone" indexed as: ["ip", "iph", "ipho", "iphon", "iphone"]
# Search "iph" → matches because "iph" is in the index
# search_analyzer doesn't edge_ngram the query — prevents over-matching
```
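With that mapping in place, a typeahead lookup is just a match query on name; a sketch:

```
GET /products/_search
{
  "query": {
    "match": { "name": "iph" }
  }
}

# The query passes through autocomplete_search (lowercase only, no
# edge n-grams), so "iph" is looked up directly and matches the
# indexed edge n-gram "iph" produced from "iPhone".
```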
Stemming & Synonyms
| Feature | How It Works | Trade-off |
|---|---|---|
| Stemming | 'running' → 'run', 'runs' → 'run' | Over-stemming: 'university' → 'univers' matches 'universal' |
| Synonyms (index-time) | Expand at index: 'car' stored as ['car', 'automobile'] | Larger index, can't update synonyms without reindex |
| Synonyms (query-time) | Expand at search: query 'car' also searches 'automobile' | Slower queries, but synonyms updatable without reindex |
| Edge N-grams | 'phone' → ['ph', 'pho', 'phon', 'phone'] | Much larger index, but enables prefix/autocomplete matching |
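As a sketch of the query-time variant (the filter name and synonym list here are illustrative), the synonym filter sits only in the search analyzer, so documents are indexed plain and the synonym list can be updated without reindexing them:

```
PUT /products
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym_graph",
          "synonyms": ["car, automobile", "tv, television"]
        }
      },
      "analyzer": {
        "synonym_search": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_synonyms"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "search_analyzer": "synonym_search"
      }
    }
  }
}

# A query for "car" is expanded to also look up "automobile";
# the indexed documents are untouched.
```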
Mappings & Field Types
A mapping is the schema definition for an index — it defines field names, types, and how each field is analyzed and stored. Getting mappings wrong is expensive because you cannot change a field's type after creation — you must reindex.
```
Text types:
  text           — analyzed, broken into tokens, for full-text search
  keyword        — NOT analyzed, stored as-is, for exact match/sort/agg

Numeric:
  integer, long, float, double, scaled_float

Date:
  date           — ISO 8601 or epoch millis, supports range queries

Boolean:
  boolean        — true/false

Geo:
  geo_point      — lat/lon coordinate
  geo_shape      — polygons, lines, complex shapes

Specialized:
  ip             — IPv4/IPv6 addresses
  completion     — FST-based autocomplete (fastest prefix matching)
  dense_vector   — embedding vectors for kNN search
  nested         — arrays of objects with independent field queries
  join           — parent-child relationships within an index

Object (default for JSON objects):
  object         — flattened, inner fields lose independence
  nested         — preserves object boundaries (more expensive)
  flattened      — entire JSON as single opaque field
```
Critical Mapping Decisions
- ✅ text vs keyword — the most common mistake; use text for search, keyword for exact match/sort/agg
- ✅ index: false — store a field but don't index it (saves disk, can't search on it)
- ✅ doc_values: false — disable for fields never used in aggregations/sorting (saves disk)
- ✅ dynamic: strict — reject documents with unmapped fields (prevents mapping explosion)
- ✅ copy_to — copy multiple fields into a catch-all field for simple cross-field search (see the sketch after this list)
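A minimal sketch combining several of these options (index and field names are illustrative):

```
PUT /articles
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "title":    { "type": "text", "copy_to": "all_text" },
      "body":     { "type": "text", "copy_to": "all_text" },
      "all_text": { "type": "text" },
      "raw_html": { "type": "keyword", "index": false },
      "status":   { "type": "keyword", "doc_values": false }
    }
  }
}

# all_text → one catch-all field for simple cross-field search
# raw_html → kept in _source but not searchable (saves disk)
# status   → searchable with term queries, but no sorting/aggregations
```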
text vs keyword
This is the single most important mapping decision and the most common source of bugs. text fields are analyzed (broken into tokens for full-text search). keyword fields are stored as-is (for exact match, sorting, and aggregations).
| Aspect | text | keyword |
|---|---|---|
| Analysis | Yes — tokenized, lowercased, stemmed | No — stored exactly as provided |
| Search type | Full-text (match query) | Exact match (term query) |
| Sorting | Cannot sort (multiple tokens per field) | Can sort alphabetically |
| Aggregations | Cannot aggregate efficiently | Can aggregate (terms agg) |
| Example field | Product description, article body | Email, status, country code, URL |
| Storage | Inverted index (tokens) | Doc values (columnar) + inverted index |
```
# Document indexed:
PUT /users/_doc/1
{ "email": "Alice@Example.COM" }

# If email is mapped as "text" (WRONG for email):
#   Analyzed: ["alice", "example.com"] or ["alice", "example", "com"]
#   term query for "Alice@Example.COM" → NO MATCH (it's been lowercased/tokenized)
#   match query for "alice" → MATCHES (but also matches "alice in wonderland")

# If email is mapped as "keyword" (CORRECT for email):
#   Stored as-is: "Alice@Example.COM"
#   term query for "Alice@Example.COM" → EXACT MATCH ✓
#   Can sort, aggregate, and filter on exact values

# The multi-field pattern (have both):
"title": {
  "type": "text",                  ← full-text search on title
  "fields": {
    "raw": { "type": "keyword" }   ← exact match, sort, aggregate on title.raw
  }
}
```
Multi-fields & Mapping Design
Multi-fields let you index the same source field in multiple ways. The most common pattern: a text sub-field for search and a keyword sub-field for sorting/aggregation.
```
PUT /products
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "english",
        "fields": {
          "raw": { "type": "keyword" },
          "autocomplete": {
            "type": "text",
            "analyzer": "autocomplete_analyzer",
            "search_analyzer": "autocomplete_search"
          }
        }
      },
      "description": { "type": "text", "analyzer": "english" },
      "price": { "type": "scaled_float", "scaling_factor": 100 },
      "category": { "type": "keyword" },
      "tags": { "type": "keyword" },
      "created_at": { "type": "date" },
      "location": { "type": "geo_point" },
      "metadata": { "type": "object", "enabled": false }
    }
  }
}

# name              → full-text search with English stemming
# name.raw          → exact match, sorting, aggregations
# name.autocomplete → typeahead/prefix matching
# metadata          → stored but not indexed (enabled: false)
```
💡 You Cannot Change Field Types
Once a field is mapped, you cannot change its type. Adding new fields is fine, but changing "text" to "keyword" requires creating a new index with the correct mapping and reindexing all documents. Use the alias swap pattern for zero-downtime reindex.
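A sketch of that pattern (index and alias names are illustrative):

```
# 1. Create the new index with the corrected mapping
PUT /products_v2
{
  "mappings": {
    "properties": {
      "sku": { "type": "keyword" }
    }
  }
}

# 2. Copy documents into it
POST /_reindex
{
  "source": { "index": "products_v1" },
  "dest":   { "index": "products_v2" }
}

# 3. Atomically repoint the alias the application queries
POST /_aliases
{
  "actions": [
    { "remove": { "index": "products_v1", "alias": "products" } },
    { "add":    { "index": "products_v2", "alias": "products" } }
  ]
}

# Searches against the "products" alias switch to the new index
# with no downtime; the old index can be deleted afterwards.
```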
Interview Questions
Q: What is an inverted index and why is it fast?
A: An inverted index maps each unique term to the list of documents containing it (term → doc IDs). Search is fast because it's a direct lookup — find the term in the sorted dictionary, get the postings list. No scanning needed. It's built at write time so reads are O(1) to O(log N) regardless of collection size. Traditional databases scan every row (O(N)) because they use forward indices (doc → terms).
Q: Why must index-time and query-time analysis match?
A: If you stem 'running' to 'run' at index time but don't stem the query 'running', the query looks for 'running' in the inverted index but only 'run' exists — no match. Both sides must produce the same tokens. This is why ES applies the same analyzer at both times by default. Using different analyzers (search_analyzer) is only safe when you understand the asymmetry (e.g., edge n-grams at index time, standard at query time).
Q: Explain the difference between text and keyword field types.
A: text fields are analyzed — broken into tokens by an analyzer for full-text search. You can't sort or aggregate on them efficiently. keyword fields are stored as-is — no analysis. They support exact match, sorting, and aggregations but not full-text search. The multi-field pattern gives you both: 'title' as text for search, 'title.raw' as keyword for sort/agg.
Q: Why is Elasticsearch not suitable as a primary database?
A: ES lacks ACID transactions, has no referential integrity, is eventually consistent (refresh interval), and updates are expensive (delete + reindex internally). It's optimized for read-heavy search workloads, not transactional writes. The canonical pattern: PostgreSQL as primary DB (source of truth) → sync to ES for search capabilities.
Common Mistakes
Using text type for exact-match fields
Mapping email, status, or ID fields as 'text'. Queries for 'ACTIVE' don't match because the field was lowercased to 'active'. Aggregations return individual tokens instead of full values.
✅ Use 'keyword' for any field that needs exact match, sorting, or aggregation. Use the multi-field pattern (text + keyword sub-field) when you need both full-text search AND exact operations on the same field.
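For example, with status mapped as keyword, a terms aggregation buckets the full stored values; a sketch:

```
GET /users/_search
{
  "size": 0,
  "aggs": {
    "by_status": { "terms": { "field": "status" } }
  }
}

# Buckets contain whole values such as "ACTIVE", exactly as stored.
# With status mapped as text, the same aggregation either fails
# (fielddata is disabled by default) or buckets individual tokens.
```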
Dynamic mapping in production
Leaving dynamic mapping enabled. A single malformed document with unexpected fields creates hundreds of new field mappings, bloating cluster state and causing performance issues.
✅ Always set dynamic: 'strict' in production mappings. This rejects documents with unmapped fields. Define all fields explicitly. Use dynamic: false if you want to store but not index unknown fields.
Using term query on text fields
Running a term query (exact match) on a text field. The field was analyzed to lowercase tokens, but the term query doesn't analyze the input — 'iPhone' doesn't match the indexed token 'iphone'.
✅ Use match query for text fields (it analyzes the input). Use term query only for keyword fields. If you need exact match on a text field, add a keyword sub-field and query that instead.
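A sketch of the difference, assuming name is a text field with a name.raw keyword sub-field:

```
# WRONG: term query on an analyzed text field
GET /products/_search
{ "query": { "term": { "name": "iPhone" } } }
# No match: the index holds the token "iphone" and term does not
# analyze its input.

# RIGHT: match query analyzes the input the same way as the field
GET /products/_search
{ "query": { "match": { "name": "iPhone" } } }

# RIGHT: term query on the keyword sub-field for true exact match
GET /products/_search
{ "query": { "term": { "name.raw": "iPhone" } } }
```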
Not using the multi-field pattern
Creating separate fields for search and aggregation (name_search as text, name_exact as keyword), requiring double the storage and complex indexing logic.
✅ Use multi-fields: map the field as text with a .raw keyword sub-field. One source field, multiple indexing strategies. Search on 'name', sort/aggregate on 'name.raw'.
Mismatched analyzers between index and query time
Using a custom analyzer at index time but forgetting to set search_analyzer. Or using edge n-grams at both index and query time, causing over-matching.
✅ For autocomplete: use an edge_ngram analyzer at index time and a standard analyzer at query time (search_analyzer). Always test with the _analyze API to verify tokens match expectations.