🗓️ 04102025 0040
How Elasticsearch processes and tokenizes text for searching.
What is an Analyzer?
An analyzer transforms text into tokens (terms) that can be searched. It runs during:
- Indexing - When documents are added
 - Searching - When queries are executed
 
Key point: Query text must be analyzed the same way as indexed text for matches to work!
Analyzer Components
- Character Filters - Transform characters (e.g., strip HTML, replace & → and)
- Tokenizer - Split text into tokens (words)
- Token Filters - Modify tokens (lowercase, stemming, synonyms, stop words)
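All three stages can be exercised directly with the _analyze API by listing the components in the request (the HTML-strip filter and sample text here are just illustrative):

```
GET /_analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<p>Hello World!</p>"
}
```

The HTML is stripped first, then the tokenizer splits the text, then lowercasing runs, yielding the tokens hello and world.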
 
Built-in Analyzers
| Analyzer | Example Input | Output | Use Case | 
|---|---|---|---|
| Standard (default) | "Hello World!" | ["hello", "world"] | General purpose | 
| Simple | "Hello World-123!" | ["hello", "world"] | Splits on non-letters | 
| Whitespace | "Hello World!" | ["Hello", "World!"] | No lowercasing | 
| English | "running quickly" | ["run", "quick"] | English text with stemming | 
| Keyword | "Hello World!" | ["Hello World!"] | No analysis, exact match | 
| Stop | "the quick fox" | ["quick", "fox"] | Removes common words | 
Key Concepts
Tokenization
Breaking text into words/terms.
- Standard tokenizer: Splits on whitespace and punctuation
 - Whitespace tokenizer: Only splits on spaces
 - Pattern tokenizer: Custom regex pattern
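A custom tokenizer can be tested inline in an _analyze request; a minimal sketch of a pattern tokenizer that splits on commas:

```
GET /_analyze
{
  "tokenizer": {
    "type": "pattern",
    "pattern": ","
  },
  "text": "one,two,three"
}
```

The pattern is the regex to split on (the default is \W+), so this should return the tokens one, two, and three.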
 
Normalization
Making tokens searchable.
- Lowercase: "Hello" → "hello" (match case-insensitively)
 - ASCII folding: "café" → "cafe" (remove accents)
 - Stemming: "running" → "run" (match word variations)
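All three normalizations can be chained as token filters in one _analyze call (the sample text is illustrative; the stemmer filter defaults to English):

```
GET /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "asciifolding", "stemmer"],
  "text": "Café RUNNING"
}
```

Lowercasing, accent folding, and stemming should reduce the input to the tokens cafe and run.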
 
Stop Words
Common words removed during analysis (e.g., "the", "is", "at").
- Reduces index size
 - Can improve search quality
 - May hurt phrase search accuracy
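A stop filter with a custom word list can be tried out via _analyze (the stopword list and text here are illustrative):

```
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "stop", "stopwords": ["the", "is", "at"] }
  ],
  "text": "the fox is at home"
}
```

Only fox and home survive, which shows why phrase queries over stopped fields can behave unexpectedly.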
 
Analyzer Selection Guide
Full-Text Search (Natural Language)
- Standard - General purpose, good default
 - English (or another language analyzer) - for language-specific stemming and stop words
 
Exact Match (IDs, Codes, Email)
- Keyword - No analysis, exact string matching
 
Autocomplete / Prefix Search
- Edge n-gram - Custom analyzer for prefix matching
 - Completion suggester - Built-in autocomplete
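A sketch of an edge n-gram autocomplete setup (index name, filter name, and gram sizes are all assumptions to adapt):

```
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "autocomplete_filter"]
        }
      }
    }
  }
}
```

With this, "elastic" is indexed as e, el, ela, … so prefix queries match ordinary term lookups.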
 
Case-Sensitive Search
- Whitespace - Keep original casing
 - Custom analyzer without lowercase filter
 
Common Patterns
Multi-Field Mapping
Index same field with different analyzers:
- title - Standard analyzer (full-text search)
- title.keyword - Keyword analyzer (exact match, sorting, aggregations)
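This is the classic multi-field mapping; a minimal sketch (index and field names are assumptions):

```
PUT /my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      }
    }
  }
}
```

Queries then hit title for full-text search and title.keyword for exact match, sorting, and aggregations.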
Search-Time vs Index-Time Analysis
- Usually use same analyzer for both
 - Exception: synonyms - often applied at search time only, so the synonym list can be updated without reindexing
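A sketch of a search-time-only synonym setup using the search_analyzer mapping parameter (all names and the synonym list are illustrative):

```
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": ["tv, television"]
        }
      },
      "analyzer": {
        "search_with_synonyms": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_synonyms"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "search_with_synonyms"
      }
    }
  }
}
```

Documents are indexed with the plain standard analyzer; only queries are expanded with synonyms, so the list can change without reindexing.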
 
Custom Analyzers
Combine character filters, tokenizers, and token filters to create custom analysis chain.
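A minimal custom analyzer tying the three component types together (names are assumptions; the mapping char filter implements the & → and replacement mentioned above):

```
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "and_mapper": {
          "type": "mapping",
          "mappings": ["& => and"]
        }
      },
      "analyzer": {
        "my_custom": {
          "type": "custom",
          "char_filter": ["and_mapper"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```

Reference the analyzer by name ("analyzer": "my_custom") in a field mapping to use it.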
Analysis vs Search Behavior
| Field Type | Analyzed? | Query Type | Behavior | 
|---|---|---|---|
| text | Yes | match | Both analyzed the same way, tokens match |
| text | Yes | term | ⚠️ Query NOT analyzed, usually fails! |
| keyword | No | match | Query passes through the no-op keyword analyzer, exact match |
| keyword | No | term | ✅ Neither analyzed, exact match |
Rule: Use match for text fields, term for keyword fields.
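A side-by-side sketch of the rule, assuming the multi-field title / title.keyword mapping from above:

```
GET /my_index/_search
{
  "query": { "match": { "title": "Hello World" } }
}

GET /my_index/_search
{
  "query": { "term": { "title.keyword": "Hello World!" } }
}
```

The match query is analyzed into the same lowercase tokens as the indexed text; the term query compares the raw string against the unanalyzed keyword value.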
Testing Analyzers
Use the _analyze API to see how text is processed:
```
GET /_analyze
{
  "analyzer": "standard",
  "text": "Hello World!"
}
```
Test custom field analysis:
```
GET /my_index/_analyze
{
  "field": "title",
  "text": "Hello World!"
}
```