🗓️ 04102025 0040
How Elasticsearch processes and tokenizes text for searching.
What is an Analyzer?
An analyzer transforms text into tokens (terms) that can be searched. It runs during:
- Indexing - When documents are added
 - Searching - When queries are executed
 
Key point: Query text must be analyzed the same way as indexed text for matches to work!
Analyzer Components
- Character Filters - Transform characters (e.g., strip HTML, replace & → and)
- Tokenizer - Split text into tokens (words)
- Token Filters - Modify tokens (lowercase, stemming, synonyms, stop words)
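All three stages can be exercised directly with the _analyze API by listing the components in the request (the HTML-strip filter and sample text here are just illustrative):

```
GET /_analyze
{
  "char_filter": ["html_strip"],
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": "<p>Hello World!</p>"
}
```

The HTML is stripped first, then the tokenizer splits the text, then lowercasing runs, yielding the tokens hello and world.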
 
Built-in Analyzers
| Analyzer | Example Input | Output | Use Case | 
|---|---|---|---|
| Standard (default) | "Hello World!" | ["hello", "world"] | General purpose | 
| Simple | "Hello World-123!" | ["hello", "world"] | Splits on non-letters | 
| Whitespace | "Hello World!" | ["Hello", "World!"] | No lowercasing | 
| English | "running quickly" | ["run", "quick"] | English text with stemming | 
| Keyword | "Hello World!" | ["Hello World!"] | No analysis, exact match | 
| Stop | "the quick fox" | ["quick", "fox"] | Removes common words | 
Key Concepts
Tokenization
Breaking text into words/terms.
- Standard tokenizer: Splits on whitespace and punctuation
 - Whitespace tokenizer: Only splits on spaces
 - Pattern tokenizer: Custom regex pattern
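A custom tokenizer can be tested inline in an _analyze request; a minimal sketch of a pattern tokenizer that splits on commas:

```
GET /_analyze
{
  "tokenizer": {
    "type": "pattern",
    "pattern": ","
  },
  "text": "one,two,three"
}
```

The pattern is the regex to split on (the default is \W+), so this should return the tokens one, two, and three.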
 
Normalization
Making tokens searchable.
- Lowercase: "Hello" → "hello" (match case-insensitively)
 - ASCII folding: "café" → "cafe" (remove accents)
 - Stemming: "running" → "run" (match word variations)
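All three normalizations can be chained as token filters in one _analyze call (the sample text is illustrative; the stemmer filter defaults to English):

```
GET /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "asciifolding", "stemmer"],
  "text": "Café RUNNING"
}
```

Lowercasing, accent folding, and stemming should reduce the input to the tokens cafe and run.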
 
Stop Words
Common words removed during analysis (e.g., "the", "is", "at").
- Reduces index size
 - Can improve search quality
 - May hurt phrase search accuracy
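A stop filter with a custom word list can be tried out via _analyze (the stopword list and text here are illustrative):

```
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [
    { "type": "stop", "stopwords": ["the", "is", "at"] }
  ],
  "text": "the fox is at home"
}
```

Only fox and home survive, which shows why phrase queries over stopped fields can behave unexpectedly.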
 
Analyzer Selection Guide
Full-Text Search (Natural Language)
- Standard - General purpose, good default
 - English (or another language analyzer) - for language-specific stemming and stop words
 
Exact Match (IDs, Codes, Email)
- Keyword - No analysis, exact string matching
 
Autocomplete / Prefix Search
- Edge n-gram - Custom analyzer for prefix matching
 - Completion suggester - Built-in autocomplete
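A sketch of an edge n-gram autocomplete setup (index name, filter name, and gram sizes are all assumptions to adapt):

```
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "autocomplete_filter"]
        }
      }
    }
  }
}
```

With this, "elastic" is indexed as e, el, ela, … so prefix queries match ordinary term lookups.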
 
Case-Sensitive Search
- Whitespace - Keep original casing
 - Custom analyzer without lowercase filter
 
Common Patterns
Multi-Field Mapping
Index same field with different analyzers:
- title - Standard analyzer (full-text search)
- title.keyword - Keyword analyzer (exact match, sorting, aggregations)
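This is the classic multi-field mapping; a minimal sketch (index and field names are assumptions):

```
PUT /my_index
{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword" }
        }
      }
    }
  }
}
```

Queries then hit title for full-text search and title.keyword for exact match, sorting, and aggregations.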
Search-Time vs Index-Time Analysis
- Usually use same analyzer for both
 - Exception: synonyms - often applied at search time only, so the synonym list can be updated without reindexing
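A sketch of a search-time-only synonym setup using the search_analyzer mapping parameter (all names and the synonym list are illustrative):

```
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "synonyms": ["tv, television"]
        }
      },
      "analyzer": {
        "search_with_synonyms": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_synonyms"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "search_with_synonyms"
      }
    }
  }
}
```

Documents are indexed with the plain standard analyzer; only queries are expanded with synonyms, so the list can change without reindexing.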
 
Custom Analyzers
Combine character filters, tokenizers, and token filters to create custom analysis chain.
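A minimal custom analyzer tying the three component types together (names are assumptions; the mapping char filter implements the & → and replacement mentioned above):

```
PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "and_mapper": {
          "type": "mapping",
          "mappings": ["& => and"]
        }
      },
      "analyzer": {
        "my_custom": {
          "type": "custom",
          "char_filter": ["and_mapper"],
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```

Reference the analyzer by name ("analyzer": "my_custom") in a field mapping to use it.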
Analysis vs Search Behavior
| Field Type | Analyzed? | Query Type | Behavior | 
|---|---|---|---|
| text | Yes | match | Both analyzed the same way, tokens match |
| text | Yes | term | ⚠️ Query NOT analyzed, usually fails! |
| keyword | No | match | Query passes through the no-op keyword analyzer, exact match |
| keyword | No | term | ✅ Neither analyzed, exact match |
Rule: Use match for text fields, term for keyword fields.
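A side-by-side sketch of the rule, assuming the multi-field title / title.keyword mapping from above:

```
GET /my_index/_search
{
  "query": { "match": { "title": "Hello World" } }
}

GET /my_index/_search
{
  "query": { "term": { "title.keyword": "Hello World!" } }
}
```

The match query is analyzed into the same lowercase tokens as the indexed text; the term query compares the raw string against the unanalyzed keyword value.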
Testing Analyzers
Use the _analyze API to see how text is processed:
```
GET /_analyze
{
  "analyzer": "standard",
  "text": "Hello World!"
}
```
Test custom field analysis:
```
GET /my_index/_analyze
{
  "field": "title",
  "text": "Hello World!"
}
```