06112024 2346
elasticsearch_book_chapter_3
Some terms
Term | Description |
---|---|
Document | Top-level / root object that is serialized into JSON and stored on ES (has an ID) |
Index | Logical namespace of a group of shards |
Indexed | Stored and made searchable |
Metadata
- Required fields:
  - _index: database
  - _type: something like a schema
  - _id: unique string to identify a document
- These 3 fields uniquely identify a document
- Others (covered in future chapter)
Data is stored and indexed in shards
Specifying own ID
- PUT verb (store this document AT this URL) e.g.
PUT /{index}/{type}/{id}
Autogenerating ID
- POST verb (store this document under this URL) e.g.
POST /{index}/{type}
- Autogenerated ID:
- 22 characters long
- URL safe
- Base64-encoded UUID strings
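The three properties above fit together: a UUID is 16 bytes, which Base64-encodes to 24 characters including two `=` padding characters, so dropping the padding leaves 22. The sketch below only illustrates that arithmetic with Python's standard library. It is not how ES generates its IDs, and the function name `url_safe_id` is made up for this note.

```python
import base64
import uuid


def url_safe_id() -> str:
    """Return a random UUID as a URL-safe Base64 string without padding."""
    raw = uuid.uuid4().bytes  # a UUID is always 16 bytes
    # 16 bytes encode to 24 Base64 characters, the last two being '=' padding.
    encoded = base64.urlsafe_b64encode(raw).decode("ascii")
    return encoded.rstrip("=")


doc_id = url_safe_id()
print(doc_id, len(doc_id))
```

The URL-safe alphabet uses `-` and `_` instead of `+` and `/`, so the ID can be placed in a path such as /{index}/{type}/{id} without escaping.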
Retrieving data
Entire document
- HTTP GET
Checking if document exists
- Use HEAD instead of GET
Deleting a document
Sample response if document found
{
"found" : true,
"_index" : "website",
"_type" : "blog",
"_id" : "123",
"_version" : 3
}
Sample response if document not found
{
"found" : false,
"_index" : "website",
"_type" : "blog",
"_id" : "123",
"_version" : 4
}
Version is incremented even if the document is not found => internal bookkeeping to ensure changes are applied in the correct order across multiple nodes
Updating a document
- Documents are immutable
- Reindex / replace document when updating
- Version number updated
- Internally (done in a single API call)
- Retrieve old document
- Change
- Delete old document
- Index new document
- Adopts a last-write-wins approach by default
- Uses optimistic concurrency control if version parameter specified
Internally, ES marks the old document as deleted and adds an entirely new document (the old version is eventually cleaned up as more data is indexed)
Dealing with conflicts
Approaches to deal with concurrent updates to ensure that no data is lost
Pessimistic concurrency control
- Assumes conflicts are likely to happen
- Blocks access to a resource in order to prevent conflicts
- e.g. locking a row before reading data
Optimistic concurrency control (used by ES)
- Assumes conflicts are unlikely
- Doesn't block operations
- Underlying data modified between reading and writing => update fails
It's up to the application to handle the failure
- Reattempt update
- Report failure to user
- ...
Example
- Init document (version 1)
- Update document
PUT /website/blog/1?version=1
=> version updated to 2
- Update document again with the stale version
PUT /website/blog/1?version=1
=> error (version conflict)
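To make the example concrete, here is a minimal in-memory sketch of the same version check. It is a toy model for these notes, not ES code: `VersionedStore` and `VersionConflict` are hypothetical names, and a real cluster does this per shard.

```python
class VersionConflict(Exception):
    """Raised when the caller's version does not match the stored one."""


class VersionedStore:
    """Toy document store that bumps a version number on every write."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, body)

    def index(self, doc_id, body, version=None):
        current_version, _ = self._docs.get(doc_id, (0, None))
        # With a version given, the write only succeeds if nobody else
        # has changed the document since the caller read it.
        if version is not None and version != current_version:
            raise VersionConflict(f"expected {version}, found {current_version}")
        new_version = current_version + 1
        self._docs[doc_id] = (new_version, body)
        return new_version


store = VersionedStore()
store.index("1", {"title": "first"})                 # init, version becomes 1
store.index("1", {"title": "second"}, version=1)     # ok, version becomes 2
try:
    store.index("1", {"title": "third"}, version=1)  # stale version
except VersionConflict as err:
    print("conflict:", err)
```

Without the version argument the write always goes through, which matches the last-write-wins default.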
Using versions from external system
- Common setup: use some other DB as primary data source, ES to make data searchable
- Can use version number of main DB with ES (e.g. timestamp)
- Handling by ES is a bit different => checks that current _version is less than specified version
PUT /website/blog/2?version=5&version_type=external
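The difference from the internal check is that the comparison is "strictly greater" instead of "equal". A small sketch of that rule, assuming the external version is an integer such as a timestamp (the function name `accepts_external_version` is made up for illustration):

```python
def accepts_external_version(current_version, new_version):
    """Return True if a write with new_version should be accepted.

    current_version is None when the document does not exist yet.
    """
    return current_version is None or new_version > current_version


print(accepts_external_version(None, 5))   # new document
print(accepts_external_version(5, 10))     # newer than what is stored
print(accepts_external_version(10, 10))    # same version is rejected
print(accepts_external_version(10, 7))     # older version is rejected
```

Because only "newer" writes win, replaying an old change from the primary DB cannot overwrite a more recent one in ES.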
Partial Updates
- Retrieve-change-reindex process as well
- Happens within a shard => avoid network overhead of multiple requests => reduce likelihood of conflicting changes
Using scripts
- Actually, don't really get the benefit either
- Default scripting language: Groovy (runs in a sandbox to prevent malicious users from breaking out of ES and attacking the server)
- Don't really get this either
Upsert
- Updating a nonexistent document will fail
- Specify upsert parameter to create document if it doesn't exist
POST /website/pageviews/1/_update
{
"script" : "ctx._source.views+=1",
"upsert": {
"views": 1
}
}
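The behaviour of the request above can be sketched with a plain dict standing in for the index. This is an assumption-laden toy (the names `update_with_upsert` and `increment_views` are hypothetical), but it shows the key point: on the first call the upsert document is stored as-is and the script does not run.

```python
def update_with_upsert(store, doc_id, script, upsert):
    """Run script on the stored document, or store the upsert document."""
    if doc_id in store:
        script(store[doc_id])          # document exists: apply the script
    else:
        store[doc_id] = dict(upsert)   # document missing: create it from upsert
    return store[doc_id]


def increment_views(source):
    # Stand-in for the script "ctx._source.views+=1"
    source["views"] += 1


pageviews = {}
update_with_upsert(pageviews, "1", increment_views, {"views": 1})  # created, views = 1
update_with_upsert(pageviews, "1", increment_views, {"views": 1})  # script runs, views = 2
print(pageviews)
```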
Updates and conflicts
- Smaller window between retrieve / reindex => smaller opportunity for conflicting changes
- But doesn't mean zero chance
- For cases whereby it doesn't matter that a document has been changed, can just reattempt
Retry this update five times before failing:
POST /website/pageviews/1/_update?retry_on_conflict=5
{
"script" : "ctx._source.views+=1",
"upsert": {
"views": 0
}
}
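What retry_on_conflict does can be sketched as a loop around the retrieve-change-reindex step. The code below is a simulation only: `update_with_retry`, `make_flaky_update` and `VersionConflict` are hypothetical names, and the "update" is a stub that conflicts a fixed number of times.

```python
class VersionConflict(Exception):
    """Stand-in for the conflict error raised when the document changed."""


def update_with_retry(attempt_update, retry_on_conflict):
    """Call attempt_update until it succeeds or the retries are used up."""
    for attempt in range(retry_on_conflict + 1):  # first try plus the retries
        try:
            return attempt_update()
        except VersionConflict:
            if attempt == retry_on_conflict:
                raise  # out of retries: surface the conflict to the caller


def make_flaky_update(conflicts_before_success):
    """Build a fake update that conflicts a fixed number of times first."""
    remaining = {"count": conflicts_before_success}

    def attempt():
        if remaining["count"] > 0:
            remaining["count"] -= 1
            raise VersionConflict("document changed between retrieve and reindex")
        return "updated"

    return attempt


print(update_with_retry(make_flaky_update(2), retry_on_conflict=5))
```

This only makes sense for order-independent changes like incrementing a counter, where redoing the change on the newer document is still correct.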
Retrieving Multiple Documents
- MGET => Avoids network overhead
- Expects a docs array of required metadata
- Response is successful even if there are missing documents
- Need to rely on found flag
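A small sketch of building the docs array and then splitting a response by the found flag. No request is sent here: `sample_response` is a hand-written stand-in for what an mget call might return, and the helper names are made up for this note.

```python
def build_mget_body(index, doc_type, ids):
    """Build the docs array of required metadata for an mget request."""
    return {"docs": [{"_index": index, "_type": doc_type, "_id": i} for i in ids]}


def split_found(response):
    """Separate documents that were found from the IDs that were missing."""
    found = [d for d in response["docs"] if d.get("found")]
    missing = [d["_id"] for d in response["docs"] if not d.get("found")]
    return found, missing


body = build_mget_body("website", "blog", ["1", "2"])

# Hypothetical response: the call as a whole succeeds even though "2" is missing.
sample_response = {
    "docs": [
        {"_index": "website", "_type": "blog", "_id": "1",
         "found": True, "_source": {"title": "My first post"}},
        {"_index": "website", "_type": "blog", "_id": "2", "found": False},
    ]
}

found, missing = split_found(sample_response)
print(len(found), missing)
```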
bulk API
- Allow multiple create, index, update, delete requests in a single step
{ action: { metadata }}\n
{ request body }\n
{ action: { metadata }}\n
{ request body }\n
...
- Every line (including the last line) must end with \n for efficient line separation
- Cannot contain unescaped newline characters (must not be pretty printed, will interfere with parsing)
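The format rules above are easy to get right by serializing each line separately. A minimal sketch, assuming operations are given as (action, metadata, body) tuples (that tuple shape and the name `build_bulk_body` are choices made for this note, not part of ES):

```python
import json


def build_bulk_body(operations):
    """Serialize (action, metadata, body) tuples into a bulk request body.

    body is None for actions without a request body line, such as delete.
    """
    lines = []
    for action, metadata, body in operations:
        lines.append(json.dumps({action: metadata}))
        if body is not None:
            lines.append(json.dumps(body))
    # json.dumps never emits raw newlines by default, so each entry stays
    # on one line; every line, including the last, is terminated with \n.
    return "".join(line + "\n" for line in lines)


bulk_body = build_bulk_body([
    ("index", {"_index": "website", "_type": "blog", "_id": "1"},
     {"title": "line one\nline two"}),
    ("delete", {"_index": "website", "_type": "blog", "_id": "2"}, None),
])
print(bulk_body, end="")
```

Note that the newline inside the title is escaped by the JSON encoder, so it does not break the line-based parsing.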
DRY
bulk API accepts a default /_index or /_index/_type in the URL
How big is too big
- Entire bulk request needs to be loaded into memory => request too big => less memory available for other requests
Excerpt from book: Fortunately, it is easy to find this sweet spot: Try indexing typical documents in batches of increasing size. When performance starts to drop off, your batch size is too big.
A good place to start is with batches of 1,000 to 5,000 documents or, if your documents are very large, with even smaller batches.
It is often useful to keep an eye on the physical size of your bulk requests. One thousand 1KB documents is very different from one thousand 1MB documents.
A good bulk size to start playing with is around 5-15MB in size
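One way to respect both limits at once is to cut a batch when either the document count or the byte size would be exceeded. A sketch under assumptions: the defaults (1,000 documents, 5MB) are just the low end of the starting points quoted above, and `batches` is a hypothetical helper, not an ES client function.

```python
import json


def batches(docs, max_docs=1000, max_bytes=5 * 1024 * 1024):
    """Yield lists of docs limited by both document count and JSON byte size."""
    batch, batch_bytes = [], 0
    for doc in docs:
        size = len(json.dumps(doc).encode("utf-8"))
        # Close the current batch if adding this doc would break either limit.
        if batch and (len(batch) >= max_docs or batch_bytes + size > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += size
    if batch:
        yield batch


docs = [{"n": i} for i in range(2500)]
print([len(b) for b in batches(docs)])
```

A single document larger than max_bytes still goes out in a batch of its own, since dropping it silently would lose data.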
References
- Elasticsearch O'Reilly book, chapter 3