A Simple Guide for Updating Documents in Elasticsearch

Written by rocksetcloud | Published 2024/04/12
Tech Story Tags: elasticsearch | apache-lucene | change-data-capture | open-source-software | performance-optimization | partial-elasticsearch-updates | full-elasticsearch-updates | good-company

TL;DR: This blog explores essential strategies for handling updates in Elasticsearch, vital for search and analytics applications. Learn about full updates, partial updates, and scripted updates, along with their implications for CPU utilization, and explore alternatives like Rockset for efficient management of frequent document modifications.

Elasticsearch is an open-source search and analytics engine based on Apache Lucene. When building applications on change data capture (CDC) data using Elasticsearch, you’ll want to architect the system to handle frequent updates or modifications to the existing documents in an index.

In this blog, we’ll walk through the different options available for updates including full updates, partial updates, and scripted updates. We’ll also discuss what happens under the hood in Elasticsearch when modifying a document and how frequent updates impact CPU utilization in the system.

Example Application with Frequent Updates

To better understand use cases that have frequent updates, let’s look at a search application for a video streaming service like Netflix. When a user searches for a show, e.g. "political thriller", they are returned a set of relevant results based on keywords and other metadata.

Let’s look at an example document in Elasticsearch of the show “House of Cards”:

{
	"name": "House of Cards",
	"description": "Frank Underwood is a Democrat appointed as the Secretary of State. Along with his wife, he sets out on a quest to seek revenge from the people who betrayed him while successfully rising to supremacy.",
	"genres": ["drama", "thriller"],
	"views": 100,
}

The search can be configured in Elasticsearch to use name and description as full-text search fields. The views field, which stores the number of views per title, can be used to boost content, ranking more popular shows higher. The views field is incremented every time a user watches an episode of a show or a movie.
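For illustration, a minimal mapping for such a movies index might look like the sketch below, with name and description as full-text fields and views as an integer that can feed a boosting function. The field names follow the example document above; treat the exact types as assumptions rather than a prescribed schema.

from elasticsearch import Elasticsearch

# Connect to your Elasticsearch instance
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

# Illustrative mapping: full-text fields for search, a keyword field for
# genre filtering, and an integer 'views' field that can be used for boosting
es.indices.create(
    index="movies",
    body={
        "mappings": {
            "properties": {
                "name": {"type": "text"},
                "description": {"type": "text"},
                "genres": {"type": "keyword"},
                "views": {"type": "integer"}
            }
        }
    }
)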

When using this search configuration in an application at the scale of Netflix, the number of updates performed can easily cross millions per minute, as suggested by the Netflix Engagement Report. According to the report, users watched ~100 billion hours of content from January to July. Assuming an average watch time of 15 minutes per episode or movie, the number of views per minute reaches 1.3 million on average. With the search configuration specified above, each of those views would trigger a document update, putting the workload in the millions of updates per minute.
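As a rough sanity check, the back-of-envelope arithmetic behind that estimate looks like this (the hours watched and average watch time are the assumptions stated above):

# Back-of-envelope estimate of view-count updates per minute
hours_watched = 100_000_000_000                   # ~100 billion hours, January to July
avg_watch_minutes = 15                            # assumed watch time per episode or movie
views = hours_watched * 60 / avg_watch_minutes    # ~400 billion views

minutes_in_period = 212 * 24 * 60                 # roughly seven months, in minutes
print(round(views / minutes_in_period))           # ~1.3 million updates per minute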

Many search and analytics applications can experience frequent updates, especially when built on CDC data.

Performing Updates in Elasticsearch

Let’s delve into a general example of how to perform an update in Elasticsearch with the code below:

from elasticsearch import Elasticsearch

# Connect to your Elasticsearch instance
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

# Index name and document ID you want to update
index_name = 'movies'
document_id = 'your_document_id'

# Retrieve the current document to get the current 'views' value
try:
    current_doc = es.get(index=index_name, id=document_id)
    current_views = current_doc['_source']['views']
except Exception as e:
    print(f"Error retrieving current document: {e}")
    current_views = 0  # Set a default value if there's an error

# Define the update body to increment 'views' by 1
update_body = {
    "doc": {
        "views": current_views + 1  # Increment 'views' by 1
    }
}

# Perform the update
try:
    es.update(index=index_name, id=document_id, body=update_body)
    print("Document updated successfully!")
except Exception as e:
    print(f"Error updating document: {e}")

Full Updates versus Partial Updates in Elasticsearch

When performing an update in Elasticsearch, you can use the index API to replace an existing document or the update API to make a partial update to a document.

With the index API, you retrieve the entire document, make your changes, and then reindex the whole document. With the update API, you simply send the fields you wish to modify instead of the entire document. This still results in the document being reindexed, but it minimizes the amount of data sent over the network. The update API is especially useful when documents are large and sending the entire document over the network would be time-consuming.

Let’s see how both the index API and the update API work using Python code.

Full updates using the index API in Elasticsearch

from elasticsearch import Elasticsearch

# Connect to Elasticsearch
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

# Index name and document ID
index_name = "your_index"
document_id = "1"

# Retrieve the existing document
existing_document = es.get(index=index_name, id=document_id)

# Make your changes to the document
existing_document["_source"]["field1"] = "new_value1"
existing_document["_source"]["field2"] = "new_value2"

# Call the index API to perform the full update
es.index(index=index_name, id=document_id, body=existing_document["_source"])

As you can see in the code above, performing a full update this way requires two separate calls to Elasticsearch, which can result in slower performance and higher load on your cluster.

Partial updates using the update API in Elasticsearch

Partial updates still reindex the entire document internally, but they require only a single network call, making them more performant.

from elasticsearch import Elasticsearch

# Connect to Elasticsearch
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

# Index name and document ID
index_name = "your_index"
document_id = "1"

# Specify the fields to be updated
update_fields = {
    "field1": "new_value1",
    "field2": "new_value2"
}

# Use the update API to perform a partial update
es.update(index=index_name, id=document_id, body={"doc": update_fields})

You can use the update API in Elasticsearch to set the view count, but on its own it cannot increment the view count based on the previous value, because the new value depends on the old one.

Let’s see how we can fix this using a powerful scripting language, Painless.

Partial updates using Painless scripts in Elasticsearch

Painless is a scripting language designed for Elasticsearch and can be used for query and aggregation calculations, complex conditionals, data transformations and more. Painless also enables the use of scripts in update queries to modify documents based on complex logic.

In the example below, we use a Painless script to perform an update in a single API call and increment the new view count based on the value of the old view count.

from elasticsearch import Elasticsearch

# Connect to your Elasticsearch instance
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

# Index name and document ID you want to update
index_name = 'movies'
document_id = 'your_document_id'

# Define the Painless script for the update
update_script = {
    "script": {
        "lang": "painless",
        "source": "ctx._source.views += 1"  # Increment 'views' by 1
    }
}

# Perform the update using the Painless script
try:
    es.update(index=index_name, id=document_id, body=update_script)
    print("Document updated successfully!")
except Exception as e:
    print(f"Error updating document: {e}")

The Painless script is pretty intuitive: it simply increments the document's view count by 1.
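One practical note: if the document may not exist yet, the same update call can carry an upsert document that Elasticsearch indexes when there is nothing to update. A minimal sketch, reusing the index and document ID from above:

# Increment 'views' if the document exists; otherwise create it with views = 1
es.update(
    index=index_name,
    id=document_id,
    body={
        "script": {
            "lang": "painless",
            "source": "ctx._source.views += 1"
        },
        "upsert": {"views": 1}
    }
)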

Updating a Nested Object in Elasticsearch

Nested objects in Elasticsearch are a data structure that allows for the indexing of arrays of objects as separate documents within a single parent document. Nested objects are useful when dealing with complex data that naturally form a nested structure, like objects within objects. In a typical Elasticsearch document, arrays of objects are flattened, but using the nested data type allows each object in the array to be indexed and queried independently.
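As a quick illustration, a nested field is declared in the mapping with the nested type. The index, field, and attribute names below are placeholders chosen to match the update example that follows:

from elasticsearch import Elasticsearch

# Connect to your Elasticsearch instance
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

# Hypothetical mapping with a nested array of objects
es.indices.create(
    index="your_index",
    body={
        "mappings": {
            "properties": {
                "nested_field_name": {
                    "type": "nested",
                    "properties": {
                        "id": {"type": "integer"},
                        "status": {"type": "keyword"}
                    }
                }
            }
        }
    }
)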

Painless scripts can also be used to update nested objects in Elasticsearch.

from elasticsearch import Elasticsearch

# Connect to your Elasticsearch instance
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

# Index name and document ID for the example
index_name = 'your_index'
document_id = 'your_document_id'

# Specify the nested object to target and the new value
item_id = 1                       # 'id' of the nested object to update
updated_value = "new_value"

# Define the Painless script for the update.
# It loops over the nested array and updates the 'status' attribute of the
# object whose 'id' matches params.item_id.
update_script = {
    "script": {
        "lang": "painless",
        "source": """
            for (item in ctx._source.nested_field_name) {
                if (item.id == params.item_id) {
                    item.status = params.updated_value;
                }
            }
        """,
        "params": {
            "item_id": item_id,
            "updated_value": updated_value
        }
    }
}

# Perform the update using the Update API and the Painless script
try:
    es.update(index=index_name, id=document_id, body=update_script)
    print("Nested object updated successfully!")
except Exception as e:
    print(f"Error updating nested object: {e}")

Adding a New Field in Elasticsearch

Adding a new field to a document in Elasticsearch can be accomplished through an index operation or through a partial update with the update API.

When dynamic mapping is enabled on the index, introducing a new field is straightforward. Simply index or update a document containing that field, and Elasticsearch will automatically infer a suitable mapping and add the new field to it.
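For example, with dynamic mapping enabled, a partial update that introduces a category field is enough for Elasticsearch to infer a mapping for it (a sketch reusing the es client from the earlier examples; the document ID is a placeholder):

# Elasticsearch infers a mapping for 'category' the first time it sees the field
es.update(
    index="movies",
    id="your_document_id",
    body={"doc": {"category": "Political Drama"}}
)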

With dynamic mapping disabled on the index, you will need to use the update mapping API first. The example below updates the index mapping by adding a "category" field to the movies index.


PUT /movies/_mapping
{
  "properties": {
    "category": {
      "type": "keyword"
    }
  }
}
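The same mapping change can be made from the Python client; a brief sketch, again reusing the es client from earlier:

# Add the 'category' keyword field to the existing 'movies' mapping
es.indices.put_mapping(
    index="movies",
    body={
        "properties": {
            "category": {"type": "keyword"}
        }
    }
)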

Updates in Elasticsearch Under the Hood

While the code is simple, Elasticsearch internally is doing a lot of heavy lifting to perform these updates because data is stored in immutable segments. As a result, Elasticsearch cannot simply make an in-place update to a document. The only way to perform an update is to reindex the entire document, regardless of which API is used.

Elasticsearch uses Apache Lucene under the hood. A Lucene index is composed of one or more segments. A segment is a self-contained, immutable index structure that represents a subset of the overall index. When documents are added or updated, new Lucene segments are created and older documents are marked for soft deletion. Over time, as new documents are added or existing ones are updated, multiple segments may accumulate. To optimize the index structure, Lucene periodically merges smaller segments into larger ones.
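You can observe this structure, including the soft-deleted documents that segments still hold, with the cat segments API. A quick sketch against the movies index, reusing the es client from earlier:

# Each row is one Lucene segment; 'docs.count' and 'docs.deleted' show live
# documents versus soft-deleted documents awaiting a merge
print(es.cat.segments(index="movies", v=True))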

Updates are essentially inserts in Elasticsearch

Since each update operation is a reindex operation, all updates are essentially inserts with soft deletes.

There are cost implications for treating an update as an insert operation. The soft deletion of data means that old data is still retained for some time, bloating the storage and memory of the index. Performing soft deletes, reindexing, and garbage collection also takes a heavy toll on the CPU, a toll that is exacerbated by repeating these operations on all replicas.

Updates can get trickier as your product grows and your data changes over time. To keep Elasticsearch performant, you will need to update the shards, analyzers, and tokenizers in your cluster, which requires reindexing the entire cluster. For production applications, this means setting up a new cluster and migrating all of the data over. Migrating clusters is both time-intensive and error-prone, so it's not an operation to take lightly.

Updates in Elasticsearch

The simplicity of the update operations in Elasticsearch can mask the heavy operational tasks happening under the hood of the system. Elasticsearch treats each update as an insert, requiring the full document to be recreated and reindexed. For applications with frequent updates, this can quickly become expensive as we saw in the Netflix example where millions of updates happen every minute. We recommend either batching updates using the Bulk API, which adds latency to your workload, or looking at alternative solutions when faced with frequent updates in Elasticsearch.
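A minimal sketch of batching the view-count increments with the Bulk API through the Python helper, reusing the es client from the earlier examples (the document IDs are placeholders):

from elasticsearch.helpers import bulk

# Batch several script-based view-count increments into a single Bulk API request
actions = [
    {
        "_op_type": "update",
        "_index": "movies",
        "_id": doc_id,
        "script": {
            "lang": "painless",
            "source": "ctx._source.views += 1"
        }
    }
    for doc_id in ["id_1", "id_2", "id_3"]
]

bulk(es, actions)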

Rockset, a search and analytics database built in the cloud, is a mutable alternative to Elasticsearch. Being built on RocksDB, a key-value store popularized for its mutability, Rockset can make in-place updates to documents. This results in only the value of individual fields being updated and reindexed rather than the entire document.

If you’d like to compare the performance of Elasticsearch and Rockset for update-heavy workloads, you can start a free trial of Rockset with $300 in credits.

