Cache Warming Strategies for New Indices

A freshly provisioned Elasticsearch index is a performance liability for its first few minutes of life. Its pages are absent from the filesystem page cache, and its shard request cache, node query cache, and field-data structures are all empty, so the opening burst of production queries pays for cold disk reads, first-touch doc_values loads, and the one-time construction of filter bitsets and global ordinals. When a new index is promoted behind a live write alias (the moment a reindex pipeline swaps traffic onto a rebuilt index, or an ILM rollover cuts a new hot index), that cold-start penalty lands directly on user-facing p99 latency. Cache warming closes the gap by replaying a controlled, representative set of queries against the new index before it takes full traffic, so segments, filter bitsets, and aggregation structures are resident by the time real requests arrive. This page covers how to build a deterministic warming manifest, drive it from the async elasticsearch-py 8.x client without starving the search thread pool, verify that caches actually filled, and tune the thresholds that keep warming from tripping circuit breakers.

Prerequisites

Confirm each of the following before wiring warming into your cutover automation:

Elasticsearch 8.x cluster with the shard request cache enabled (index.requests.cache.enable: true, the default).
elasticsearch-py v8.x installed, with AsyncElasticsearch available for concurrent warming.
The target index exists and its shards — primaries and replicas — are STARTED, not UNASSIGNED.
doc_values enabled on every field the warming aggregations touch (timestamps, keyword facets, numeric metrics). They are on by default for these field types, so check that no mapping disables them; this keeps aggregation warming off the heap-backed field-data breaker.
A short list of the 5–10 highest-frequency production query shapes, captured from slow logs or your API gateway, to seed the warming manifest.
An automation credential scoped to read and view_index_metadata on the target index pattern — warming needs no write privileges.
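
The shard-state prerequisite can be checked mechanically before any warming request is sent. The sketch below assumes rows shaped like the JSON output of GET /_cat/shards/<index>?format=json&h=index,shard,prirep,state,node. The helper name unready_copies and the sample rows are illustrative and are not part of any client API.

```python
def unready_copies(rows: list[dict]) -> list[str]:
    """Return a label for every shard copy whose state is not STARTED."""
    problems = []
    for row in rows:
        if row.get("state") != "STARTED":
            kind = "primary" if row.get("prirep") == "p" else "replica"
            problems.append(f"{row['index']}[{row['shard']}] {kind}: {row.get('state')}")
    return problems


# Hypothetical sample rows, in the shape the cat shards API returns with format=json.
sample_rows = [
    {"index": "logs-app-prod-000042", "shard": "0", "prirep": "p",
     "state": "STARTED", "node": "node-a"},
    {"index": "logs-app-prod-000042", "shard": "0", "prirep": "r",
     "state": "UNASSIGNED", "node": None},
]
not_ready = unready_copies(sample_rows)
```

An empty result means every copy is warmable; anything else is a reason to stop before warming rather than after.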

Architecture: Where Warming Sits in the Cutover Sequence

Warming is a discrete lifecycle phase, not a side effect of ingestion. It runs after the new index is fully allocated and populated but before the write alias — or full read traffic — points at it. Treating it as its own state keeps warming decoupled from data movement and lets you fail the cutover cleanly if caches never reach a healthy hit ratio. The flow below shows how warming slots into the broader reindex state machine, taking a cold index through manifest-driven query replay to a state where the latency SLO holds.

Warming is its own phase between a STARTED-but-cold index and the alias swap — the manifest replay fills every cache before real traffic arrives.

The critical architectural constraint is that warming must reach every shard copy. The shard request cache and query cache are per-shard: warming a primary does nothing for the replica that a subsequent search might route to. Warming payloads therefore have to be dispatched with routing that fans out across the full shard set, and concurrency has to stay bounded so replay does not compete with the search thread pool that live traffic still depends on.

Configuration Reference

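Two layers of configuration govern warming: the index settings that decide what can be cached, and the manifest that decides what gets cached. Start with the index-level settings. index.requests.cache.enable and index.refresh_interval are dynamic and can be applied to the live target before warming runs. index.queries.cache.enabled is a static setting: it defaults to true and can only be set at index creation or while the index is closed, so it does not belong in a settings update on an open index.

PUT /logs-app-prod-000042/_settings
{
  "index.requests.cache.enable": true,
  "index.refresh_interval": "5s"
}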

Node-level caps bound how much heap those caches may consume. Both are static node settings, configured in elasticsearch.yml rather than per index, and they determine whether warmed entries survive:

# Node query cache: filter bitsets. Default 10% of heap.
indices.queries.cache.size: 10%
# Shard request cache: aggregation/count results. Default 1% of heap.
indices.requests.cache.size: 2%

The warming manifest maps each index pattern to the query shapes worth pre-loading. Keep it tight — a handful of representative payloads that mirror real traffic, not an exhaustive sweep. Every payload runs with size: 0 so warming populates caches without materializing document hits.

{
  "logs-app-prod-*": [
    {
      "query": { "range": { "@timestamp": { "gte": "now-1h", "lte": "now" } } }
    },
    {
      "aggs": { "status_codes": { "terms": { "field": "http.status_code", "size": 50 } } }
    },
    {
      "query": { "bool": { "filter": [ { "term": { "service.name": "auth-api" } } ] } }
    }
  ]
}

Range scans warm the page cache and the timestamp doc_values; terms aggregations warm doc_values and, when the request is cache-eligible, the shard request cache for faceted search; the filtered bool query warms the node query cache with the filter bitsets your dashboards reuse. Only cache-eligible clauses benefit: query cache entries are built for clauses evaluated in filter context, so structure warming payloads to mirror the filter shapes your application actually issues. The now-1h range above is shown for readability. A now-relative range resolves to a different instant on every call, so it is generally ineligible for the shard request cache and should be pinned to absolute timestamps in a real manifest.
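
One way to keep a time-window payload repeatable is to compute absolute bounds once and reuse them for the whole warming cycle. The following is a minimal sketch under stated assumptions: pinned_range is a hypothetical helper, the minute-level truncation is an arbitrary choice, and the caller is expected to pass a timezone-aware datetime.

```python
from datetime import datetime, timedelta, timezone


def pinned_range(field: str, window: timedelta, now: datetime) -> dict:
    """Build a range query with absolute bounds instead of now-relative date math."""
    # Truncate to the minute so payloads built within the same minute are identical.
    upper = now.astimezone(timezone.utc).replace(second=0, microsecond=0)
    lower = upper - window
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return {
        "query": {
            "range": {field: {"gte": lower.strftime(fmt), "lte": upper.strftime(fmt)}}
        }
    }


# A fixed instant is used here only so the example is reproducible.
fixed_now = datetime(2024, 5, 1, 12, 30, 45, tzinfo=timezone.utc)
payload = pinned_range("@timestamp", timedelta(hours=1), fixed_now)
```

Because the bounds are plain strings, two requests built from the same instant serialize identically, which is the property the shard request cache needs.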

Step-by-Step Implementation

The production path is an automation-first pipeline that triggers the moment an index becomes searchable after creation or rollover. The AsyncElasticsearch client dispatches warming queries concurrently under a bounded semaphore, routes requests across shard copies, and classifies failures so that backpressure is retried while fatal errors abort the cutover.

Bound concurrency to the search thread pool. Warming shares the search thread pool with any live read traffic. Cap concurrent warmups at roughly half of thread_pool.search.size (which defaults to int((allocated_processors * 3) / 2) + 1) so replay never fills the search queue.
Route across shard copies. Issue each payload with preference values that reach both primaries and replicas so no shard copy is left cold.
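Classify failures deterministically. Treat a 429 caused by search thread-pool rejection as transient backpressure and back off exponentially, with a bounded number of attempts. Treat circuit_breaking_exception, 403, and 503 as fatal and halt, so a failing warm never silently precedes an alias swap. Elasticsearch also reports breaker trips with HTTP 429, so inspect the error type before branching on the status code.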

import asyncio
import hashlib
import json
import logging
import time

from elasticsearch import ApiError, AsyncElasticsearch

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("cache_warmer")

# Bounded concurrency aligned with thread_pool.search.size, which defaults to
# int((allocated_processors * 3) / 2) + 1 and therefore varies by node CPU count.
MAX_CONCURRENT_WARMUPS = 12
MAX_ATTEMPTS = 5  # upper bound on attempts when a request meets 429 backpressure

WARMING_MANIFEST = {
    "logs-app-prod-*": [
        {"query": {"range": {"@timestamp": {"gte": "now-1h", "lte": "now"}}}},
        {"aggs": {"status_codes": {"terms": {"field": "http.status_code", "size": 50}}}},
        {"query": {"bool": {"filter": [{"term": {"service.name": "auth-api"}}]}}},
    ]
}


def payload_id(query_payload: dict) -> str:
    # Stable across runs, unlike the built-in hash(), which is salted per process.
    canonical = json.dumps(query_payload, sort_keys=True)
    return hashlib.sha1(canonical.encode()).hexdigest()[:8]


async def execute_warmup(
    es: AsyncElasticsearch,
    semaphore: asyncio.Semaphore,
    index_name: str,
    query_payload: dict,
    preference: str,
) -> None:
    for attempt in range(MAX_ATTEMPTS):
        try:
            # Hold a slot only while the request is in flight, never while sleeping.
            async with semaphore:
                await es.options(request_timeout=30).search(
                    index=index_name,
                    size=0,                 # populate caches, not document hits
                    request_cache=True,     # explicit opt-in; size=0 is cached by default
                    preference=preference,  # pin the request to one shard copy
                    **query_payload,        # "query" / "aggs" as top-level parameters
                )
            logger.info("[OK] Warmed %s via %s (payload %s)",
                        index_name, preference, payload_id(query_payload))
            return
        except ApiError as e:
            # Breaker trips also arrive as HTTP 429, so check the error type first.
            breaker = "circuit_breaking_exception" in str(e)
            if e.status_code == 429 and not breaker and attempt + 1 < MAX_ATTEMPTS:
                backoff = min(2 ** attempt, 10)
                logger.warning("[BACKPRESSURE] 429 on %s; retry in %ss", index_name, backoff)
                await asyncio.sleep(backoff)
                continue
            # Breaker trips, 403, 503, exhausted retries, and anything unexpected are fatal.
            logger.error("[FATAL] Warming aborted for %s: %s", index_name, e.message)
            raise


async def warm_index(
    es: AsyncElasticsearch,
    semaphore: asyncio.Semaphore,
    index_pattern: str,
    payloads: list,
) -> None:
    # Resolve the pattern to concrete indices so each rolled-over generation warms.
    resolved = await es.indices.get(index=index_pattern)
    for idx_name in resolved.keys():
        # One request per STARTED shard copy, pinned with _shards:<n>|_only_nodes:<id>,
        # so replicas are warmed as deliberately as primaries.
        groups = (await es.search_shards(index=idx_name))["shards"]
        preferences = [
            f"_shards:{copy['shard']}|_only_nodes:{copy['node']}"
            for group in groups
            for copy in group
            if copy["state"] == "STARTED"
        ]
        tasks = [
            execute_warmup(es, semaphore, idx_name, payload, pref)
            for payload in payloads
            for pref in preferences
        ]
        await asyncio.gather(*tasks)


async def run_warming_cycle(es: AsyncElasticsearch) -> None:
    start = time.perf_counter()
    logger.info("Starting deterministic cache warming cycle...")
    # Create the semaphore inside the running event loop.
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_WARMUPS)
    tasks = [
        warm_index(es, semaphore, pattern, queries)
        for pattern, queries in WARMING_MANIFEST.items()
    ]
    await asyncio.gather(*tasks)
    logger.info("Warming cycle completed in %.2fs", time.perf_counter() - start)


async def main() -> None:
    # Open and close the client on the same event loop that uses it.
    async with AsyncElasticsearch(
        hosts=["https://es-cluster-01:9200"],
        api_key="YOUR_API_KEY",
        verify_certs=True,
        max_retries=3,
        retry_on_timeout=True,
        request_timeout=60,
    ) as es_client:
        await run_warming_cycle(es_client)


if __name__ == "__main__":
    asyncio.run(main())

Note the v8 client surface: es.search(...), es.indices.get(...), and top-level keyword parameters. Errors raise ApiError carrying status_code and message, which is what the classifier branches on. Because warming is read-only and idempotent, a failed cycle can be re-run verbatim with no risk to the target index — which makes it safe to gate the alias swap on a successful cycle.
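
Gating the swap on a successful cycle can be as simple as building the alias actions up front and sending them only if warming returned without raising. The sketch below is illustrative: alias_swap_actions is a hypothetical helper, and the alias and index names are placeholders. The action shapes follow the update aliases API, which applies all actions in one request atomically.

```python
def alias_swap_actions(alias: str, old_index: str, new_index: str) -> list[dict]:
    """Actions for a single update-aliases call: drop the old index, add the new one."""
    return [
        {"remove": {"index": old_index, "alias": alias}},
        {"add": {"index": new_index, "alias": alias, "is_write_index": True}},
    ]


# Placeholder names; substitute the alias and index generations from your pipeline.
actions = alias_swap_actions(
    alias="logs-app-prod",
    old_index="logs-app-prod-000041",
    new_index="logs-app-prod-000042",
)
```

In the cutover script, a call such as await es.indices.update_aliases(actions=actions) would then sit directly after await run_warming_cycle(es), so any exception raised during warming prevents the swap from being sent.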

The semaphore caps replay at half the search pool while requests fan out to every primary and replica; a 429 retries with backoff, a breaker trip or 4xx/503 aborts before the swap.

Verification

Warming is only successful if the caches actually filled and stayed filled. Confirm with the index and node stats APIs before promoting the index.

Check the shard request cache — a healthy warm shows hit_count climbing on repeat queries and evictions near zero:

GET /logs-app-prod-000042/_stats/request_cache?human

Check the node query cache for filter-bitset population and eviction pressure:

GET /_nodes/stats/indices/query_cache?human

Confirm every shard copy is STARTED (a copy that was not STARTED during warming cannot have been warmed):

GET /_cat/shards/logs-app-prod-000042?v&h=index,shard,prirep,state,node

Expected: every row STARTED, with one p row and one r row per shard number when the index has a single replica. Then re-issue a representative manifest query and compare its took against the same query pre-warm: a warmed index should answer the repeated identical request markedly faster, sourced from cache rather than disk. If request_cache hit_count stays flat across repeated identical queries, the payloads are not cacheable as written. The usual cause is a now-relative range, which resolves to a different instant on every call and is therefore generally excluded from the shard request cache. Pin the range to absolute timestamps for warming.
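
The hit-count comparison is easier to automate as a delta between two stats snapshots taken before and after the repeat queries. This sketch assumes each snapshot is the parsed response of GET /<index>/_stats/request_cache and reads the _all.total.request_cache counters; request_cache_delta and the sample numbers are illustrative.

```python
def request_cache_delta(before: dict, after: dict) -> dict:
    """Difference in request-cache counters between two index stats responses."""
    b = before["_all"]["total"]["request_cache"]
    a = after["_all"]["total"]["request_cache"]
    hits = a["hit_count"] - b["hit_count"]
    misses = a["miss_count"] - b["miss_count"]
    lookups = hits + misses
    return {
        "hits": hits,
        "misses": misses,
        "evictions": a["evictions"] - b["evictions"],
        # Ratio over the interval only, not over the lifetime of the index.
        "hit_ratio": hits / lookups if lookups else 0.0,
    }


# Hypothetical snapshots, trimmed to the fields the helper reads.
snap_before = {"_all": {"total": {"request_cache": {
    "hit_count": 0, "miss_count": 6, "evictions": 0}}}}
snap_after = {"_all": {"total": {"request_cache": {
    "hit_count": 9, "miss_count": 9, "evictions": 0}}}}
delta = request_cache_delta(snap_before, snap_after)
```

A delta with zero hits after repeating identical payloads points at non-cacheable requests; a delta with rising evictions points at a cache that is too small for the manifest.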

Threshold Tuning and Performance Guidance

Warming has to be aggressive enough to fill caches quickly but restrained enough to leave the deployment responsive to live traffic. Three boundaries govern the balance:

Search concurrency vs. queue depth. MAX_CONCURRENT_WARMUPS should sit at 50% of thread_pool.search.size on the busiest node. Above that, warming replay competes with production reads and the search queue climbs toward rejection. Derive the ceiling from the configured pool size reported by GET /_nodes/thread_pool, and watch queue and rejected in GET /_nodes/stats/thread_pool, rather than hard-coding it, so heterogeneous node sizes are respected.
Aggregation size vs. the request breaker. High-cardinality terms aggregations build large bucket structures on the heap during warming. Keep warming size values modest (the top-N buckets your dashboards actually render) and confirm doc_values are enabled so aggregation reads come off disk-backed columnar storage, not the field-data breaker. Heap-backed field data on a high-cardinality field can trip circuit_breaking_exception as soon as warming touches it.
Cache size vs. eviction churn. If indices.requests.cache.size (default 1% of heap) is too small for the diversity of warmed payloads, entries evict as fast as they load and warming accomplishes nothing. Watch evictions in the request-cache stats; if they exceed ~15% of capacity within five minutes, the manifest is too broad — consolidate to the top-10 production query templates rather than raising the cache size and starving other indices.
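
The concurrency ceiling from the first boundary can be derived from cluster state rather than a constant. The sketch below assumes the parsed response of the node info API, GET /_nodes/thread_pool, which reports each node's configured search pool size. warmup_cap is a hypothetical helper, and sizing against the smallest pool is a conservative assumption of this example, not a rule from the cluster.

```python
def warmup_cap(nodes_info: dict, fraction: float = 0.5) -> int:
    """Concurrency ceiling: a fraction of the smallest search pool in the cluster."""
    sizes = [
        node["thread_pool"]["search"]["size"]
        for node in nodes_info["nodes"].values()
        if "search" in node.get("thread_pool", {})
    ]
    if not sizes:
        raise ValueError("no search thread pool sizes in response")
    # Never return 0, or the semaphore would block every warming request.
    return max(1, int(min(sizes) * fraction))


# Hypothetical response fragment: an 8-CPU node (13 threads) and a 16-CPU node (25).
sample_info = {"nodes": {
    "node-a": {"thread_pool": {"search": {"type": "fixed", "size": 13}}},
    "node-b": {"thread_pool": {"search": {"type": "fixed", "size": 25}}},
}}
cap = warmup_cap(sample_info)
```

The result can be passed to asyncio.Semaphore in place of a hard-coded MAX_CONCURRENT_WARMUPS.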

Warming cadence also interacts with concurrent migrations. When warming runs alongside a large copy, stagger it against the batch pipeline described in Designing batch reindex workflows so the two do not contend for the same thread pools, and calibrate both against the capacity baselines described in Optimizing reindex thresholds and bulk sizes.

Troubleshooting

Live clusters surface a predictable set of warming failures. Each row maps a symptom to its diagnostic API call and the corrective action.

Symptom	Diagnosis	Resolution
`429` errors spike despite low CPU	`GET /_cat/thread_pool/search?v&h=node_name,active,queue,rejected`	Reduce `MAX_CONCURRENT_WARMUPS` to 50% of `thread_pool.search.size`; confirm `indices.requests.cache.size` is not forcing constant evictions.
`circuit_breaking_exception` on the `request` or `parent` breaker during aggregation warming	`GET /_nodes/stats/breaker?human`	Lower the `size` on term aggregations in the manifest; add `track_total_hits: false` to range scans; ensure `doc_values` are enabled on every aggregated field so warming avoids the field-data breaker.
Latency spikes persist after warming completes	`GET /_nodes/stats/indices/query_cache?human` (watch `hit_count` vs. `miss_count` and `evictions`)	Verify `index.queries.cache.enabled: true`; if evictions exceed ~15% of capacity within five minutes the payloads are too diverse — consolidate to the top-10 production templates.
Warming hits only primaries; replicas stay cold	`GET /_cat/shards/<index>?v&h=index,shard,prirep,state,node`	Fan warming out across shard copies rather than pinning `preference` to one node; confirm replicas are allocated and `STARTED` so they are warmable at all.
`request_cache` `hit_count` never rises on repeat queries	Re-run an identical manifest query twice and compare `took` and the `request_cache` stats before and after	Replace `now`-relative ranges with absolute timestamps for warming: a `now`-relative range resolves to a different instant on every call and is generally excluded from the shard request cache, so nothing is ever a hit.

If replicas are UNASSIGNED rather than merely cold, the problem is allocation, not warming: shard placement is governed by shard allocation and the hot-warm-cold architecture, and an index.routing.allocation filter that resolves to a tier with no capacity will leave replicas unallocated and therefore permanently un-warmable. Resolve allocation first, then warm.

Frequently Asked Questions

Why do new indices have cold caches even after a reindex copies all the data?

Reindexing writes documents into fresh Lucene segments, but the filesystem page cache, shard request cache, node query cache, and field-data structures are all populated lazily — only when a search actually reads them. A reindex issues writes, not the representative reads that build those caches, so the target index starts cold regardless of how much data it holds. Warming supplies the missing reads before real traffic does.

Does running warming queries with size:0 actually populate the caches?

Yes. size: 0 suppresses document hit materialization but still executes the query and aggregation over the shards, which warms the page cache, loads doc_values, and — with request_cache: true — stores the aggregation result in the shard request cache. Because you are not returning hits, warming stays cheap on the coordinating node while still exercising the expensive per-shard work that cold-start latency comes from.

How do I make sure warming reaches replica shards and not just primaries?

The shard request cache and query cache are per-shard-copy, so warming a primary leaves its replica cold. Dispatch each payload across the full shard set rather than pinning it to a single copy — issue repeated requests so the search routing distributes them, or target specific shard copies via preference. Verify with GET /_cat/shards/<index>?v&h=index,shard,prirep,state: every primary and replica must be STARTED to be warmable at all.

Will cache warming interfere with live production queries on the same cluster?

It can, because warming replay shares the search thread pool with live reads. Bound concurrency to roughly 50% of thread_pool.search.size and monitor search.queue depth; if the queue climbs toward its rejection limit, lower MAX_CONCURRENT_WARMUPS. Warming a new index that has not yet taken traffic is low-risk, but running it against a shared node during peak load without a concurrency cap will steal search capacity from active workloads.

How much of the latency SLO can warming actually recover?

The recoverable portion is the cold-start delta: the difference between first-touch query latency (cold disk reads, doc_values load, filter-bitset construction) and steady-state latency served from cache. On aggregation-heavy dashboards and faceted search this delta is often several-fold on p99 for the first minutes after cutover. Warming does not improve steady-state performance — it just eliminates the transient spike so the alias swap is invisible to users.

Related

Designing batch reindex workflows — staggering warming against a large copy so the two do not contend for thread pools.
Optimizing reindex thresholds & bulk sizes — the capacity baselines that also bound warming concurrency.
Tracking reindex progress & performance — instrumenting the cutover, including the post-warm latency check.
Resolving document conflicts during reindex — the conflict handling that precedes a clean, warmable target.
Understanding hot-warm-cold architecture — how shard allocation decides which copies exist to warm.
