How to Configure Rollover Based on Max Primary Shard Size

Configure an ILM rollover action to fire on max_primary_shard_size so Elasticsearch cuts a new backing index once the largest primary shard reaches a fixed byte ceiling, regardless of document count or elapsed time. The cut lands at the next ILM poll after the ceiling is crossed, not at the exact moment of the breach.

This is the size-driven variant of the first lifecycle transition. Rolling on document count or a fixed clock alone produces pathological shard distributions under variable ingest: a traffic spike inflates one primary past the efficient merge band, while a quiet index never rolls at all. Anchoring the trigger to primary shard size gives you deterministic, storage-aware rotation instead. It is one specific choice inside the broader set of rollover conditions, and it is what makes each index hand off cleanly to the later tiers of a hot-warm-cold architecture at a predictable size. The control plane underneath it all is Index Lifecycle Management (ILM), which polls each managed index and advances the write alias when the condition trips.

Why size, specifically? A primary shard between roughly 30 GB and 50 GB is where Lucene segment merges stay efficient and shard recovery stays fast. Below ~10 GB the fixed per-shard overhead (cluster state, thread-pool slots, file handles) dominates and the master node drowns in bookkeeping; above ~50 GB merges slow, recovery stretches into hours, and query fan-out latency climbs non-linearly. max_primary_shard_size is the one condition that tracks that resource directly.
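As a rough illustration of those bands, the sketch below sorts a primary shard size into the ranges named above. The function name and the band labels (including the "acceptable" label for the 10 to 30 GB gap, which the text does not name) are assumptions made for this example, and the cut-offs are the approximate figures quoted above rather than hard limits.

```python
def classify_primary_shard(size_gb: float) -> str:
    """Place a primary shard size into the approximate bands described above."""
    if size_gb < 10:
        return "undersized"   # fixed per-shard overhead dominates
    if size_gb < 30:
        return "acceptable"   # workable, but below the efficient band (label is illustrative)
    if size_gb <= 50:
        return "efficient"    # merges and recovery stay fast
    return "oversized"        # slower merges, longer recovery


if __name__ == "__main__":
    for size in (4, 22, 45, 80):
        print(f"{size:>3} GB -> {classify_primary_shard(size)}")
```

A 45 GB cap lands in the "efficient" band, which is the reason it is used as the rollover threshold throughout this guide.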

Prerequisites

Elasticsearch 8.x with data-tier node roles assigned (data_hot, data_warm, data_cold) so rolled indices can migrate off the hot tier.
elasticsearch-py v8.0+ (pip install "elasticsearch>=8,<9") — the code below uses the v8 keyword-argument surface, not the legacy body= pattern.
A write alias with exactly one backing index flagged is_write_index: true, plus an index template (or explicit index settings) that sets index.lifecycle.name and index.lifecycle.rollover_alias.
manage_ilm and manage_index_templates cluster privileges on the service account, scoped as described in securing ILM policies with RBAC.

Implementation

1. Define the size-driven policy

max_primary_shard_size evaluates the largest primary shard in the current write index. Unlike max_size, which measures the combined size of all primary shards in the index (replicas are not counted by either condition), this metric tracks a single shard, so the threshold you set is the threshold that actually governs merge and query cost, whatever number_of_shards is. The hot phase below pairs the size cap with a max_age safety net so a low-traffic index still rolls on a predictable cadence; the conditions are OR-combined, so whichever fires first advances the alias.

PUT _ilm/policy/log_pipeline_rollover
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_primary_shard_size": "45gb",
            "max_age": "7d"
          },
          "set_priority": {
            "priority": 100
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 },
          "set_priority": { "priority": 50 }
        }
      }
    }
  }
}

When the largest primary reaches 45 GB, ILM rolls regardless of document count or elapsed time. Setting the cap at 45 GB, just under the 50 GB ceiling of the band, leaves headroom for the segment growth that happens between the moment the shard crosses the cap and the poll that actually cuts the index.
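The same policy can be applied from Python. The sketch below is one possible shape, assuming the elasticsearch-py v8 client and its ilm.put_lifecycle(name=..., policy=...) keyword form; the helper names build_rollover_policy and put_rollover_policy are made up for this example. The client is passed in as a parameter so the policy body can be inspected without a live cluster.

```python
from typing import Any


def build_rollover_policy(cap: str = "45gb", max_age: str = "7d") -> dict[str, Any]:
    """Return the `policy` object from the request above as a Python dict."""
    return {
        "phases": {
            "hot": {
                "min_age": "0ms",
                "actions": {
                    # OR-combined: whichever condition is met first triggers the roll.
                    "rollover": {"max_primary_shard_size": cap, "max_age": max_age},
                    "set_priority": {"priority": 100},
                },
            },
            "warm": {
                "min_age": "7d",
                "actions": {
                    "shrink": {"number_of_shards": 1},
                    "forcemerge": {"max_num_segments": 1},
                    "set_priority": {"priority": 50},
                },
            },
        }
    }


def put_rollover_policy(es: Any, name: str = "log_pipeline_rollover") -> bool:
    """Create or update the policy.

    `es` is assumed to be an elasticsearch.Elasticsearch (v8) client.
    """
    resp = es.ilm.put_lifecycle(name=name, policy=build_rollover_policy())
    return bool(resp["acknowledged"])
```

Keeping the policy body in a pure function makes it easy to diff against the JSON above or to assert on in a unit test before anything touches the cluster.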

2. Bootstrap the write index and alias

The rollover step fails and the index lands in the ERROR step unless index.lifecycle.rollover_alias names an alias that points at the active write target. Bootstrap the first backing index with the alias flagged as the write index and both lifecycle settings attached:

PUT logs-000001
{
  "aliases": {
    "logs-write": { "is_write_index": true }
  },
  "settings": {
    "index.lifecycle.name": "log_pipeline_rollover",
    "index.lifecycle.rollover_alias": "logs-write",
    "number_of_shards": 3,
    "number_of_replicas": 1
  }
}

The -000001 suffix is load-bearing: rollover parses the trailing integer and increments it for each new index, so the alias must be bootstrapped onto an index whose name ends in a hyphen followed by a number.
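To make the naming rule concrete, here is a small sketch of how the next index name is derived. It mirrors the documented behavior (the counter is incremented and always rendered as six zero-padded digits) but is an illustration written for this guide, not Elasticsearch's own code; the function name is hypothetical.

```python
import re

# An index name qualifies when it ends in a hyphen followed by digits.
_ROLLOVER_NAME = re.compile(r"^(?P<prefix>.*)-(?P<counter>\d+)$")


def next_rollover_index(current: str) -> str:
    """Predict the name rollover will give the next backing index."""
    match = _ROLLOVER_NAME.match(current)
    if match is None:
        raise ValueError(
            f"{current!r} does not end in '-<number>'; rollover cannot derive a new name"
        )
    counter = int(match["counter"]) + 1
    # The new counter is zero-padded to six digits regardless of the old width.
    return f"{match['prefix']}-{counter:06d}"


if __name__ == "__main__":
    print(next_rollover_index("logs-000001"))
```

A name such as logs without a numeric suffix raises here, which corresponds to the case where rollover needs an explicit new index name instead.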

3. Automate the bootstrap and stuck-state recovery (Python v8+)

For repeatable deploys and unattended recovery, drive stuck-state diagnosis and the rollover call through the v8 client. explain_lifecycle is scoped by index (there is no policy argument), and API/HTTP failures surface as ApiError rather than TransportError, which is a sibling class and will not catch them.

import logging
from elasticsearch import Elasticsearch, ApiError

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger(__name__)


class ILMRolloverManager:
    def __init__(self, es: Elasticsearch):
        self.es = es

    def diagnose_stuck_indices(self, index_pattern: str = "logs-*") -> list[str]:
        # explain_lifecycle is scoped by index (no `policy` parameter in the v8 client).
        response = self.es.ilm.explain_lifecycle(index=index_pattern, human=True)
        stuck = []
        for idx, meta in response["indices"].items():
            # Only the ERROR step is genuinely stuck; an index sitting in
            # hot/rollover/check-rollover-ready is the healthy steady state.
            if meta.get("step") == "ERROR":
                logger.warning("Index %s halted: %s", idx, meta.get("step_info"))
                stuck.append(idx)
        return stuck

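    def force_safe_rollover(self, alias: str, max_size: str = "45gb") -> bool:
        # With conditions supplied, the rollover API only rolls when at least one
        # condition is already met; otherwise the response has rolled_over=False.
        try:
            # v8 surface: pass conditions as a keyword, not body=.
            resp = self.es.indices.rollover(
                alias=alias,
                conditions={"max_primary_shard_size": max_size},
            )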
            logger.info(
                "Rollover: rolled_over=%s old=%s new=%s",
                resp.get("rolled_over"), resp.get("old_index"), resp.get("new_index"),
            )
            return bool(resp.get("rolled_over"))
        except ApiError as exc:
            logger.error("Rollover failed (%s): %s", exc.status_code, exc.info)
            return False

    def verify_cluster_health(self) -> bool:
        health = self.es.cluster.health(level="indices")
        logger.info(
            "Cluster status=%s unassigned_shards=%s",
            health["status"], health.get("unassigned_shards", 0),
        )
        return health["status"] in ("green", "yellow")


if __name__ == "__main__":
    es = Elasticsearch(
        "https://es-cluster-01:9200",
        api_key="YOUR_BASE64_ENCODED_API_KEY",
        verify_certs=True,
    )
    manager = ILMRolloverManager(es)

    stuck = manager.diagnose_stuck_indices("logs-*")
    if stuck:
        manager.force_safe_rollover("logs-write", "45gb")
        manager.verify_cluster_health()
    else:
        logger.info("No stuck indices — ILM operating within parameters.")

Verification

First confirm the alias resolves to exactly one write index — a fractured alias is the most common cause of a rollover that silently never fires:

GET _cat/aliases/logs-write?v&h=alias,index,is_write_index

Exactly one row must show is_write_index as true, pointing at logs-000001. Next, inspect where ILM thinks the index sits and which step it is polling:

GET logs-000001/_ilm/explain?human

A healthy write index reports the check-rollover-ready step — ILM is actively polling the size condition each indices.lifecycle.poll_interval (default 10m):

{
  "indices": {
    "logs-000001": {
      "index": "logs-000001",
      "managed": true,
      "policy": "log_pipeline_rollover",
      "phase": "hot",
      "action": "rollover",
      "step": "check-rollover-ready",
      "step_info": { "message": "Waiting for index to meet rollover conditions" }
    }
  }
}

If step reports ERROR, the step_info names the fault directly — most often the alias/index mismatch below, where the rollover alias does not point at the index ILM is trying to roll:

{
  "indices": {
    "logs-000001": {
      "managed": true,
      "phase": "hot",
      "action": "rollover",
      "step": "ERROR",
      "step_info": {
        "type": "illegal_argument_exception",
        "reason": "index.lifecycle.rollover_alias [logs-write] does not point to index [logs-000001]"
      }
    }
  }
}
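The two responses above can be told apart programmatically. The sketch below is a minimal triage helper that works on the parsed explain response (a plain dict), so it needs no client; the function name and the verdict strings are assumptions made for this example.

```python
from typing import Any


def triage_explain(response: dict[str, Any]) -> dict[str, str]:
    """Map each index in an ILM explain response to a short verdict."""
    verdicts: dict[str, str] = {}
    for name, meta in response.get("indices", {}).items():
        if not meta.get("managed"):
            verdicts[name] = "unmanaged"
        elif meta.get("step") == "ERROR":
            # step_info carries the exception type and reason for the failure.
            reason = (meta.get("step_info") or {}).get("reason", "no reason reported")
            verdicts[name] = f"error: {reason}"
        elif meta.get("action") == "rollover" and meta.get("step") == "check-rollover-ready":
            # Healthy steady state for a write index.
            verdicts[name] = "waiting for rollover conditions"
        else:
            verdicts[name] = (
                f"in {meta.get('phase')}/{meta.get('action')}/{meta.get('step')}"
            )
    return verdicts
```

Fed the healthy response it reports "waiting for rollover conditions" for logs-000001; fed the ERROR response it surfaces the alias mismatch reason verbatim.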

Finally, watch the largest primary approach the cap so you can confirm the cut lands where you expect. When the store size of the largest primary shard reaches ~45 GB, the next poll should advance the alias to logs-000002:

GET _cat/shards/logs-*?v&h=index,shard,prirep,store&s=store:desc
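The same check can be scripted. The sketch below assumes the rows come from the cat shards API requested as JSON with sizes in bytes, for example es.cat.shards(index="logs-*", format="json", bytes="b", h="index,shard,prirep,store") with the v8 client; the helper names are hypothetical, and the cap is converted using 1024-based units.

```python
from typing import Any, Iterable

CAP_BYTES = 45 * 1024**3  # "45gb" expressed in bytes (1024-based units)


def largest_primary_bytes(rows: Iterable[dict[str, Any]]) -> dict[str, int]:
    """Return the largest primary shard store, in bytes, for each index."""
    largest: dict[str, int] = {}
    for row in rows:
        # Skip replicas ("r") and shards with no store yet (e.g. unassigned).
        if row.get("prirep") != "p" or row.get("store") is None:
            continue
        size = int(row["store"])
        largest[row["index"]] = max(size, largest.get(row["index"], 0))
    return largest


def fraction_of_cap(rows: Iterable[dict[str, Any]], cap_bytes: int = CAP_BYTES) -> dict[str, float]:
    """Express each index's largest primary as a fraction of the rollover cap."""
    return {idx: size / cap_bytes for idx, size in largest_primary_bytes(rows).items()}
```

A value at or above 1.0 for the current write index means the next ILM poll should roll it.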

Gotchas and edge cases

The cap is per primary, not per index. max_primary_shard_size measures the largest single primary, so an index with number_of_shards: 3 rolls when its biggest shard hits 45 GB. With evenly distributed writes, total primary store at that point is roughly the cap times the primary count (~135 GB here), and replicas add to the on-disk footprint on top of that. Choose number_of_shards and the size cap together, or you will roll at a much larger total footprint than you intended.
Rollover is never instant. ILM re-evaluates the state machine every indices.lifecycle.poll_interval (default 10m), so a shard that crosses 45 GB rolls at the next poll, not the moment it breaches. Budget headroom in the cap for the store that accrues during that window, and call POST logs-write/_rollover explicitly for an emergency cut.
A stalled check-rollover-ready is usually a disk watermark, not a broken policy. If shards are UNASSIGNED and _cluster/health is yellow or red, nodes on the target tier have likely crossed a disk watermark: the low and high watermarks restrict shard allocation on nearly full nodes, and the flood-stage watermark puts a read-only block on the indices stored there. Free disk space before you touch the policy, because force-rolling into a full tier just relocates the problem. The retention-side fallback for a tier with no capacity is covered in fallback routing for data retention.
Do not use max_size as a shard-size proxy. It measures the combined size of all primary shards, so the per-shard size it implies shifts whenever number_of_shards changes. Define max_primary_shard_size on the rollover action, and keep that action in the hot phase, the only phase in which rollover is valid.
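The arithmetic behind the first two gotchas can be worked through in a few lines. The sketch below is a back-of-the-envelope estimate, not a measurement: it assumes writes are spread evenly across primaries, and the store growth rate passed to poll_overshoot_gb is a hypothetical input you would take from your own monitoring. Both function names are made up for this example.

```python
def footprint_at_rollover_gb(cap_gb: float, primaries: int, replicas: int) -> tuple[float, float]:
    """Estimate (primary store, total on-disk store) in GB at rollover.

    Assumes every primary is about as large as the biggest one, so this
    is an upper-bound style estimate.
    """
    primary_total = cap_gb * primaries
    return primary_total, primary_total * (1 + replicas)


def poll_overshoot_gb(store_growth_mb_per_sec: float, poll_minutes: float = 10) -> float:
    """Worst-case growth of one shard between crossing the cap and the next poll."""
    return store_growth_mb_per_sec * 60 * poll_minutes / 1024


if __name__ == "__main__":
    primary, total = footprint_at_rollover_gb(cap_gb=45, primaries=3, replicas=1)
    print(f"primary store ~{primary:.0f} GB, with replicas ~{total:.0f} GB")
    print(f"overshoot at 5 MB/s per shard: ~{poll_overshoot_gb(5):.1f} GB")
```

For the example index in this guide (3 primaries, 1 replica, 45 GB cap) the estimate is about 135 GB of primary store and about 270 GB on disk, and a shard growing at a hypothetical 5 MB/s could overshoot the cap by roughly 3 GB within one default poll interval.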

FAQ

Why use max_primary_shard_size instead of max_size?

max_size measures the total size of all primary shards in the index (replicas are not counted), so the per-shard size it implies depends on number_of_shards and never maps cleanly to any single shard. max_primary_shard_size measures the largest individual primary, which is the value that actually governs Lucene merge cost, query latency, and recovery time. Anchor max_primary_shard_size near the top of the 30–50 GB band.

Should I pair the size cap with max_age?

Yes. A size-only policy can leave a low-traffic index unrolled for a long time: its primary may take months to reach 45 GB, leaving stale data pinned to the hot tier past its retention intent. Because rollover conditions are OR-combined, adding max_age (e.g. 7d) guarantees a fresh index on a predictable cadence while size still caps growth on busy days.
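The OR semantics can be modeled in a few lines. This is a simplified illustration with hypothetical names, not Elasticsearch's evaluation code; it covers only the two conditions used in this guide's policy.

```python
def rollover_reasons(
    largest_primary_gb: float,
    age_days: float,
    cap_gb: float = 45.0,
    max_age_days: float = 7.0,
) -> list[str]:
    """List the conditions that are met; any non-empty result means 'roll'."""
    reasons = []
    if largest_primary_gb >= cap_gb:
        reasons.append("max_primary_shard_size")
    if age_days >= max_age_days:
        reasons.append("max_age")
    return reasons


if __name__ == "__main__":
    print(rollover_reasons(largest_primary_gb=46, age_days=2))  # busy day: size fires
    print(rollover_reasons(largest_primary_gb=3, age_days=7))   # quiet week: age fires
    print(rollover_reasons(largest_primary_gb=3, age_days=2))   # nothing met yet
```

Without the max_age branch, the quiet-week case would return an empty list indefinitely, which is the stale hot-tier index the answer above warns about.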

My shard passed 45 GB but the index did not roll — why?

Three usual causes: the poll interval has not elapsed yet (up to 10 minutes is normal); the condition is measured against the largest primary, so a multi-shard index may not have any single shard at the cap even though total store looks large; or the index is not actually managed — check "managed": true and a non-ERROR step in GET <index>/_ilm/explain. A persistent ERROR step names the fault in step_info.

Related

Configuring index rollover conditions — the full set of triggers (max_age, max_docs, max_primary_shard_size) and how to combine them.
Understanding hot-warm-cold architecture — where an index goes after it rolls off the hot tier, and the node roles that route it there.
Monitoring ILM execution and error states — diagnosing a stuck check-rollover-ready or ERROR step in depth.
Building custom ILM policies via the API — authoring the policy-and-template pair that carries this rollover action.
