Configuring Index Rollover Conditions

flowchart LR
  W["Write alias"] --> I1["index-000001 (write)"]
  I1 --> R{"Rollover condition met? (OR)"}
  R -->|"max_age / max_docs / max_primary_shard_size"| I2["index-000002 (new write)"]
  R -->|"none met"| I1
  W -->|"alias repointed"| I2

Operational Concept & Routing Mechanics

Index rollover is the deterministic trigger that transitions active data ingestion into lifecycle-managed storage. When configuring index rollover conditions, engineers must treat the write alias as the single source of truth for routing. All ingestion clients target the alias; once a backing index rolls over, it stops receiving new writes through the alias and can be treated as effectively read-only, although Elasticsearch does not block writes to it unless you configure that explicitly. Rollover conditions (max_age, max_size, max_docs, max_primary_shard_size) dictate when Elasticsearch automatically creates a new backing index and shifts the write alias pointer. The conditions are combined with OR logic, so the first one met triggers the rollover. This mechanism decouples ingestion throughput from storage tiering, enabling predictable shard allocation and controlled mapping evolution as outlined in Elasticsearch ILM Architecture & Fundamentals.
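The OR semantics can be illustrated with a small, cluster-free sketch. This is not how Elasticsearch implements the check internally; IndexStats and met_conditions are hypothetical names introduced here, and the sketch ignores max_size and any min_* conditions for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IndexStats:
    """Hypothetical snapshot of the current write index."""
    age_seconds: float
    doc_count: int
    largest_primary_shard_bytes: int


def met_conditions(
    stats: IndexStats,
    *,
    max_age_seconds: Optional[float] = None,
    max_docs: Optional[int] = None,
    max_primary_shard_bytes: Optional[int] = None,
) -> List[str]:
    """Return the names of all configured conditions that are currently met."""
    met = []
    if max_age_seconds is not None and stats.age_seconds >= max_age_seconds:
        met.append("max_age")
    if max_docs is not None and stats.doc_count >= max_docs:
        met.append("max_docs")
    if (
        max_primary_shard_bytes is not None
        and stats.largest_primary_shard_bytes >= max_primary_shard_bytes
    ):
        met.append("max_primary_shard_size")
    return met


def should_rollover(stats: IndexStats, **conditions) -> bool:
    """OR logic: a single met condition is enough to roll over."""
    return bool(met_conditions(stats, **conditions))
```

For example, an index that is one hour old with 150 million documents rolls over on max_docs alone, even though max_age (1d) and max_primary_shard_size (50gb) are far from being reached.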

From a production standpoint, rollover thresholds must align strictly with cluster capacity planning. Overly aggressive conditions produce many small indices and shards, which increases cluster state size and master node coordination overhead. Conversely, permissive thresholds let primary shards grow past the commonly recommended ceiling of roughly 50 GB, which slows shard recovery and relocation and degrades query performance. Rollover also serves as the handoff point for tier migration. When an index meets its criteria, ILM transitions it into subsequent phases where replica reduction, force-merge, and allocation filtering occur. Proper alignment with Understanding Hot-Warm-Cold Architecture ensures rollover conditions do not starve hot nodes or saturate cold storage before retention boundaries are reached.
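A rough way to sanity-check thresholds against capacity is to estimate how long a size-based condition takes to fire at a given ingest rate. The sketch below assumes writes are spread evenly across primary shards and that on-disk size grows linearly with ingest; both are simplifications, and the function names are illustrative only.

```python
def hours_until_size_rollover(
    daily_primary_gb: float,
    primary_shards: int,
    max_primary_shard_size_gb: float = 50.0,
) -> float:
    """Estimate hours until the largest primary shard reaches the size limit."""
    if daily_primary_gb <= 0 or primary_shards <= 0:
        raise ValueError("ingest rate and shard count must be positive")
    per_shard_gb_per_hour = daily_primary_gb / primary_shards / 24
    return max_primary_shard_size_gb / per_shard_gb_per_hour


def first_trigger(hours_to_size_limit: float, max_age_hours: float) -> str:
    """Report which of the two conditions is expected to fire first."""
    if hours_to_size_limit < max_age_hours:
        return "max_primary_shard_size"
    return "max_age"
```

With 600 GB of primary data per day across 3 primary shards, each shard grows by about 8.3 GB per hour, so a 50 GB limit is reached in roughly 6 hours and the size condition fires well before a 1d max_age. That implies about four new indices per day, which is the figure to weigh against shard-count budgets.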

Threshold Tuning & Policy Configuration

Production-grade rollover policies require explicit JSON definitions applied via the _ilm/policy API. Prioritize max_primary_shard_size alongside max_age so that shard sizes stay bounded and index boundaries stay predictable. The following policy demonstrates a structured approach for high-throughput log ingestion:

PUT _ilm/policy/logs-production-rollover
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "1d",
            "max_docs": 100000000
          },
          "set_priority": {
            "priority": 100
          }
        }
      },
      "warm": {
        "min_age": "2d",
        "actions": {
          "allocate": {
            "number_of_replicas": 1,
            "require": { "data": "warm" }
          },
          "shrink": {
            "number_of_shards": 1
          },
          "forcemerge": {
            "max_num_segments": 1
          }
        }
      }
    }
  }
}

Key configuration considerations:

  • Shard Allocation Alignment: max_primary_shard_size must be tuned to your cluster’s JVM heap and disk I/O profile. Exceeding 50 GB per primary shard risks degraded search performance and prolonged recovery times.
  • Time vs. Volume Triggers: max_age guarantees predictable index boundaries regardless of ingestion spikes, while max_docs and max_primary_shard_size act as safety valves. Always define at least two conditions to prevent runaway indices during traffic anomalies.
  • Priority Management: The set_priority action sets index.priority, which controls the order in which indices are recovered after a node restart. Giving hot indices a higher value than warm/cold indices means the indices that take writes come back first.
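The "at least two conditions" guideline can be checked before a policy is ever sent to the cluster. The following is a minimal sketch of such a lint, assuming the policy is held as a Python dict shaped like the JSON above; check_rollover_policy is a hypothetical helper, not part of any Elasticsearch client.

```python
from typing import Dict, List

# Rollover condition names mentioned in this guide.
ROLLOVER_CONDITIONS = {"max_age", "max_docs", "max_size", "max_primary_shard_size"}


def rollover_conditions(policy_body: Dict) -> Dict:
    """Extract the rollover conditions configured in the hot phase, if any."""
    rollover = (
        policy_body.get("policy", {})
        .get("phases", {})
        .get("hot", {})
        .get("actions", {})
        .get("rollover", {})
    )
    return {k: v for k, v in rollover.items() if k in ROLLOVER_CONDITIONS}


def check_rollover_policy(policy_body: Dict, minimum: int = 2) -> List[str]:
    """Return a list of human-readable problems; an empty list means the lint passed."""
    found = rollover_conditions(policy_body)
    problems = []
    if len(found) < minimum:
        problems.append(
            f"only {len(found)} rollover condition(s) defined, expected at least {minimum}"
        )
    if "max_age" not in found:
        problems.append("no max_age condition, index boundaries depend on volume alone")
    return problems
```

Running this in CI against policy files is one way to catch a single-condition policy before it reaches production; the specific rules are a design choice and can be tightened or relaxed.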

Implementation via REST API & Python v8+ Client

Deploying rollover policies requires a deterministic bootstrap sequence: create the policy, initialize the first index with an explicit write alias, and attach the policy. The official Elasticsearch Rollover API documentation details the exact payload structure, but the following Python v8+ orchestration script automates the entire workflow with production-safe error handling.

from elasticsearch import Elasticsearch, ApiError
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")

def bootstrap_rollover_pipeline(es: Elasticsearch, policy_name: str, alias: str):
    policy_body = {
        "policy": {
            "phases": {
                "hot": {
                    "min_age": "0ms",
                    "actions": {
                        "rollover": {
                            "max_primary_shard_size": "50gb",
                            "max_age": "1d"
                        },
                        "set_priority": {"priority": 100}
                    }
                }
            }
        }
    }

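    try:
        # 1. Create or update the ILM policy (the v8 client takes the inner "policy" object)
        es.ilm.put_lifecycle(name=policy_name, policy=policy_body["policy"])
        logging.info(f"Policy '{policy_name}' applied successfully.")

        # 2. Bootstrap the initial index with the write alias and ILM settings
        initial_index = f"{alias}-000001"
        es.indices.create(
            index=initial_index,
            aliases={alias: {"is_write_index": True}},
            settings={
                "index.lifecycle.name": policy_name,
                # The rollover action needs this to know which alias to repoint
                "index.lifecycle.rollover_alias": alias,
            },
        )
        logging.info(f"Initial index '{initial_index}' created with write alias '{alias}'.")

        # 3. Verify ILM attachment; the phase may be unset until ILM first processes the index
        ilm_status = es.ilm.explain_lifecycle(index=initial_index)
        index_state = ilm_status["indices"][initial_index]
        phase = index_state.get("phase", "not yet assigned")
        logging.info(f"ILM managed: {index_state.get('managed')}. Current phase: {phase}")

    except ApiError as e:
        # Note: re-running against an existing index raises resource_already_exists_exception
        logging.error(f"Elasticsearch API Error: {e.info}")
        raise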

# Client initialization (v8+ syntax)
es_client = Elasticsearch(
    "https://localhost:9200",
    basic_auth=("elastic", "YOUR_SECURE_PASSWORD"),
    ca_certs="/path/to/http_ca.crt",
    verify_certs=True
)

bootstrap_rollover_pipeline(es_client, "logs-production-rollover", "logs-production")
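The bootstrap covers only the first index. Indices created later by rollover take their settings from matching index templates, so the policy and rollover alias normally need to live in a template as well. The sketch below builds keyword arguments for the v8 client's indices.put_index_template call; the template name, the "-*" pattern, and the priority value of 500 are assumptions for illustration. The priority is set above 100 because Elasticsearch ships built-in templates at priority 100 for patterns such as logs-*-*, which an alias like logs-production would otherwise match.

```python
from typing import Any, Dict


def rollover_template_kwargs(alias: str, policy_name: str, priority: int = 500) -> Dict[str, Any]:
    """Build kwargs for es.indices.put_index_template (names and priority are illustrative)."""
    return {
        "name": f"{alias}-template",          # hypothetical naming convention
        "index_patterns": [f"{alias}-*"],     # matches alias-000001, alias-000002, ...
        "priority": priority,                 # assumed value, chosen to outrank built-in templates
        "template": {
            "settings": {
                "index.lifecycle.name": policy_name,
                "index.lifecycle.rollover_alias": alias,
            }
        },
    }


# Usage against a live cluster (requires an Elasticsearch v8 client instance `es`):
# es.indices.put_index_template(**rollover_template_kwargs("logs-production", "logs-production-rollover"))
```

Creating the template before the first index means every backing index, including the bootstrap one, picks up the same lifecycle settings.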

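For manual testing or emergency rollover during capacity incidents, call the rollover API directly. With a conditions body, the index rolls over only if at least one condition is currently met; omit the body to force an unconditional rollover: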

POST logs-production/_rollover
{
  "conditions": {
    "max_age": "1d",
    "max_primary_shard_size": "50gb"
  }
}
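Before rolling over by hand, it can help to preview the outcome. The rollover API accepts a dry_run flag that evaluates the conditions without creating an index. The sketch below wraps the v8 client's indices.rollover call; preview_rollover is a hypothetical helper, and it accepts any object exposing the same method so it can be exercised without a cluster.

```python
from typing import Any, Dict


def preview_rollover(es: Any, alias: str, conditions: Dict[str, Any]) -> Dict[str, Any]:
    """Run a dry-run rollover and summarize which conditions are currently met."""
    resp = es.indices.rollover(alias=alias, conditions=conditions, dry_run=True)
    condition_results = dict(resp["conditions"])
    return {
        # OR semantics: one met condition is enough for a real rollover to proceed
        "would_roll": any(condition_results.values()),
        "old_index": resp["old_index"],
        "new_index": resp["new_index"],
        "conditions": condition_results,
    }


# Usage against a live cluster (assumes an Elasticsearch v8 client instance `es_client`):
# summary = preview_rollover(
#     es_client, "logs-production", {"max_age": "1d", "max_primary_shard_size": "50gb"}
# )
# print(summary)
```

If would_roll is false during an incident, that is the signal to either loosen the conditions or issue the unconditional form of the request.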

Production Troubleshooting & Debugging Flows

Rollover failures typically manifest as stuck ILM steps, alias routing drift, or shard size miscalculations. Follow this diagnostic sequence to isolate and resolve operational bottlenecks:

  1. Verify Alias Write Routing: If ingestion returns 400 Bad Request: alias [logs-production] has more than one write index (or reports that no write index is defined), the alias pointer has fractured.
  GET _cat/aliases/logs-production?v&h=alias,index,is_write_index

Resolve by explicitly reassigning the write flag in a single atomic request, clearing it on the stale index (logs-production-000002 in this example) so that exactly one write index remains:
  POST /_aliases
  {"actions": [{"add": {"index": "logs-production-000002", "alias": "logs-production", "is_write_index": false}}, {"add": {"index": "logs-production-000003", "alias": "logs-production", "is_write_index": true}}]}

  2. Diagnose Stuck ILM Steps: When an index remains in hot despite exceeding thresholds, inspect its current ILM state:
  GET logs-production-000002/_ilm/explain

Check whether step is ERROR, then read failed_step and step_info for the reason; a step that is still waiting (for example, one blocked on wait_for_snapshot) shows up in the step field instead. Common causes include a missing index.lifecycle.rollover_alias setting, disk watermarks blocking allocation, or pending shard relocation. Keep in mind that ILM evaluates rollover conditions only once per indices.lifecycle.poll_interval (10 minutes by default), so a short overshoot is expected. After fixing the cause, resume with POST logs-production-000002/_ilm/retry and adjust cluster routing allocation if necessary.

  3. Shard Size & Circuit Breaker Drift: If max_primary_shard_size appears to be ignored, confirm the cluster version supports it (7.13 or later) and verify that index.routing.allocation.total_shards_per_node isn’t preventing the new index's shards from being allocated. Additionally, monitor indices.breaker.total.limit to ensure rollover isn’t coinciding with parent circuit breaker trips during heavy segment merging.

  4. Policy Permission & RBAC Validation: Automated pipelines frequently fail with security_exception when service accounts lack the manage_ilm or manage_index_templates cluster privileges. Ensure pipeline credentials adhere to least-privilege standards documented in Securing ILM Policies with RBAC. Validate the authenticated identity and its roles using GET _security/_authenticate.

  5. Multi-Tenant Namespace Isolation: In shared clusters, overlapping rollover conditions across tenants can cause cross-tenant shard contention. Implement tenant-specific aliases and apply index templates with index.lifecycle.name scoped to distinct policies. Reference Setting Up ILM for Multi-Tenant Log Analytics for namespace routing patterns that prevent policy collision.
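Several of the checks above reduce to inspecting JSON responses, which makes them easy to script. The sketch below shows three small helpers that work on already-fetched response bodies: rows from GET _cat/aliases with format=json, one index entry from the ILM explain response, and the body of a has-privileges check. The function names and the wording of the returned messages are illustrative assumptions, not part of the Elasticsearch client.

```python
from typing import Any, Dict, List


def write_indices(cat_alias_rows: List[Dict[str, str]], alias: str) -> List[str]:
    """List the indices flagged as write index for an alias; exactly one is healthy."""
    return [
        row["index"]
        for row in cat_alias_rows
        if row.get("alias") == alias and row.get("is_write_index") == "true"
    ]


def classify_ilm_state(explain_entry: Dict[str, Any]) -> str:
    """Summarize one index's entry from the ILM explain response."""
    if not explain_entry.get("managed", False):
        return "unmanaged"
    if explain_entry.get("step") == "ERROR":
        reason = explain_entry.get("step_info", {}).get("reason", "unknown reason")
        return f"error in {explain_entry.get('failed_step')}: {reason}"
    return "waiting in {phase}/{action}/{step}".format(
        phase=explain_entry.get("phase"),
        action=explain_entry.get("action"),
        step=explain_entry.get("step"),
    )


def missing_cluster_privileges(has_privileges_body: Dict[str, Any]) -> List[str]:
    """Return the cluster privileges the current credentials lack, sorted by name."""
    return sorted(
        name for name, granted in has_privileges_body.get("cluster", {}).items() if not granted
    )


# Fetching the inputs with the v8 client (assumes a client instance `es`):
# rows = es.cat.aliases(name="logs-production", format="json")
# entry = es.ilm.explain_lifecycle(index="logs-production-000002")["indices"]["logs-production-000002"]
# privs = es.security.has_privileges(cluster=["manage_ilm", "manage_index_templates"])
```

Keeping the parsing separate from the API calls lets the same helpers run against captured responses from a staging cluster or an incident ticket.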

For advanced client integration and async polling patterns, consult the Python Elasticsearch Client v8 Documentation. Always validate rollover conditions against staging cluster telemetry before promoting to production routing tables.