Configuring Index Rollover Conditions
flowchart LR
W["Write alias"] --> I1["index-000001 (write)"]
I1 --> R{"Rollover condition met? (OR)"}
R -->|"max_age / max_docs / max_primary_shard_size"| I2["index-000002 (new write)"]
R -->|"none met"| I1
W -->|"alias repointed"| I2
Operational Concept & Routing Mechanics
Index rollover is the trigger that moves an index out of active ingestion and into the later, lifecycle-managed phases. When configuring index rollover conditions, engineers must treat the write alias as the single source of truth for routing. All ingestion clients target the alias; once an index has rolled over, it no longer receives new documents through the alias and can be treated as effectively read-only. Rollover conditions (max_age, max_size, max_docs, max_primary_shard_size) are combined with OR semantics and dictate when Elasticsearch creates a new backing index, named by incrementing the numeric suffix (index-000001 becomes index-000002), and shifts the write alias pointer to it. This mechanism decouples ingestion throughput from storage tiering, enabling predictable shard allocation and controlled mapping evolution as outlined in Elasticsearch ILM Architecture & Fundamentals.
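The alias handoff described above can be modelled in a few lines. This is a minimal sketch, not Elasticsearch's implementation: the helper names `next_rollover_index` and `roll_alias` are hypothetical, and the alias is represented as a plain dict mapping index name to its write flag. It relies on the documented naming rule that a trailing number is incremented and zero-padded to six digits.

```python
import re


def next_rollover_index(current: str) -> str:
    """Return the name rollover would generate for the next backing index."""
    match = re.fullmatch(r"(.+)-(\d+)", current)
    if match is None:
        raise ValueError(f"{current!r} does not end in '-<number>'")
    prefix, counter = match.groups()
    # The counter is incremented and zero-padded to six digits.
    return f"{prefix}-{int(counter) + 1:06d}"


def roll_alias(alias_state: dict[str, bool]) -> dict[str, bool]:
    """Simulate a rollover on an alias (index name -> is_write_index)."""
    writers = [name for name, is_write in alias_state.items() if is_write]
    if len(writers) != 1:
        raise ValueError("alias must have exactly one write index")
    # Every existing index loses the write flag; the new index gains it.
    updated = {name: False for name in alias_state}
    updated[next_rollover_index(writers[0])] = True
    return updated


state = roll_alias({"index-000001": True})
print(state)
```

Clients keep writing to the alias throughout; only the flag moves, which is why ingestion code never needs to know the current index name.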
From a production standpoint, rollover thresholds must align with cluster capacity planning. Overly aggressive conditions produce many small indices, which inflates shard counts and increases cluster-state overhead on the master node. Conversely, permissive thresholds let primary shards grow past the commonly recommended 50 GB ceiling, which prolongs shard recovery and can degrade query performance. Rollover also serves as the handoff point for tier migration. Once an index has rolled over, ILM moves it into subsequent phases where replica reduction, force-merge, and allocation filtering occur. Proper alignment with Understanding Hot-Warm-Cold Architecture ensures rollover conditions do not starve hot nodes or saturate cold storage before retention boundaries are reached.
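A rough way to sanity-check thresholds against capacity is to estimate which condition fires first for a given ingest rate. The sketch below assumes a constant ingest rate and an even spread of data across primary shards, neither of which holds exactly in practice; the function name and parameters are illustrative only.

```python
def first_rollover_trigger(
    daily_primary_gb: float,
    primary_shards: int,
    max_shard_gb: float = 50.0,
    max_age_days: float = 1.0,
) -> tuple[str, float, float]:
    """Return (condition, days until rollover, GB per primary shard at rollover)."""
    if daily_primary_gb <= 0 or primary_shards <= 0:
        raise ValueError("ingest rate and shard count must be positive")
    gb_per_shard_per_day = daily_primary_gb / primary_shards
    days_to_size_limit = max_shard_gb / gb_per_shard_per_day
    if days_to_size_limit <= max_age_days:
        # The size ceiling is reached before the age limit.
        return ("max_primary_shard_size", days_to_size_limit, max_shard_gb)
    # The age limit fires first; shards are smaller than the ceiling.
    return ("max_age", max_age_days, gb_per_shard_per_day * max_age_days)


print(first_rollover_trigger(daily_primary_gb=300, primary_shards=3))
print(first_rollover_trigger(daily_primary_gb=60, primary_shards=3))
```

If the age limit consistently fires while shards are still only a few gigabytes, the conditions are on the aggressive side and a longer max_age or fewer primary shards may be worth testing.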
Threshold Tuning & Policy Configuration
Production-grade rollover policies require explicit JSON definitions applied via the _ilm/policy API. Prioritize max_primary_shard_size alongside max_age to prevent unbalanced shard distribution and ensure JVM heap efficiency. The following policy demonstrates a structured approach for high-throughput log ingestion:
PUT _ilm/policy/logs-production-rollover
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "1d",
            "max_docs": 100000000
          },
          "set_priority": {
            "priority": 100
          }
        }
      },
      "warm": {
        "min_age": "2d",
        "actions": {
          "allocate": {
            "number_of_replicas": 1,
            "require": { "data": "warm" }
          },
          "shrink": {
            "number_of_shards": 1
          },
          "forcemerge": {
            "max_num_segments": 1
          }
        }
      }
    }
  }
}
Key configuration considerations:
- Shard Allocation Alignment: max_primary_shard_size must be tuned to your cluster’s JVM heap and disk I/O profile. Exceeding 50 GB per primary shard risks degraded search performance and prolonged recovery times.
- Time vs. Volume Triggers: max_age guarantees predictable index boundaries regardless of ingestion spikes, while max_docs and max_primary_shard_size act as safety valves. Rollover fires as soon as any one condition is met, so always define at least two conditions to prevent runaway indices during traffic anomalies.
- Priority Management: The set_priority action sets index.priority, which controls the order in which indices are recovered after a node restart. Giving hot indices a higher value than warm or cold indices brings the write indices back online first.
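The OR semantics behind the time and volume triggers can be illustrated with a small evaluator. This is a sketch of the decision logic only: `IndexStats` and `met_conditions` are hypothetical names, the defaults mirror the policy above, and the greater-or-equal comparison is an assumption about how the thresholds are applied.

```python
from dataclasses import dataclass


@dataclass
class IndexStats:
    age_hours: float
    docs: int
    largest_primary_shard_gb: float


def met_conditions(
    stats: IndexStats,
    max_age_hours: float = 24,
    max_docs: int = 100_000_000,
    max_primary_shard_gb: float = 50,
) -> list[str]:
    """Return the names of every rollover condition the index satisfies."""
    met = []
    if stats.age_hours >= max_age_hours:
        met.append("max_age")
    if stats.docs >= max_docs:
        met.append("max_docs")
    if stats.largest_primary_shard_gb >= max_primary_shard_gb:
        met.append("max_primary_shard_size")
    return met


def should_rollover(stats: IndexStats) -> bool:
    # OR semantics: a single satisfied condition is enough.
    return bool(met_conditions(stats))


quiet_day = IndexStats(age_hours=30, docs=5_000_000, largest_primary_shard_gb=12.0)
print(met_conditions(quiet_day), should_rollover(quiet_day))
```

With only a size condition, the quiet-day index in the example would never roll over; pairing it with max_age keeps index boundaries predictable.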
Implementation via REST API & Python v8+ Client
Deploying rollover policies requires a deterministic bootstrap sequence: create the policy, initialize the first index with an explicit write alias, and attach the policy. The official Elasticsearch Rollover API documentation details the exact payload structure, but the following Python v8+ orchestration script automates the entire workflow with production-safe error handling.
import logging

from elasticsearch import Elasticsearch, ApiError

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")


def bootstrap_rollover_pipeline(es: Elasticsearch, policy_name: str, alias: str):
    # Hot-phase policy: roll over at 50 GB per primary shard or after one day
    policy = {
        "phases": {
            "hot": {
                "min_age": "0ms",
                "actions": {
                    "rollover": {
                        "max_primary_shard_size": "50gb",
                        "max_age": "1d"
                    },
                    "set_priority": {"priority": 100}
                }
            }
        }
    }
    try:
        # 1. Apply or update ILM policy
        es.ilm.put_lifecycle(name=policy_name, policy=policy)
        logging.info(f"Policy '{policy_name}' applied successfully.")

        # 2. Bootstrap initial index with write alias.
        #    ILM needs index.lifecycle.rollover_alias to know which alias to roll.
        initial_index = f"{alias}-000001"
        es.indices.create(
            index=initial_index,
            aliases={alias: {"is_write_index": True}},
            settings={
                "index.lifecycle.name": policy_name,
                "index.lifecycle.rollover_alias": alias,
            },
        )
        logging.info(f"Initial index '{initial_index}' created with write alias '{alias}'.")

        # 3. Verify ILM attachment (the phase may be unset until ILM first runs)
        ilm_status = es.ilm.explain_lifecycle(index=initial_index)
        phase = ilm_status["indices"][initial_index].get("phase", "new")
        logging.info(f"ILM attached. Current phase: {phase}")
    except ApiError as e:
        # Raised for HTTP error responses, e.g. when the index already exists
        logging.error(f"Elasticsearch API Error: {e.info}")
        raise


# Client initialization (v8+ syntax)
es_client = Elasticsearch(
    "https://localhost:9200",
    basic_auth=("elastic", "YOUR_SECURE_PASSWORD"),
    ca_certs="/path/to/http_ca.crt",
    verify_certs=True
)
bootstrap_rollover_pipeline(es_client, "logs-production-rollover", "logs-production")
For manual testing or emergency rollover during capacity incidents, call the rollover API directly. With a conditions body, the rollover happens only if at least one condition is already met; omit the body to roll over unconditionally:
POST logs-production/_rollover
{
  "conditions": {
    "max_age": "1d",
    "max_primary_shard_size": "50gb"
  }
}
Production Troubleshooting & Debugging Flows
Rollover failures typically manifest as stuck ILM steps, alias routing drift, or shard size miscalculations. Follow this diagnostic sequence to isolate and resolve operational bottlenecks:
- Verify Alias Write Routing: If ingestion returns 400 Bad Request: alias [logs-production] has more than one write index, the alias pointer has fractured. List the alias members and their write flags:
GET _cat/aliases/logs-production?v&h=alias,index,is_write_index
Resolve by explicitly reassigning the write flag:
POST /_aliases
{"actions": [{"add": {"index": "logs-production-000003", "alias": "logs-production", "is_write_index": true}}]}
- Diagnose Stuck ILM Steps: When an index remains in hot despite exceeding thresholds, inspect its current ILM state:
GET logs-production-000002/_ilm/explain
Look for failed_step and the error details in step_info, or a step that is waiting on a snapshot. Common causes include a missing index.lifecycle.rollover_alias setting, nodes above the disk watermarks, or pending shard relocation. Adjust cluster routing allocation if necessary.
- Shard Size & Circuit Breaker Drift: If max_primary_shard_size appears to be ignored, remember that ILM evaluates conditions only once per indices.lifecycle.poll_interval (10 minutes by default), so some overshoot is expected under heavy ingestion. Also verify that index.routing.allocation.total_shards_per_node isn’t artificially capping shard creation, and monitor indices.breaker.total.limit to ensure rollover isn’t triggering parent circuit breakers during heavy segment merging.
- Policy Permission & RBAC Validation: Automated pipelines frequently fail with security_exception when service accounts lack manage_ilm or manage_index_templates privileges. Ensure pipeline credentials adhere to least-privilege standards documented in Securing ILM Policies with RBAC. Validate token scopes using GET _security/_authenticate.
- Multi-Tenant Namespace Isolation: In shared clusters, overlapping rollover conditions across tenants can cause cross-tenant shard contention. Implement tenant-specific aliases and apply index templates with index.lifecycle.name scoped to distinct policies. Reference Setting Up ILM for Multi-Tenant Log Analytics for namespace routing patterns that prevent policy collision.
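The first two checks in the sequence above lend themselves to scripting. The following sketch works on already-fetched responses rather than a live cluster, so it runs anywhere; the function names are hypothetical. It assumes the `_cat/aliases` output was requested with `format=json` (where is_write_index arrives as the string "true" or "false") and that a failed index reports a step of ERROR in the `_ilm/explain` response.

```python
from typing import Any


def aliases_without_single_write_index(
    cat_rows: list[dict[str, str]],
) -> dict[str, list[str]]:
    """Map each alias that lacks exactly one write index to its write indices."""
    writers: dict[str, list[str]] = {}
    for row in cat_rows:
        writers.setdefault(row["alias"], [])
        if row.get("is_write_index") == "true":
            writers[row["alias"]].append(row["index"])
    return {alias: found for alias, found in writers.items() if len(found) != 1}


def failed_ilm_indices(explain: dict[str, Any]) -> list[dict[str, Any]]:
    """Summarise managed indices whose ILM execution is in the ERROR step."""
    findings = []
    for name, info in explain.get("indices", {}).items():
        if not info.get("managed") or info.get("step") != "ERROR":
            continue
        findings.append({
            "index": name,
            "phase": info.get("phase"),
            "failed_step": info.get("failed_step"),
            "reason": info.get("step_info", {}).get("reason"),
        })
    return findings


rows = [
    {"alias": "logs-production", "index": "logs-production-000001", "is_write_index": "false"},
    {"alias": "logs-production", "index": "logs-production-000002", "is_write_index": "false"},
]
print(aliases_without_single_write_index(rows))
```

An empty result from both helpers suggests the problem lies elsewhere in the list, such as permissions or allocation settings.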
For advanced client integration and async polling patterns, consult the Python Elasticsearch Client v8 Documentation. Always validate rollover conditions against staging cluster telemetry before promoting to production routing tables.
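One way to validate conditions against a staging cluster is the rollover API's dry-run mode, which reports whether each condition is currently met without creating an index. The wrapper below is a sketch: `rollover_dry_run` is a hypothetical helper, `es` is assumed to be an `elasticsearch.Elasticsearch` v8 client, and the response fields it reads (old_index, new_index, conditions) are those documented for the rollover API.

```python
from typing import Any


def rollover_dry_run(es: Any, alias: str, conditions: dict[str, Any]) -> dict[str, Any]:
    """Ask Elasticsearch whether a rollover would happen, without performing it."""
    # dry_run=True evaluates the conditions but leaves indices and aliases untouched.
    resp = es.indices.rollover(alias=alias, conditions=conditions, dry_run=True)
    per_condition = dict(resp["conditions"])
    return {
        "would_roll_over": any(per_condition.values()),
        "old_index": resp["old_index"],
        "new_index": resp["new_index"],
        "conditions": per_condition,
    }


# Example call against a staging cluster (es_client is assumed to exist):
# report = rollover_dry_run(
#     es_client,
#     "logs-production",
#     {"max_age": "1d", "max_primary_shard_size": "50gb"},
# )
# print(report)
```

Running this on a schedule in staging gives a quick view of how often each condition would have fired before the same thresholds are promoted to production.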