Elasticsearch ILM Policy JSON Template for Beginners: Production-Grade Lifecycle Orchestration
Running an Elasticsearch ILM policy in production takes more than copying a JSON template from static documentation. For search engineers, log analytics teams, and DevOps practitioners, Index Lifecycle Management is not a passive retention scheduler; it is a distributed state machine that directly manipulates cluster allocation, shard routing, and index metadata. Misconfigured conditions cause stuck phases, stalled shrink operations, and compliance audit failures. This guide provides a production-grade template, diagnostic workflows for edge-case resolution, and idempotent recovery automation.
flowchart LR
  H["Hot: rollover, set_priority 100"] --> WM["Warm: forcemerge, shrink, allocate"]
  WM --> CD["Cold: allocate, set_priority 10"]
  CD --> DEL["Delete: delete after 30d"]
Production-Grade Policy Blueprint
The following JSON template enforces a deterministic hot-warm-cold-delete progression. It explicitly defines rollover triggers, priority adjustments, and forced merge thresholds to prevent resource contention during phase transitions. Understanding how these phases interact with underlying node roles is foundational to Elasticsearch ILM Architecture & Fundamentals.
PUT _ilm/policy/log-analytics-lifecycle
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_age": "1d",
            "max_primary_shard_size": "50gb",
            "max_docs": 10000000
          },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "2d",
        "actions": {
          "forcemerge": { "max_num_segments": 1 },
          "shrink": { "number_of_shards": 1 },
          "set_priority": { "priority": 50 },
          "allocate": {
            "require": { "data": "warm" },
            "number_of_replicas": 1
          }
        }
      },
      "cold": {
        "min_age": "7d",
        "actions": {
          "set_priority": { "priority": 10 },
          "allocate": {
            "require": { "data": "cold" },
            "number_of_replicas": 0
          }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
Critical Configuration Notes
- max_primary_shard_size over max_size: max_size measures the combined size of all primary shards in the index (replicas are not counted), so the size each shard reaches at rollover depends on the primary shard count. max_primary_shard_size caps the largest primary shard directly, which keeps shard sizing deterministic regardless of how many primaries the index has.
- shrink prerequisites: The shrink action needs a write block on the index (index.blocks.write: true) and a copy of every shard on a single node. ILM sets the write block and relocates the shards automatically, but custom routing rules or nodes above the disk watermark thresholds will stall the step.
- Priority sequencing: set_priority controls the order in which indices are recovered after a node restart, highest priority first. Without explicit priorities, warm and cold indices are not ordered behind hot indices during recovery, which can delay ingestion and put SLA commitments at risk.
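The notes above can be checked mechanically before a policy is uploaded. The following is a minimal sketch of a local lint pass, not part of any Elasticsearch tooling: the names `parse_age` and `lint_policy` are hypothetical, and the time parser is a simplification that only handles the `ms`, `s`, `m`, `h`, and `d` units used in this guide.

```python
import re

# Order in which ILM walks the phases ("frozen" listed for completeness).
PHASE_ORDER = ["hot", "warm", "cold", "frozen", "delete"]
_UNIT_SECONDS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_age(value: str) -> float:
    """Convert a time value such as '2d' or '0ms' to seconds (simplified)."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)", value)
    if match is None:
        raise ValueError(f"unsupported time value: {value!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def lint_policy(body: dict) -> list[str]:
    """Return human-readable warnings for a policy body; empty means clean."""
    warnings = []
    phases = body["policy"]["phases"]

    # min_age should not decrease from one phase to the next.
    previous_name, previous_age = None, -1.0
    for name in PHASE_ORDER:
        if name not in phases:
            continue
        age = parse_age(phases[name].get("min_age", "0ms"))
        if age < previous_age:
            warnings.append(f"{name}.min_age is lower than {previous_name}.min_age")
        previous_name, previous_age = name, age

    # Prefer a per-shard size cap over a whole-index size cap.
    rollover = phases.get("hot", {}).get("actions", {}).get("rollover", {})
    if "max_size" in rollover and "max_primary_shard_size" not in rollover:
        warnings.append("rollover uses max_size without max_primary_shard_size")

    # Every data-bearing phase should state its recovery priority.
    for name in ("hot", "warm", "cold"):
        if name in phases and "set_priority" not in phases[name].get("actions", {}):
            warnings.append(f"{name} phase has no set_priority action")
    return warnings
```

Running `lint_policy` on the request body of the blueprint above should return an empty list; swapping the warm and cold `min_age` values produces an ordering warning.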
Root-Cause Analysis: Stuck Phases & Exact Diagnostics
ILM operates as an asynchronous coordinator. When a step fails, the policy halts at that phase until manual intervention or automated retry occurs. You must verify cluster health before executing recovery commands.
Step 1: Validate Cluster State
Execute:
GET _cluster/health?pretty
Expected production output for a degraded but recoverable state:
{
  "cluster_name": "prod-es-cluster",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 6,
  "number_of_data_nodes": 4,
  "active_primary_shards": 142,
  "active_shards": 284,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 4,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 98.61
}
A red status or unassigned_shards > 0 immediately preceding an ILM transition indicates allocation failures. Cross-reference with _cat/allocation?v&h=node,shards,disk.indices,disk.used,disk.avail,disk.total,disk.percent.
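The allocation cross-check can be reduced to a small filter over the `_cat/allocation` rows. This is a sketch under stated assumptions: the rows are taken to come from the same endpoint requested with `format=json`, where numeric columns arrive as strings and the UNASSIGNED row has no disk figures. The function name `nodes_over_watermark` is hypothetical, and the default of 85 mirrors Elasticsearch's default low disk watermark; adjust it if your cluster overrides that setting.

```python
def nodes_over_watermark(rows: list[dict], threshold_percent: int = 85) -> list[str]:
    """Return node names whose disk usage is at or above threshold_percent."""
    flagged = []
    for row in rows:
        percent = row.get("disk.percent")
        if percent is None:
            # The UNASSIGNED pseudo-row carries no disk figures; skip it.
            continue
        if int(percent) >= threshold_percent:
            flagged.append(row["node"])
    return flagged
```

Any node returned here is a likely reason for an allocate or shrink step to wait, since shards will not be placed on a node above the low watermark.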
Step 2: Isolate ILM Failure Point
Execute:
GET <stuck-index-name>/_ilm/explain?pretty
Example failure output for a stalled shrink operation:
{
  "indices": {
    "logs-analytics-000042": {
      "index": "logs-analytics-000042",
      "managed": true,
      "policy": "log-analytics-lifecycle",
      "lifecycle_date_millis": 1715000000000,
      "phase": "warm",
      "phase_time_millis": 1715086400000,
      "action": "shrink",
      "action_time_millis": 1715086400000,
      "step": "ERROR",
      "step_time_millis": 1715086450000,
      "failed_step": "check-shrink-ready",
      "step_info": {
        "type": "illegal_argument_exception",
        "reason": "index.routing.allocation.require._name must match exactly one node or index.blocks.write must be true"
      }
    }
  }
}
The failed_step and step_info fields dictate the exact remediation path. Do not blindly retry; resolve the underlying constraint first.
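To triage several indices at once, the explain response can be flattened into one record per failed index. A minimal sketch, assuming the response has the shape shown above; `summarize_failures` is a hypothetical helper name, and it works on a plain dict so it can be tested without a cluster.

```python
def summarize_failures(explain: dict) -> list[dict]:
    """Extract phase, action, failed step, and reason for indices in ERROR."""
    failures = []
    for name, info in explain.get("indices", {}).items():
        if info.get("step") != "ERROR":
            continue  # index is progressing normally or is unmanaged
        step_info = info.get("step_info") or {}
        failures.append(
            {
                "index": name,
                "phase": info.get("phase"),
                "action": info.get("action"),
                "failed_step": info.get("failed_step"),
                "reason": step_info.get("reason"),
            }
        )
    return failures
```

Grouping the resulting records by `failed_step` gives a quick view of whether one root cause, such as a full warm node, is blocking many indices.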
Safe Manual Reroutes & Intervention Protocols
Rollover Stalls
If rollover fails due to alias misconfiguration:
- Verify alias points to the correct write index:
  GET _alias/logs-write-index
- Remove stale alias:
  POST _aliases
  {"actions": [{"remove": {"index": "logs-old", "alias": "logs-write-index"}}]}
- Reattach alias:
  POST _aliases
  {"actions": [{"add": {"index": "logs-new-000043", "alias": "logs-write-index", "is_write_index": true}}]}
- Force ILM retry on the affected index:
  POST /logs-new-000043/_ilm/retry
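The remove and reattach steps can also be sent as a single `_aliases` request, which Elasticsearch applies atomically, so there is no window in which the alias has no write index. A minimal sketch that only builds the request body; `build_alias_swap` is a hypothetical helper name.

```python
def build_alias_swap(alias: str, old_index: str, new_index: str) -> dict:
    """Build one _aliases body that moves the write alias in a single request."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias, "is_write_index": True}},
        ]
    }
```

With the v8 Python client, the body would typically be passed as `es.indices.update_aliases(actions=body["actions"])`; the index and alias names are whatever applies in your cluster.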
Allocation & Shrink Blockers
When allocate or shrink stalls due to node routing conflicts:
- Temporarily override routing:
  PUT <index>/_settings
  { "index.routing.allocation.require._name": "es-warm-node-01" }
- Verify shard relocation completes:
  GET _cat/shards/<index>?v
- Clear routing override post-shrink:
  PUT <index>/_settings
  { "index.routing.allocation.require._name": null }
- Retry ILM on the affected index:
  POST /<index>/_ilm/retry
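The "verify shard relocation completes" step can be expressed as a check that the target node holds a started copy of every shard, which is what shrink needs. This is a sketch under assumptions: the rows are taken to come from `_cat/shards/<index>` requested with `format=json`, using its `shard`, `state`, and `node` columns, and `copies_colocated` is a hypothetical name.

```python
def copies_colocated(rows: list[dict], node: str) -> bool:
    """True when `node` holds a STARTED copy (primary or replica) of every shard."""
    all_shards = {row["shard"] for row in rows}
    on_node = {
        row["shard"]
        for row in rows
        if row.get("node") == node and row.get("state") == "STARTED"
    }
    # An empty row list means the index was not found; treat that as not ready.
    return bool(all_shards) and all_shards == on_node
```

Polling this check before calling `_ilm/retry` avoids retrying while shards are still relocating. Note that replicas may stay unassigned while the routing override is in place, because two copies of the same shard cannot share a node.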
All manual interventions must be logged. Access control for these operations must align with Securing ILM Policies with RBAC to prevent unauthorized state mutations.
Automated Python v8+ Recovery Patterns
Manual intervention does not scale. The following script, written for the Python Elasticsearch client v8+, automates stuck-phase detection, validates cluster health, and executes safe retries with exponential backoff. It is safe to run repeatedly because it only retries indices that are currently in the ERROR step. It follows the patterns in the Python Elasticsearch Client v8 Documentation.
import logging
import time

from elasticsearch import ApiError, ConnectionError, Elasticsearch

logging.basicConfig(level=logging.INFO, format='%(asctime)s [%(levelname)s] %(message)s')
logger = logging.getLogger("ilm_recovery")


class ILMRecoveryAgent:
    def __init__(self, es_client: Elasticsearch, policy_name: str, max_retries: int = 3):
        self.es = es_client
        self.policy = policy_name
        self.max_retries = max_retries

    def check_cluster_health(self, threshold: str = "yellow") -> bool:
        try:
            health = self.es.cluster.health(wait_for_status=threshold, timeout="10s")
        except (ApiError, ConnectionError) as e:
            # An API error or an unreachable cluster is treated as unhealthy.
            logger.critical(f"Cluster health check failed: {e}. Aborting ILM recovery.")
            return False
        status = health["status"]
        if status in ("green", "yellow"):
            logger.info(f"Cluster health stable: {status}")
            return True
        logger.critical(f"Cluster health critical: {status}. Aborting ILM recovery.")
        return False

    def identify_stuck_indices(self) -> list[str]:
        stuck = []
        try:
            explain = self.es.ilm.explain_lifecycle(index="*")
            for idx, data in explain["indices"].items():
                # Only act on indices managed by the policy this agent was created for.
                if data.get("step") == "ERROR" and data.get("policy") == self.policy:
                    stuck.append(idx)
                    logger.warning(f"Stuck index detected: {idx} at phase '{data.get('phase')}' step '{data.get('failed_step')}'")
        except ApiError as e:
            logger.error(f"Failed to fetch ILM explain: {e}")
        return stuck

    def execute_safe_retry(self, index_name: str) -> bool:
        for attempt in range(1, self.max_retries + 1):
            try:
                logger.info(f"Attempting ILM retry for {index_name} (attempt {attempt}/{self.max_retries})")
                self.es.ilm.retry(index=index_name)
                time.sleep(5)  # Allow ILM coordinator to process
                verify = self.es.ilm.explain_lifecycle(index=index_name)
                if verify["indices"][index_name].get("step") != "ERROR":
                    logger.info(f"Successfully recovered {index_name}")
                    return True
            except ApiError as e:
                logger.warning(f"Retry failed for {index_name}: {e}")
            time.sleep(2 ** attempt)  # Exponential backoff: 2s, 4s, 8s
        logger.error(f"Exhausted retries for {index_name}. Escalation required.")
        return False

    def run_recovery_cycle(self):
        if not self.check_cluster_health():
            return
        stuck_indices = self.identify_stuck_indices()
        if not stuck_indices:
            logger.info("No stuck ILM phases detected.")
            return
        for idx in stuck_indices:
            self.execute_safe_retry(idx)


# Usage
if __name__ == "__main__":
    es = Elasticsearch(
        hosts=["https://prod-es-node-01:9200"],
        api_key="YOUR_API_KEY",
        verify_certs=True
    )
    agent = ILMRecoveryAgent(es, policy_name="log-analytics-lifecycle")
    agent.run_recovery_cycle()
Escalation Paths & Compliance Enforcement
When automated retries fail after three cycles, escalate immediately to the infrastructure team with the following artifacts:
- Full _cluster/health and _cat/allocation snapshots.
- Exact _ilm/explain JSON for all stuck indices.
- Node disk watermark settings (_cluster/settings with cluster.routing.allocation.disk.watermark).
- Audit trail of manual _cluster/reroute or _ilm/retry executions.
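The first three artifacts can be gathered in one pass. The following is a rough sketch, not a definitive implementation: `collect_escalation_bundle` and `_plain` are hypothetical names, `es` is assumed to be a v8 Python client instance, and the `filter_path` expression is an assumption about how your cluster nests its watermark settings. The audit trail of manual commands is not available from these APIs and has to come from your own logging.

```python
import json
from datetime import datetime, timezone


def _plain(response):
    """Unwrap client response objects; plain dicts and lists pass through."""
    return getattr(response, "body", response)


def collect_escalation_bundle(es, stuck_indices: list[str]) -> str:
    """Gather health, allocation, ILM explain, and watermark settings as JSON."""
    bundle = {
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "cluster_health": _plain(es.cluster.health()),
        "allocation": _plain(es.cat.allocation(format="json")),
        "ilm_explain": {
            name: _plain(es.ilm.explain_lifecycle(index=name))
            for name in stuck_indices
        },
        # include_defaults surfaces watermarks that were never set explicitly.
        "disk_watermarks": _plain(
            es.cluster.get_settings(
                include_defaults=True,
                filter_path="*.cluster.routing.allocation.disk",
            )
        ),
    }
    return json.dumps(bundle, indent=2, sort_keys=True)
```

Writing the returned string to a timestamped file and attaching it to the escalation ticket keeps the evidence tied to the moment the retries were exhausted.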
Do not bypass ILM by manually deleting indices or altering index.lifecycle.name without approval. Such actions corrupt retention compliance and violate data governance frameworks. Reference the official Elasticsearch Index Lifecycle Management API for authoritative parameter behavior before modifying production policies.
ILM is a deterministic state machine. Treat it as infrastructure code: version-control your policies, validate phase transitions in staging, and enforce strict RBAC boundaries. Rapid restoration depends on precise diagnostics, not guesswork.