Elasticsearch ILM Policy JSON Template for Beginners: Production-Grade Lifecycle Orchestration

Running a beginner-friendly Elasticsearch ILM policy JSON template in production requires moving beyond static documentation into deterministic lifecycle orchestration. For search engineers, log analytics teams, and DevOps practitioners, Index Lifecycle Management is not a passive retention scheduler; it is a distributed state machine that directly manipulates cluster allocation, shard routing, and index metadata. Misconfigured conditions cause stuck phases, stalled shrink operations, and retention gaps that surface as compliance audit failures. This guide provides a production-grade template, diagnostic workflows for edge-case resolution, and idempotent recovery automation.

flowchart LR
  H["Hot: rollover, set_priority 100"] --> WM["Warm: forcemerge, shrink, allocate"]
  WM --> CD["Cold: allocate, set_priority 10"]
  CD --> DEL["Delete: delete after 30d"]

Production-Grade Policy Blueprint

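The following JSON template enforces a deterministic hot-warm-cold-delete progression. It explicitly defines rollover triggers, recovery priorities, and a force-merge segment target to prevent resource contention during phase transitions. Because the hot phase rolls the index over, the min_age of every later phase is measured from the rollover time, not from index creation. The allocate actions assume that data nodes carry a custom node attribute named data with the values warm and cold. Understanding how these phases interact with underlying node roles is foundational to Elasticsearch ILM Architecture & Fundamentals.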

PUT _ilm/policy/log-analytics-lifecycle
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_age": "1d",
            "max_primary_shard_size": "50gb",
            "max_docs": 10000000
          },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "2d",
        "actions": {
          "forcemerge": { "max_num_segments": 1 },
          "shrink": { "number_of_shards": 1 },
          "set_priority": { "priority": 50 },
          "allocate": {
            "require": { "data": "warm" },
            "number_of_replicas": 1
          }
        }
      },
      "cold": {
        "min_age": "7d",
        "actions": {
          "set_priority": { "priority": 10 },
          "allocate": {
            "require": { "data": "cold" },
            "number_of_replicas": 0
          }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
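Before sending a policy like the one above, it can help to confirm that min_age never decreases from one phase to the next. The following is a minimal sketch of such a pre-flight check; the helper names parse_min_age and check_phase_ages are hypothetical, and the parser handles only a subset of the time units Elasticsearch accepts.

```python
import re

# Milliseconds per unit, for a subset of the time units accepted in min_age.
_UNIT_MS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}
# Order in which ILM executes phases.
_PHASE_ORDER = ("hot", "warm", "cold", "frozen", "delete")


def parse_min_age(value: str) -> int:
    """Convert a duration such as '2d' or '0ms' to milliseconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)", value.strip())
    if match is None:
        raise ValueError(f"Unsupported min_age value: {value!r}")
    return int(match.group(1)) * _UNIT_MS[match.group(2)]


def check_phase_ages(policy: dict) -> list[str]:
    """Return a list of ordering problems; an empty list means none were found."""
    phases = policy["policy"]["phases"]
    problems = []
    previous = None  # (phase name, min_age in ms) of the last phase seen
    for name in _PHASE_ORDER:
        if name not in phases:
            continue
        age = parse_min_age(phases[name].get("min_age", "0ms"))
        if previous is not None and age < previous[1]:
            problems.append(
                f"{name} min_age ({age} ms) is lower than "
                f"{previous[0]} min_age ({previous[1]} ms)"
            )
        previous = (name, age)
    return problems
```

Calling check_phase_ages with the request body as a Python dict returns an empty list for the 0ms, 2d, 7d, 30d progression used here.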

Critical Configuration Notes

  • max_primary_shard_size over max_size: max_size evaluates the combined size of all primary shards in the index (replicas are not counted), so the size each shard reaches at rollover depends on how many primaries the index has. max_primary_shard_size triggers on the largest primary shard, which keeps shard sizing predictable regardless of primary count.
  • shrink prerequisites: Before an index can be shrunk, it must be read-only (index.blocks.write: true) and a copy of every shard must reside on a single node. ILM sets the write block and the single-node allocation itself, but conflicting custom routing rules or target nodes that have already crossed their disk watermarks will stall the step.
  • Priority sequencing: set_priority controls the order in which indices are recovered after a node restart. Without it, hot, warm, and cold indices share the same priority and recovery order falls back to index creation date and name instead of an explicit tier ranking, which can delay recovery of write indices and degrade ingestion throughput.
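The difference between the two rollover conditions comes down to simple arithmetic. The sketch below assumes that max_size counts the combined size of all primary shards, as the Elasticsearch rollover documentation describes, and that documents are spread evenly across primaries; both function names are hypothetical.

```python
def per_primary_gb_under_max_size(max_size_gb: float, primaries: int) -> float:
    """Approximate size of each primary shard when max_size triggers rollover.

    Assumes documents are evenly distributed across primary shards.
    """
    if primaries < 1:
        raise ValueError("primaries must be at least 1")
    return max_size_gb / primaries


def total_primary_gb_under_max_primary_shard_size(limit_gb: float, primaries: int) -> float:
    """Approximate combined primary size when max_primary_shard_size triggers rollover.

    Assumes documents are evenly distributed across primary shards.
    """
    if primaries < 1:
        raise ValueError("primaries must be at least 1")
    return limit_gb * primaries


# With max_size=50gb and 5 primaries, each shard rolls over at roughly 10 GB.
small_shards = per_primary_gb_under_max_size(50, 5)
# With max_primary_shard_size=50gb and 5 primaries, the index holds roughly 250 GB
# of primary data at rollover, but every shard still sits near the 50 GB target.
large_index = total_primary_gb_under_max_primary_shard_size(50, 5)
```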

Root-Cause Analysis: Stuck Phases & Exact Diagnostics

ILM operates as an asynchronous coordinator. When a step fails, the policy halts at that phase until manual intervention or automated retry occurs. You must verify cluster health before executing recovery commands.

Step 1: Validate Cluster State

Execute:

GET _cluster/health?pretty

Expected production output for a degraded but recoverable state:

{
  "cluster_name": "prod-es-cluster",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 6,
  "number_of_data_nodes": 4,
  "active_primary_shards": 142,
  "active_shards": 284,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 4,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 98.61
}

A red status or unassigned_shards > 0 immediately preceding an ILM transition indicates allocation failures. Cross-reference with _cat/allocation?v&h=node,shards,disk.indices,disk.used,disk.avail,disk.total,disk.percent.
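The health check can be turned into a small gate that runs before any recovery command. This is a minimal sketch that works on the parsed JSON body of GET _cluster/health; the function name assess_health and the decision to treat only red as blocking are assumptions chosen to mirror the guidance above.

```python
def assess_health(health: dict) -> tuple[bool, list[str]]:
    """Return (safe_to_proceed, notes) for a parsed _cluster/health response."""
    notes = []
    status = health.get("status")
    if status not in ("green", "yellow"):
        notes.append(f"status is {status!r}: resolve primary shard allocation first")

    unassigned = health.get("unassigned_shards", 0)
    if unassigned > 0:
        notes.append(f"{unassigned} unassigned shards: cross-check _cat/allocation")

    moving = health.get("relocating_shards", 0) + health.get("initializing_shards", 0)
    if moving > 0:
        notes.append(f"{moving} shards relocating or initializing: wait for them to settle")

    # Only a non-green, non-yellow status blocks recovery; the notes flag what to inspect.
    return status in ("green", "yellow"), notes
```

For the sample response shown earlier (yellow, 4 unassigned shards), the function reports that it is safe to proceed and returns one note about the unassigned shards.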

Step 2: Isolate ILM Failure Point

Execute:

GET <stuck-index-name>/_ilm/explain?pretty

Representative failure output for a stalled shrink operation (step names and error messages vary by Elasticsearch version):

{
  "indices": {
    "logs-analytics-000042": {
      "index": "logs-analytics-000042",
      "managed": true,
      "policy": "log-analytics-lifecycle",
      "lifecycle_date_millis": 1715000000000,
      "phase": "warm",
      "phase_time_millis": 1715086400000,
      "action": "shrink",
      "action_time_millis": 1715086400000,
      "step": "ERROR",
      "step_time_millis": 1715086450000,
      "failed_step": "check-shrink-ready",
      "step_info": {
        "type": "illegal_argument_exception",
        "reason": "index.routing.allocation.require._name must match exactly one node or index.blocks.write must be true"
      }
    }
  }
}

The failed_step and step_info fields dictate the exact remediation path. Do not blindly retry; resolve the underlying constraint first.
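When many indices are involved, it is easier to reduce the explain response to the few fields that drive remediation. The following sketch assumes the response has the shape shown above; summarize_failures is a hypothetical helper name.

```python
def summarize_failures(explain: dict) -> list[dict]:
    """Extract phase, action, failed step, and reason for every index in the ERROR step."""
    failures = []
    for name, info in explain.get("indices", {}).items():
        if info.get("step") != "ERROR":
            continue  # index is progressing normally
        step_info = info.get("step_info") or {}
        failures.append(
            {
                "index": name,
                "phase": info.get("phase"),
                "action": info.get("action"),
                "failed_step": info.get("failed_step"),
                "error_type": step_info.get("type"),
                "reason": step_info.get("reason"),
            }
        )
    return failures
```

Each returned entry can be attached to an incident ticket or used to decide which constraint to fix before calling _ilm/retry.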

Safe Manual Reroutes & Intervention Protocols

Rollover Stalls

If rollover fails due to alias misconfiguration:

  1. Verify alias points to the correct write index: GET _alias/logs-write-index
  2. Remove stale alias: POST _aliases {"actions": [{"remove": {"index": "logs-old", "alias": "logs-write-index"}}]}
  3. Reattach alias: POST _aliases {"actions": [{"add": {"index": "logs-new-000043", "alias": "logs-write-index", "is_write_index": true}}]}
  4. Force ILM retry on the affected index: POST /logs-new-000043/_ilm/retry
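Steps 2 and 3 can also be combined into one _aliases request, because all actions in a single request are applied atomically and the alias is never left without a write index. The sketch below only builds the request body; build_alias_swap is a hypothetical helper, and the index and alias names are the ones used in the steps above.

```python
def build_alias_swap(old_index: str, new_index: str, alias: str) -> dict:
    """Build an _aliases body that moves the write alias in one atomic request."""
    return {
        "actions": [
            # Detach the alias from the stale index.
            {"remove": {"index": old_index, "alias": alias}},
            # Attach it to the new index and mark that index as the write target.
            {"add": {"index": new_index, "alias": alias, "is_write_index": True}},
        ]
    }


body = build_alias_swap("logs-old", "logs-new-000043", "logs-write-index")
# With the v8 Python client, the actions can be sent as:
#   es.indices.update_aliases(actions=body["actions"])
```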

Allocation & Shrink Blockers

When allocate or shrink stalls due to node routing conflicts:

  1. Temporarily override routing:
  PUT <index>/_settings
  { "index.routing.allocation.require._name": "es-warm-node-01" }
  2. Verify shard relocation completes: GET _cat/shards/<index>?v
  3. Retry ILM on the affected index: POST /<index>/_ilm/retry
  4. Clear the routing override once the shrink has completed: PUT <index>/_settings { "index.routing.allocation.require._name": null }
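The relocation check can be scripted rather than read by eye. This sketch assumes the rows come from GET _cat/shards/<index>?format=json, where each row carries the shard, state, and node columns; every_shard_on_node is a hypothetical helper, and the node name matches the override used above.

```python
def every_shard_on_node(rows: list[dict], node: str) -> bool:
    """Return True when a STARTED copy of every shard number lives on the given node."""
    all_shards = {row["shard"] for row in rows}
    ready = {
        row["shard"]
        for row in rows
        if row.get("node") == node and row.get("state") == "STARTED"
    }
    # An empty listing means the index was not found, so it is never "ready".
    return bool(all_shards) and ready == all_shards


rows = [
    {"index": "logs-analytics-000042", "shard": "0", "prirep": "p", "state": "STARTED", "node": "es-warm-node-01"},
    {"index": "logs-analytics-000042", "shard": "1", "prirep": "p", "state": "RELOCATING", "node": "es-warm-node-02"},
]
still_moving = every_shard_on_node(rows, "es-warm-node-01")  # shard 1 has not arrived yet
```

A copy may be either a primary or a replica; what matters for shrink is that the target node holds one started copy of each shard.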

All manual interventions must be logged. Access control for these operations must align with Securing ILM Policies with RBAC to prevent unauthorized state mutations.

Automated Recovery Patterns with the Python Client (v8+)

Manual intervention does not scale. The following script, written against the official Elasticsearch Python client (v8+), automates stuck-phase detection, checks cluster health before acting, and retries failed steps with exponential backoff. Re-running it is safe because it only retries indices that are currently in the ERROR step. It follows the patterns in the Python Elasticsearch Client v8 Documentation.

import logging
import time
from elasticsearch import Elasticsearch, ApiError, ConnectionError

logging.basicConfig(level=logging.INFO, format='%(asctime)s [%(levelname)s] %(message)s')
logger = logging.getLogger("ilm_recovery")

class ILMRecoveryAgent:
    def __init__(self, es_client: Elasticsearch, policy_name: str, max_retries: int = 3):
        self.es = es_client
        self.policy = policy_name
        self.max_retries = max_retries

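    def check_cluster_health(self, threshold: str = "yellow") -> bool:
        # Wait up to 10s for the cluster to reach at least `threshold`.
        try:
            health = self.es.cluster.health(wait_for_status=threshold, timeout="10s")
        except (ApiError, ConnectionError) as e:
            logger.critical(f"Cluster health check failed: {e}. Aborting ILM recovery.")
            return False
        status = health["status"]
        # timed_out is True when the requested threshold was not reached in time.
        if status in ("green", "yellow") and not health["timed_out"]:
            logger.info(f"Cluster health stable: {status}")
            return True
        logger.critical(f"Cluster health critical: {status}. Aborting ILM recovery.")
        return False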

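    def identify_stuck_indices(self) -> list[str]:
        stuck = []
        try:
            # only_errors limits the response to indices that ILM reports as failed.
            explain = self.es.ilm.explain_lifecycle(index="*", only_errors=True)
            for idx, data in explain["indices"].items():
                # Only act on indices governed by the policy this agent was created for.
                if data.get("policy") != self.policy:
                    continue
                if data.get("step") == "ERROR":
                    stuck.append(idx)
                    logger.warning(f"Stuck index detected: {idx} at phase '{data.get('phase')}' step '{data.get('failed_step')}'")
        except ApiError as e:
            logger.error(f"Failed to fetch ILM explain: {e}")
        return stuck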

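    def execute_safe_retry(self, index_name: str) -> bool:
        for attempt in range(1, self.max_retries + 1):
            try:
                logger.info(f"Attempting ILM retry for {index_name} (attempt {attempt}/{self.max_retries})")
                self.es.ilm.retry(index=index_name)
                time.sleep(5)  # Allow ILM coordinator to process
                verify = self.es.ilm.explain_lifecycle(index=index_name)
                state = verify["indices"].get(index_name)
                if state is None:
                    logger.error(f"{index_name} no longer appears in ILM explain. Escalation required.")
                    return False
                if state.get("step") != "ERROR":
                    logger.info(f"Successfully recovered {index_name}")
                    return True
                logger.warning(f"{index_name} returned to ERROR at step '{state.get('failed_step')}'")
            except ApiError as e:
                logger.warning(f"Retry failed for {index_name}: {e}")
            # Back off exponentially whether the call raised or the step failed again.
            if attempt < self.max_retries:
                time.sleep(2 ** attempt)
        logger.error(f"Exhausted retries for {index_name}. Escalation required.")
        return False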

    def run_recovery_cycle(self):
        if not self.check_cluster_health():
            return
        stuck_indices = self.identify_stuck_indices()
        if not stuck_indices:
            logger.info("No stuck ILM phases detected.")
            return
        for idx in stuck_indices:
            self.execute_safe_retry(idx)

# Usage
if __name__ == "__main__":
    es = Elasticsearch(
        hosts=["https://prod-es-node-01:9200"],
        api_key="YOUR_API_KEY",
        verify_certs=True
    )
    agent = ILMRecoveryAgent(es, policy_name="log-analytics-lifecycle")
    agent.run_recovery_cycle()

Escalation Paths & Compliance Enforcement

When automated retries fail after three cycles, escalate immediately to the infrastructure team with the following artifacts:

  1. Full _cluster/health and _cat/allocation snapshots.
  2. Exact _ilm/explain JSON for all stuck indices.
  3. Node disk watermark settings (GET _cluster/settings?include_defaults=true, filtered to cluster.routing.allocation.disk.watermark.*).
  4. Audit trail of manual _cluster/reroute or _ilm/retry executions.
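For the watermark artifact, the values that matter are the effective ones after overrides. The sketch below assumes the input is the parsed body of GET _cluster/settings?include_defaults=true&flat_settings=true and relies on the documented precedence of transient over persistent over default settings; effective_watermarks is a hypothetical helper name.

```python
WATERMARK_PREFIX = "cluster.routing.allocation.disk.watermark."


def effective_watermarks(settings: dict) -> dict[str, str]:
    """Merge watermark settings so that the highest-precedence scope wins."""
    merged: dict[str, str] = {}
    # Later scopes override earlier ones: defaults < persistent < transient.
    for scope in ("defaults", "persistent", "transient"):
        for key, value in settings.get(scope, {}).items():
            if key.startswith(WATERMARK_PREFIX):
                merged[key] = value
    return merged


example = {
    "defaults": {
        "cluster.routing.allocation.disk.watermark.low": "85%",
        "cluster.routing.allocation.disk.watermark.high": "90%",
    },
    "persistent": {"cluster.routing.allocation.disk.watermark.low": "80%"},
    "transient": {},
}
watermarks = effective_watermarks(example)
```

In this example the persistent override of 80% replaces the default low watermark, while the high watermark keeps its default.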

Do not bypass ILM by manually deleting indices or altering index.lifecycle.name without approval. Such actions corrupt retention compliance and violate data governance frameworks. Reference the official Elasticsearch Index Lifecycle Management API for authoritative parameter behavior before modifying production policies.

ILM is a deterministic state machine. Treat it as infrastructure code: version-control your policies, validate phase transitions in staging, and enforce strict RBAC boundaries. Rapid restoration depends on precise diagnostics, not guesswork.