Monitoring ILM Execution & Error States

Index Lifecycle Management runs as a distributed state machine, not a background daemon that reports its own health. A managed index advances only when the deployment-level coordinator wakes on its poll interval, re-evaluates each index’s phase and step, and finds that allocation, disk, and snapshot preconditions all permit the next action. When one of those preconditions fails, the index does not crash — it parks in an ERROR step or sits silently in a WAITING step, and nothing surfaces unless you are watching for it. Left unmonitored, that silence becomes a hot tier that never rolls over and fills to its flood-stage watermark, or a delete phase that never fires and blows past a compliance retention window. This page is about making lifecycle execution observable: which endpoints carry the authoritative state, how to poll them without hammering the deployment, what to alert on, and how to drive a stuck index back into motion.

Monitoring closes the control loop behind ILM policy design and lifecycle synchronization: you declare a policy, apply it idempotently, and then continuously reconcile what the deployment is actually running against what you declared. The signal that closes that loop is the explain API, correlated with allocation diagnostics. Everything below is how to read that signal, alert on it, and act on it before a retention incident becomes a data-loss incident.

Prerequisites

Elasticsearch 8.x cluster with the ILM coordinator running — confirm GET _ilm/status reports "operation_mode": "RUNNING" before trusting any explain output.
elasticsearch-py v8.0+ installed — this page uses the v8 async client surface (ilm.explain_lifecycle, ilm.retry, keyword-argument bodies, typed exceptions), not the legacy body= pattern or client.indices.* shims.
Data-tier node roles assigned (data_hot, data_warm, data_cold) so a stalled transition points to a real capacity or tagging fault rather than an undefined target tier, per the hot-warm-cold architecture.
Read access to GET _ilm/explain, GET _cat/indices, GET _cluster/allocation/explain, and GET _ilm/status from your monitoring service account.
A destination for structured telemetry — a dedicated monitoring index, a log-shipping pipeline, or an alert manager — so step transitions and error states are captured, not just printed.
manage_ilm on the account that calls POST <index>/_ilm/retry, scoped separately from read-only monitoring per securing ILM policies with RBAC.
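The first and third prerequisites can be checked mechanically before the monitor starts. The sketch below is a minimal illustration, not a complete readiness probe: it assumes you have already fetched the bodies of GET _ilm/status and GET _nodes (for example via client.ilm.get_status() and client.nodes.info()), and the function name preflight_problems is hypothetical. It also assumes that a node with the generic data role can serve any tier.

```python
# Tier roles named in the prerequisites above.
TIER_ROLES = ("data_hot", "data_warm", "data_cold")


def preflight_problems(ilm_status: dict, nodes_info: dict) -> list[str]:
    """Return human-readable problems; an empty list means the checks passed."""
    problems = []

    # Explain output is only trustworthy while the coordinator is RUNNING.
    mode = ilm_status.get("operation_mode")
    if mode != "RUNNING":
        problems.append(f"ILM operation_mode is {mode}, expected RUNNING")

    # Collect every role advertised by any node in the cluster.
    roles = set()
    for node in nodes_info.get("nodes", {}).values():
        roles.update(node.get("roles", []))

    # Assumption: the generic "data" role covers all tiers, so skip the
    # per-tier check when at least one node carries it.
    if "data" not in roles:
        for tier in TIER_ROLES:
            if tier not in roles:
                problems.append(f"no node has role {tier}")
    return problems
```

Running this once at startup, and refusing to emit ILM alerts while the list is non-empty, keeps a stopped coordinator or an untagged tier from showing up later as a wave of per-index drift warnings.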

Architecture: ILM as an Observable State Machine

Every managed index occupies exactly one (phase, action, step) tuple at any moment. The coordinator polls on indices.lifecycle.poll_interval (default 10m), evaluates whether the current step’s preconditions are met, and either advances the index or leaves it where it is. Observability means turning that opaque internal position into external telemetry: on each of your polling cycles, read the tuple, decide whether it is healthy, drifting, or errored, and emit a structured event. The decision gate below is the whole model — poll, classify, act.

Three states matter. An index in ERROR has hit a hard failure — the explain response carries a failed_step string naming the step that broke and a step_info object with the error type and reason. An index that is not errored but whose time-in-step exceeds a phase-appropriate SLA is drifting: it is technically healthy from ILM’s point of view but has been waiting on a precondition (usually allocation or a watermark) far longer than the phase should take. Everything else is healthy and needs only a heartbeat log. This mirrors the phase mechanics established when building custom ILM policies via the API: each phase runs an ordered, non-negotiable sequence of actions, and monitoring watches the boundaries between them.
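The three-way decision can be written as a small pure function over one entry of the explain response. This is a sketch under the assumptions stated in the paragraph above: the function name classify_index and the single max_step_ms threshold are illustrative, and a real monitor would likely vary the threshold per phase.

```python
def classify_index(state: dict, now_ms: int, max_step_ms: int) -> str:
    """Classify one explain entry as "errored", "drifting", or "healthy"."""
    # Hard failure: the step is the literal ERROR, or a failed step is named.
    if state.get("step") == "ERROR" or state.get("failed_step"):
        return "errored"

    # Drift: not errored, but time-in-step exceeds the allowed duration.
    step_time_ms = state.get("step_time_millis")
    if step_time_ms is not None and now_ms - step_time_ms > max_step_ms:
        return "drifting"

    # Everything else only needs a heartbeat log.
    return "healthy"
```

Keeping the classifier free of I/O makes it easy to unit test against captured explain payloads, independently of the polling loop that feeds it.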

Configuration Reference: The Explain Payload

GET <index>/_ilm/explain is the authoritative source of lifecycle state; never infer phase from index age or naming. The response below shows the fields that drive alerting. The timestamps are epoch milliseconds. failed_step appears only when the index is in an error state, while step_info can also appear on waiting steps that have something to report.

GET logs-app-000042/_ilm/explain
{
  "indices": {
    "logs-app-000042": {
      "managed": true,
      "policy": "logs-app-policy",
      "phase": "warm",
      "action": "allocate",
      "step": "ERROR",
      "phase_time_millis": 1709200000000,
      "action_time_millis": 1709200500000,
      "step_time_millis": 1709200500000,
      "failed_step": "check-allocation",
      "step_info": {
        "type": "illegal_state_exception",
        "reason": "node with [data:warm] attribute not found"
      },
      "phase_execution": { "policy": "logs-app-policy" }
    }
  }
}
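Because the three *_time_millis fields are epoch milliseconds, they need one conversion before they are useful in alerts. The following sketch shows that conversion and the time-in-step calculation using the step_time_millis value from the payload above; the helper names are illustrative.

```python
from datetime import datetime, timedelta, timezone


def millis_to_utc(ms: int) -> datetime:
    """Convert an epoch-millisecond timestamp to an aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


def time_in_step(state: dict, now: datetime) -> timedelta:
    """How long the index has been in its current step as of `now`."""
    return now - millis_to_utc(state["step_time_millis"])


state = {"step_time_millis": 1709200500000}
entered = millis_to_utc(state["step_time_millis"])
print(entered.isoformat())  # 2024-02-29T09:55:00+00:00
print(time_in_step(state, entered + timedelta(minutes=45)))  # 0:45:00
```

Using timezone-aware datetimes throughout avoids the off-by-hours errors that appear when a monitor host's local time zone leaks into the subtraction.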

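For cluster-wide sweeps, the explain call accepts index="*", but on large clusters prefer a cheap tabular probe first and only fetch full explain for indices that look wrong. The _cat/indices request below asks for phase and step as columns without building the full per-index execution object. Column availability varies by version, so confirm that the ilm.* columns are listed in GET _cat/indices?help before relying on this probe; where they are not, GET */_ilm/explain?only_errors=true is a narrower alternative to a full explain sweep: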

GET _cat/indices?v&h=index,health,status,ilm.phase,ilm.step&s=ilm.step

The endpoints that together form the monitoring surface:

| Endpoint | What it tells you |
| --- | --- |
| `GET <index>/_ilm/explain` | Authoritative `phase`, `action`, `step`, `step_time_millis`, plus `failed_step` and `step_info` on error. This is the primary drift and error signal. |
| `GET _cat/indices?v&h=index,ilm.phase,ilm.step` | Cheap tabular phase/step for every index; ideal for a first-pass sweep before heavy explain calls. |
| `GET _cluster/allocation/explain` | Why a specific shard is unassigned or cannot relocate. This is the root-cause tool when a transition stalls on allocation. |
| `GET _ilm/status` | Whether the coordinator is `RUNNING`, `STOPPING`, or `STOPPED` cluster-wide; a `STOPPED` coordinator explains why nothing is advancing. |

Step-by-Step Implementation: Async Polling & Structured Telemetry

The monitoring loop reads the tuple, classifies it, and emits a structured event per index. Build it on the v8 async client so a single process can poll a large cluster without blocking, and emit JSON so log-analytics pipelines can index every transition. The pattern below polls on a fixed interval, tracks consecutive failures so a one-poll blip does not page anyone, and derives time-in-step from step_time_millis.

Initialize the client with explicit timeouts, bounded retries, and TLS verification so transient network faults never masquerade as ILM failures.
Sweep every index with explain_lifecycle(index="*") on a fixed interval; the response also includes unmanaged indices, marked "managed": false.
Classify each index as errored, drifting, or healthy and emit a structured event carrying index, phase, step, and elapsed time.
Escalate on persistence — treat a step that stays in ERROR across two or more polls as a stuck index worth paging, not a transient.

import asyncio
import json
import logging
from datetime import datetime, timezone
from elasticsearch import AsyncElasticsearch
from elasticsearch.exceptions import ConnectionTimeout, ConnectionError, ApiError

# Structured JSON formatter so every ILM transition is machine-indexable.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        log_obj = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
            "module": record.module,
            "func": record.funcName,
        }
        if hasattr(record, "ilm_context"):
            log_obj.update(record.ilm_context)
        return json.dumps(log_obj)

logger = logging.getLogger("ilm_monitor")
logger.setLevel(logging.INFO)
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)

async def poll_ilm_state(client: AsyncElasticsearch, poll_interval: int = 30,
                         max_step_duration_ms: int = 3_600_000):
    consecutive_failures: dict[str, int] = {}

    while True:
        try:
            # Sweep every index in one call; unmanaged indices come back too, marked
            # "managed": false. The coordinator polls on its own cadence, so a
            # monitoring interval of 30-60s reads settled state cheaply.
            response = await client.ilm.explain_lifecycle(index="*")
            indices = response.get("indices", {})
            now_ms = datetime.now(timezone.utc).timestamp() * 1000

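            for idx, state in indices.items():
                # Unmanaged indices carry no phase or step; flag them explicitly
                # instead of letting them fall through to the healthy branch.
                if not state.get("managed"):
                    consecutive_failures.pop(idx, None)
                    logger.warning("ILM_UNMANAGED", extra={"ilm_context": {"index": idx}})
                    continue
                phase = state.get("phase")
                step = state.get("step")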
                # `failed_step` is the string name of the errored step (set when step == "ERROR").
                failed_step = state.get("failed_step")
                # Elapsed time in the current step, derived from step_time_millis.
                step_time_ms = state.get("step_time_millis", 0)
                exec_time_ms = int(now_ms - step_time_ms) if step_time_ms else 0

                ctx = {"index": idx, "phase": phase, "step": step,
                       "execution_time_ms": exec_time_ms}

                if step == "ERROR" or failed_step:
                    step_info = state.get("step_info", {})
                    ctx["failed_step_name"] = failed_step
                    ctx["error_reason"] = step_info.get("reason", step_info.get("type", "unknown"))
                    consecutive_failures[idx] = consecutive_failures.get(idx, 0) + 1
                    # A single errored poll may be a transient; two in a row is a stuck index.
                    if consecutive_failures[idx] >= 2:
                        logger.error("ILM_STUCK", extra={"ilm_context": ctx})
                    else:
                        logger.warning("ILM_STEP_FAILED", extra={"ilm_context": ctx})
                else:
                    consecutive_failures.pop(idx, None)
                    if exec_time_ms > max_step_duration_ms:
                        logger.warning("ILM_DRIFT_DETECTED", extra={"ilm_context": ctx})
                    else:
                        logger.info("ILM_HEALTHY", extra={"ilm_context": ctx})

        except (ConnectionTimeout, ConnectionError) as e:
            # Network faults are not ILM faults — log distinctly so alerts don't conflate them.
            logger.error(f"CLUSTER_CONNECTIVITY_ERROR: {e}")
        except ApiError as e:
            logger.error(f"API_ERROR: status={e.status_code} error={e.error}")

        await asyncio.sleep(poll_interval)

async def main():
    es = AsyncElasticsearch(
        hosts=["https://es-prod-cluster:9200"],
        api_key="YOUR_BASE64_API_KEY",
        request_timeout=15,
        max_retries=3,
        retry_on_timeout=True,
        verify_certs=True,
    )
    try:
        await poll_ilm_state(es, poll_interval=45, max_step_duration_ms=7_200_000)
    finally:
        await es.close()

if __name__ == "__main__":
    asyncio.run(main())

Two details keep this production-safe. The classifier separates connectivity faults (ConnectionTimeout, ConnectionError) from API faults (ApiError) so a flaky network does not raise a false ILM alert. And failed_step is read as the plain string it is — the error detail lives in the separate step_info object, not inside step, which is a common source of parsing bugs. Driving the actual recovery once a stuck index is detected — the classification of retryable versus terminal failures and the backoff strategy — is covered in handling ILM step execution failures programmatically.

Verification

Confirm the monitor sees what the deployment sees. First, prove the coordinator itself is running — if it is STOPPED, every index is frozen regardless of policy:

GET _ilm/status

{ "operation_mode": "RUNNING" }

Next, isolate any index whose step is not progressing. A step of ERROR, or a step that has not changed across several polls, is the signal to escalate:

GET _cat/indices?v&h=index,health,ilm.phase,ilm.step&s=ilm.step

For any index flagged in that sweep, pull the full explain object and read the failure context directly — the failed_step, plus step_info.type and step_info.reason:

GET logs-app-000042/_ilm/explain?only_errors=true

The only_errors=true query parameter restricts the response to indices currently in an error step. Applied to a wildcard target such as */_ilm/explain, it turns a deployment-wide explain into a targeted incident list. When explain points at an allocation problem, confirm the root cause before retrying:

GET _cluster/allocation/explain
{ "index": "logs-app-000042", "shard": 0, "primary": true }

A "managed": false in any explain response is its own finding: the index copied its template settings but never adopted a policy — often because an automated reindexing pipeline rebuilt it without re-attaching index.lifecycle.name — so it will silently ignore every phase transition and must be re-bound to the policy.
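Unmanaged indices are easy to miss because they produce no errors, so it helps to list them directly from the explain response. The sketch below is illustrative: the function name unmanaged_indices and the glob-style pattern argument are assumptions, added so that indices you never intended to manage can be excluded by name.

```python
from fnmatch import fnmatchcase


def unmanaged_indices(explain_response: dict, pattern: str = "*") -> list[str]:
    """Names matching `pattern` whose explain entry reports managed: false."""
    return sorted(
        name
        for name, state in explain_response.get("indices", {}).items()
        if fnmatchcase(name, pattern) and not state.get("managed", False)
    )


response = {
    "indices": {
        "logs-app-000042": {"managed": True},
        "logs-app-000043": {"managed": False},
        "metrics-1": {"managed": False},
    }
}
print(unmanaged_indices(response, "logs-app-*"))  # ['logs-app-000043']

# Re-binding is a settings update. With the v8 Python client this would look
# roughly like the call below (policy name taken from the example payload):
#   await client.indices.put_settings(
#       index="logs-app-000043",
#       settings={"index.lifecycle.name": "logs-app-policy"},
#   )
```

Restricting the check to a naming pattern matters in practice: a wildcard explain also returns indices that were never meant to have a policy, and alerting on those trains people to ignore the signal.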

Threshold Tuning & SLO Alignment

Alert thresholds must track the retention SLO each phase serves, not an arbitrary clock. Set time-in-step limits per phase, and treat a persistent failed_step (present across two or more polls) as a page rather than a warning. Practical starting points:

Hot-phase rollover. Alert if an index sits in check-rollover-ready longer than the ingest baseline predicts — for a daily-cutting index, a step older than ~24 hours signals an ingestion-velocity mismatch or a fractured write alias. Tune the trigger against the same numbers used to set rollover conditions.
Warm/cold allocation. Alert if step stays on check-allocation (the allocate action) or check-shrink-allocation (the shrink action) beyond ~30 minutes. This almost always means no node carries the data-tier role or attribute the policy routes to, or the destination nodes are past cluster.routing.allocation.disk.watermark.low, beyond which Elasticsearch stops allocating shards to them. Where a tier can legitimately run out of room, a fallback routing strategy keeps shards allocated instead of stalling the transition.
Snapshot-bound steps. Alert if step hangs on wait-for-snapshot (delete phase) or mount-snapshot (searchable-snapshot action). Correlate with repository I/O and network throughput to the backup destination — these steps depend on an external system that ILM cannot hurry.
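The starting points above translate naturally into a per-step threshold table. In this sketch the 24-hour and 30-minute values come from the list above; the two-hour snapshot limits and the one-hour default are assumptions chosen for illustration, and should be replaced with numbers derived from your own repository throughput and retention SLOs.

```python
from datetime import timedelta

# Time-in-step limits keyed by step name.
STEP_SLA = {
    "check-rollover-ready": timedelta(hours=24),       # daily-cutting index
    "check-allocation": timedelta(minutes=30),
    "check-shrink-allocation": timedelta(minutes=30),
    "wait-for-snapshot": timedelta(hours=2),           # assumed value
    "mount-snapshot": timedelta(hours=2),              # assumed value
}
DEFAULT_SLA = timedelta(hours=1)                        # assumed fallback


def sla_breached(step: str, elapsed: timedelta) -> bool:
    """True when the index has sat in `step` longer than its limit."""
    return elapsed > STEP_SLA.get(step, DEFAULT_SLA)
```

A lookup keyed by step rather than by phase keeps the rollover limit from masking a stalled allocation in the same phase, at the cost of maintaining one entry per step you care about.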

On polling cadence and heap: your monitor’s interval is independent of indices.lifecycle.poll_interval. Polling explain every 30–60 seconds reads settled state cheaply, but calling explain_lifecycle(index="*") builds a per-index execution object for every managed index, so on clusters with thousands of indices prefer the _cat/indices sweep first and reserve full explain for flagged indices — otherwise the monitoring load itself adds master-node and heap pressure during exactly the incidents you are trying to observe. The webhook and retry orchestration that turns these alerts into automated remediation is detailed in automating phase transitions with Python.

Troubleshooting

When an index parks in ERROR, the fix is almost never retry on its own — a blind retry against an unresolved cause simply loops straight back to ERROR. Diagnose the step_info, resolve the underlying constraint, then re-run the failed step with POST <index>/_ilm/retry (which applies only to indices in an error step and does not bypass min_age).

Common failure modes and resolution paths

| Symptom (`step`) | Root cause | Resolution |
| --- | --- | --- |
| `check-rollover-ready` never advances | No rollover condition met, or the write alias lost `is_write_index` | Verify ingest against thresholds; check `_cat/aliases`; force a cut with `POST <alias>/_rollover` if capacity demands it |
| `check-allocation` stuck | No node carries the required data-tier role or attribute, or destination nodes are past `disk.watermark.low` | Fix node roles or free tier capacity; confirm with `_cluster/allocation/explain`, then `_ilm/retry` |
| `forcemerge` stalled | Circuit breaker tripping or insufficient warm-tier heap | Raise `max_num_segments` or scale the warm tier; watch `_nodes/stats/breaker` for the `parent` breaker |
| `wait-for-snapshot` hangs | Snapshot repository unreachable or a snapshot already in progress | Validate the repository mount; check `_snapshot/_status` for an active job before retrying |
| `step: ERROR`, `step_info.reason: "unable to remove policy"` | Policy deleted or renamed out from under a live index | Re-apply the canonical policy artifact, then `_ilm/retry` |
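The table can double as a machine-readable runbook that tells the on-call engineer which request to run first. The mapping below is a hypothetical sketch built only from the endpoints named in the table; the names DIAGNOSTICS and next_diagnostic are illustrative.

```python
# Failed or stalled step name -> first diagnostic request to run.
DIAGNOSTICS = {
    "check-rollover-ready": "GET _cat/aliases",
    "check-allocation": "GET _cluster/allocation/explain",
    "forcemerge": "GET _nodes/stats/breaker",
    "wait-for-snapshot": "GET _snapshot/_status",
}
# Anything unrecognized falls back to reading the full explain object.
FALLBACK = "GET <index>/_ilm/explain"


def next_diagnostic(state: dict) -> str:
    """Pick the diagnostic for an explain entry."""
    # In an error state `step` is "ERROR" and `failed_step` names what broke,
    # so prefer `failed_step` when it is present.
    step = state.get("failed_step") or state.get("step")
    return DIAGNOSTICS.get(step, FALLBACK)
```

Attaching the returned string to the alert payload saves a lookup during an incident, and the fallback keeps unknown steps from producing an empty hint.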

Production debugging flow

Isolate the stuck index. GET _cat/indices?v&h=index,ilm.phase,ilm.step and pick out any ilm.step that is ERROR or not advancing.
Extract the failure context. GET <index>/_ilm/explain?only_errors=true and read failed_step, step_info.type, and step_info.reason.
Confirm the root cause. For allocation errors, GET _cluster/allocation/explain with the index and primary shard; for snapshot errors, GET _snapshot/_status.
Resolve the constraint. Fix disk headroom, node tags, snapshot connectivity, or heap pressure — the actual blocker, not the symptom.
Resume the lifecycle. POST <index>/_ilm/retry and confirm the next poll advances the step in explain. Note there is no retry_lifecycle client method — recovery is the per-index client.ilm.retry(index=...) call.
Record the transition. Emit the recovery to the same structured telemetry stream so mean-time-to-recover is measurable across incidents.
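The ordering in this flow, resolve first and retry second, can be enforced in code so that automation never issues a blind retry. The sketch below assumes the caller supplies a cause_resolved flag from its own diagnostics; the function name retry_if_resolved is hypothetical, and the FakeClient classes are stand-ins for AsyncElasticsearch so the example runs without a cluster. The only client call used is client.ilm.retry(index=...).

```python
import asyncio


async def retry_if_resolved(client, index: str, state: dict,
                            cause_resolved: bool) -> bool:
    """Call _ilm/retry only for an errored index whose blocker is fixed."""
    if state.get("step") != "ERROR":
        return False  # retry applies only to indices in an error step
    if not cause_resolved:
        return False  # a blind retry would loop straight back to ERROR
    await client.ilm.retry(index=index)
    return True


# Stand-ins for the real client, recording which indices were retried.
class FakeIlm:
    def __init__(self):
        self.retried = []

    async def retry(self, *, index):
        self.retried.append(index)


class FakeClient:
    def __init__(self):
        self.ilm = FakeIlm()


client = FakeClient()
errored = {"step": "ERROR", "failed_step": "check-allocation"}
print(asyncio.run(retry_if_resolved(client, "logs-app-000042", errored, False)))  # False
print(asyncio.run(retry_if_resolved(client, "logs-app-000042", errored, True)))   # True
print(client.ilm.retried)  # ['logs-app-000042']
```

Returning a boolean rather than raising lets the caller log a skipped retry and an issued retry as distinct telemetry events, which is what makes mean-time-to-recover measurable.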

FAQ

What is the difference between `step`, `failed_step`, and `step_info`?

step is the index’s current position in the state machine; when a step fails, its value becomes the literal string ERROR. failed_step is a separate field that names which step broke (for example check-allocation), and it appears only in the error case. step_info is an object carrying the machine-readable type (an exception class like illegal_state_exception) and a human-readable reason. Alert on step == "ERROR", page on the failed_step name, and put step_info.reason in the alert body so the on-call engineer sees the cause without a second API call.
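As a sketch of that alerting rule, the function below turns one explain entry into an alert dictionary. The field names severity, title, and body, and the name build_alert, are assumptions about what an alert manager might accept; the two-poll paging threshold mirrors the escalation rule used elsewhere on this page.

```python
from typing import Optional


def build_alert(index: str, state: dict, consecutive_polls: int) -> Optional[dict]:
    """Return an alert for an errored index, or None when there is no error."""
    if state.get("step") != "ERROR":
        return None

    info = state.get("step_info") or {}
    failed = state.get("failed_step", "unknown")
    return {
        # Two or more consecutive errored polls is treated as a stuck index.
        "severity": "page" if consecutive_polls >= 2 else "warning",
        "title": f"ILM step {failed} failed on {index}",
        # Prefer the human-readable reason; fall back to the exception type.
        "body": info.get("reason") or info.get("type") or "unknown",
    }
```

Building the alert from all three fields in one place keeps the common parsing mistake, looking for error detail inside step, out of downstream consumers.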

How often should the monitor poll `_ilm/explain`?

Independently of indices.lifecycle.poll_interval. The coordinator advances indices on its own 10-minute default cadence, so polling explain every 30–60 seconds is enough to catch an ERROR promptly without adding meaningful load. On clusters with thousands of managed indices, do a cheap _cat/indices sweep at that cadence and only issue a full explain_lifecycle for indices whose step looks wrong or errored — a deployment-wide explain builds a per-index execution object for every index and can add real master-node pressure if run in a tight loop.

Why is my index `"managed": false` even though a policy exists?

The index inherited template settings but never had index.lifecycle.name applied, so ILM ignores it entirely — no phase transitions, no rollover, no deletion. The usual causes are a reindex or restore that recreated the index from a template lacking the lifecycle setting, or a manual index creation that skipped it. Re-attach the policy with an update-settings call (or recreate through the correct template) and confirm "managed": true in a fresh explain. A managed:false index is invisible to normal ILM alerting, which is why the monitor should flag it explicitly.

Does calling `_ilm/retry` skip the phase `min_age` wait?

No. POST <index>/_ilm/retry re-runs only the step that is currently in ERROR; it does not fast-forward the lifecycle or bypass a min_age timer. If you need to deliberately push an index from one step to another — for example to skip past a corrected step after fixing a policy — that is the separate POST _ilm/move/<index> API with explicit current_step and next_step bodies, which should be used sparingly and only after the root cause is verified.
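To make the shape of that move request concrete, the sketch below builds the body from an explain entry. It assumes current_step should be copied from what explain reports so that it matches the index's actual position; the helper names are illustrative, and the target step shown is only an example, not a recommendation for any particular policy.

```python
def step_key(phase: str, action: str, name: str) -> dict:
    """One step identifier in the form the move API expects."""
    return {"phase": phase, "action": action, "name": name}


def build_move_body(state: dict, next_step: dict) -> dict:
    """Body for POST _ilm/move/<index>, with current_step taken from explain."""
    current = step_key(state["phase"], state["action"], state["step"])
    return {"current_step": current, "next_step": next_step}


state = {"phase": "warm", "action": "allocate", "step": "ERROR"}
body = build_move_body(state, step_key("warm", "forcemerge", "forcemerge"))
print(body["current_step"])  # {'phase': 'warm', 'action': 'allocate', 'name': 'ERROR'}

# With the v8 Python client the request would be sent roughly as:
#   await client.ilm.move_to_step(index="logs-app-000042", **body)
```

Deriving current_step from a fresh explain call, rather than hard-coding it, means the request is rejected if the index has moved on in the meantime, which is the safe outcome.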

An index is drifting but not in `ERROR`. Is that a real problem?

Often, yes. An index that is not errored but has been in the same step far longer than the phase should take is usually blocked on a precondition ILM treats as a wait rather than a failure — most commonly a WAITING allocation because the target tier lacks capacity or crossed a disk watermark. ILM will not raise an error for this; it simply keeps waiting. That is exactly why the monitor tracks time-in-step from step_time_millis and alerts on drift independently of the ERROR check: a silent stall can breach a retention SLO just as badly as a hard failure.

Related

Handling ILM step execution failures programmatically: classifying retryable versus terminal failures and the backoff strategy behind a safe retry.
Automating phase transitions with Python — turning monitoring alerts into webhook-driven, retry-aware remediation.
Building custom ILM policies via the API — the phase and action definitions whose execution this page observes.
Configuring index rollover conditions — tuning the hot-phase thresholds that a rollover-drift alert is measured against.
Fallback routing for data retention — keeping shards allocated so an allocation step never silently stalls.
