Troubleshooting ILM Phase Transition Delays

When an Elasticsearch index refuses to roll over or move from hot to warm long after its min_age has elapsed, the cause is almost always a specific, readable blocking step — not a random cluster hiccup — and GET <index>/_ilm/explain names it exactly.

Index Lifecycle Management evaluates policies as a deterministic state machine, so a “delay” is really an index parked on a step whose preconditions are not yet satisfied. This runbook sits at the operational edge of building custom ILM policies via API: once a policy is authored, bound to a template, and bootstrapped, this is how you keep it moving when a transition stalls. It complements the deployment-time checks in that guide and the broader discipline of ILM policy design and lifecycle synchronization, where the goal is to keep phase state consistent across a multi-node cluster. A stalled transition is not benign: indices pinned in hot inflate the write tier, breach retention SLAs, and degrade query performance until they advance.

Prerequisites

Elasticsearch 8.x cluster reachable over HTTPS, with the target data_hot / data_warm / data_cold node roles (or legacy node.attr.data attributes) actually assigned to nodes.
Cluster privilege manage_ilm plus monitor for the diagnostic calls, and manage on the affected indices for alias and reroute fixes — provisioned per securing ILM policies with RBAC.
elasticsearch-py v8.0+ for the automated recovery pattern below (v8 client surface: ilm.explain_lifecycle, ilm.retry, typed ApiError — no legacy body=).
Access to GET _ilm/explain, GET _cluster/health, GET _cat/shards, and GET _cat/allocation.

Why a Transition “Delays”: the Evaluation Loop

ILM does not react instantly. A background coordinator wakes every indices.lifecycle.poll_interval (default 10m) and, for each managed index, checks whether the current step’s preconditions hold. Three independent clocks therefore gate every move: the poll interval (when ILM next looks), the phase min_age (whether the index is old enough), and allocation readiness (whether shards are STARTED on eligible nodes). A transition that looks “stuck” is usually one of these three quietly saying not yet.
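To make the interaction of the first two clocks concrete, the sketch below computes the window in which a transition can first be noticed. It is a simplification under stated assumptions: the origin is the lifecycle_date_millis value that explain reports, shards are already allocated, and the waiting step is re-checked on every poll. The function name transition_window is hypothetical, not part of any client.

```python
from datetime import datetime, timedelta, timezone

def transition_window(origin, min_age, poll_interval=timedelta(minutes=10)):
    """Return (eligible_at, latest_check) for one phase transition.

    eligible_at  : the moment min_age is satisfied.
    latest_check : the latest moment the next poll should notice it,
                   assuming a poll fires at least once per poll_interval.
    """
    eligible_at = origin + min_age
    return eligible_at, eligible_at + poll_interval

# lifecycle_date_millis as reported by explain (epoch milliseconds, UTC).
origin = datetime.fromtimestamp(1715731200000 / 1000, tz=timezone.utc)
eligible_at, latest_check = transition_window(origin, timedelta(days=7))
print(eligible_at.isoformat())   # 2024-05-22T00:00:00+00:00
print(latest_check.isoformat())  # 2024-05-22T00:10:00+00:00
```

An index still in its old phase well after latest_check is the case worth investigating; anything inside the window is normal polling latency.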

The single source of truth for which gate is holding is the step_info object returned by explain. Everything below is about reading it correctly and applying the narrowest safe correction.

Step 1: Establish a Diagnostic Baseline

Before touching a policy, confirm the deployment can allocate shards at all. ILM will not advance an index with pending primary assignments:

GET _cluster/health?wait_for_status=yellow&timeout=5s

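A workable response reports "status": "green" or "status": "yellow" with "timed_out": false. If it returns "status": "red", at least one primary shard is unassigned: resolve shard allocation first, because a red cluster can stall every phase, not just one index. Unassigned replicas under a yellow status (unassigned_shards > 0) do not stop ILM as a whole, but they can hold allocation-sensitive steps, so note the count for Step 2.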
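A small helper can turn that response into a list of findings for a pre-flight check. This is a sketch over an already-parsed health body (a plain dict); the function name health_findings and the sample values are hypothetical, while the field names are those of the _cluster/health response.

```python
def health_findings(health):
    """Summarise what a _cluster/health body implies for ILM progress."""
    findings = []
    if health.get("timed_out"):
        findings.append("timed out waiting for yellow")
    if health.get("status") == "red":
        findings.append("red: at least one primary is unassigned, fix allocation first")
    unassigned = health.get("unassigned_shards", 0)
    if unassigned:
        findings.append(f"{unassigned} unassigned shard(s) to investigate")
    return findings

# Hypothetical response body, trimmed to the fields the function reads.
sample = {"status": "red", "timed_out": True, "unassigned_shards": 4}
for line in health_findings(sample):
    print(line)
```

An empty list means the cluster itself is not the obstacle, and the per-index explain output is the next place to look.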

Next, extract the exact lifecycle state for the affected index:

GET logs-app-prod-2024.05.15/_ilm/explain?human

A healthy, still-progressing index looks like this — note step_info is null and the step is a normal waiting step, not ERROR:

{
  "indices": {
    "logs-app-prod-2024.05.15": {
      "index": "logs-app-prod-2024.05.15",
      "managed": true,
      "policy": "logs-retention-policy",
      "lifecycle_date_millis": 1715731200000,
      "phase": "hot",
      "action": "rollover",
      "step": "check-rollover-ready",
      "step_info": null
    }
  }
}

A parked index instead exposes a populated step_info (or a step of ERROR). That payload maps one-to-one to the fixes in Step 2. If managed is false, the index never attached to a policy at all — that is a binding problem, not a transition delay; go back to building custom ILM policies via API and confirm the index template injected index.lifecycle.name.

Step 2: Isolate the Blocking Step

The following vectors account for the large majority of real-world ILM stalls. Match your step_info to a row, then apply the matching fix in Step 3.

Symptom in `step_info` / `step`	Root cause	Remediation
`illegal_argument_exception`, “does not have a rollover alias … or multiple indices match”	Write alias resolves to zero or several indices with `is_write_index: true`	Realign the alias so exactly one index is the write target, then `_ilm/retry`
Parked on `wait-for-active-shards` or `check-shrink-allocation`	Primaries not `STARTED`, or replicas blocked by a disk watermark	Free disk / fix allocation, wait for shards to stabilise, then retry
Index moves too early or too late versus `min_age`	Misconfigured `index.lifecycle.origination_date` or clock skew	Correct the origination date; keep node clocks in sync
Stuck in `hot` past warm `min_age`, no error	Destination tier has no eligible node with capacity	Assign the tier role / add capacity, or apply fallback routing
`step: ERROR` with `security_exception`	Automation identity lacks a required privilege	Grant the missing privilege via RBAC, then retry
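The table can be expressed as a lookup over one entry of the explain response, which is handy when triaging many indices at once. The sketch below is illustrative only: triage is a hypothetical helper, it matches on the exception type and step name rather than on message text, and it cannot see the origination-date or tier-capacity rows because those produce no error.

```python
def triage(state):
    """Map one entry of explain["indices"] to a remediation from the table."""
    if not state.get("managed"):
        return "unmanaged: fix the policy binding, not the transition"
    info = state.get("step_info") or {}
    kind, step = info.get("type"), state.get("step")
    if kind == "security_exception":
        return "grant the missing privilege, then retry"
    if kind == "illegal_argument_exception":
        return "realign the write alias, then retry"
    if step in ("wait-for-active-shards", "check-shrink-allocation"):
        return "fix allocation or free disk, then wait for shards to settle"
    if step == "ERROR":
        return "unrecognised error: read step_info.reason"
    return "no blocker reported: check min_age, origination date, and tier capacity"

# Hypothetical entry shaped like the explain output for a failed rollover check.
entry = {
    "managed": True,
    "step": "ERROR",
    "step_info": {"type": "illegal_argument_exception"},
}
print(triage(entry))  # realign the write alias, then retry
```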

Rollover alias collision

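The check-rollover-ready step requires the alias named in index.lifecycle.rollover_alias to point at the index and to resolve to exactly one write index. If the alias is missing from the index, or it spans several indices and the index ILM is evaluating is not the one flagged is_write_index: true, ILM blocks with a payload along these lines (the exact reason text varies with the cause and the Elasticsearch version):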

{
  "step_info": {
    "type": "illegal_argument_exception",
    "reason": "index [logs-app-prod-2024.05.15] does not have a rollover alias [logs-app-prod] or multiple indices match"
  }
}

Note that rollover conditions have no precedence among themselves: when max_age, max_docs, and max_primary_shard_size are all defined they are evaluated as a logical OR, and rollover fires as soon as any one is satisfied. Exceeding max_primary_shard_size triggers a rollover at the next poll, regardless of max_age, so a “premature” rollover is usually correct behaviour, not a bug. Calibrate thresholds with that OR semantics in mind.
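The OR semantics can be sketched in a few lines. The example assumes both the thresholds and the observed values have already been normalised to plain numbers (seconds, document count, bytes); rollover_due and the sample figures are hypothetical and only illustrate the evaluation rule, not how Elasticsearch gathers the statistics.

```python
def rollover_due(observed, conditions):
    """Return the names of the conditions that are met; any one is enough."""
    return [
        name for name, limit in conditions.items()
        if observed.get(name, 0) >= limit
    ]

conditions = {
    "max_age": 7 * 86400,                    # seconds
    "max_docs": 200_000_000,                 # documents
    "max_primary_shard_size": 50 * 1024**3,  # bytes
}
# A six-hour-old index whose largest primary has already passed 50 GiB.
observed = {
    "max_age": 6 * 3600,
    "max_docs": 12_000_000,
    "max_primary_shard_size": 51 * 1024**3,
}
met = rollover_due(observed, conditions)
print(met)        # ['max_primary_shard_size']
print(bool(met))  # True
```

The index is far younger than max_age, yet it rolls over because one condition alone is sufficient.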

Shard stability and disk watermarks

Allocation-sensitive steps such as wait-for-active-shards (rollover) and check-shrink-allocation (shrink) need primaries STARTED and replica placement that respects cluster.routing.allocation.disk.watermark.low. Pending relocations or an unbalanced distribution pause the transition. Inspect the shards directly:

GET _cat/shards/logs-app-prod-2024.05.15?v&h=index,shard,prirep,state,node

Any INITIALIZING or RELOCATING state means the deployment is still rebalancing; ILM defers until it settles. This is the same shard allocation pressure that governs where each phase’s data lands in the hot-warm-cold tiers — a warm transition cannot complete if no warm node has room.
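When the same request is made with format=json, the rows can be filtered programmatically. A minimal sketch, assuming the response has been parsed into a list of dicts keyed by the requested columns; unsettled_shards, the node names data-node-01 and data-node-02, and the sample rows are hypothetical.

```python
def unsettled_shards(rows):
    """Return a label for every shard copy that is not STARTED."""
    return [
        f'{row["index"]}[{row["shard"]}] {row["prirep"]} {row["state"]}'
        for row in rows
        if row["state"] != "STARTED"
    ]

# Hypothetical rows in the shape of _cat/shards with format=json.
rows = [
    {"index": "logs-app-prod-2024.05.15", "shard": "0", "prirep": "p",
     "state": "STARTED", "node": "data-node-01"},
    {"index": "logs-app-prod-2024.05.15", "shard": "1", "prirep": "r",
     "state": "RELOCATING", "node": "data-node-02"},
    {"index": "logs-app-prod-2024.05.15", "shard": "2", "prirep": "r",
     "state": "UNASSIGNED", "node": None},
]
for label in unsettled_shards(rows):
    print(label)
```

An empty result is the signal that allocation has settled and ILM can re-evaluate the step.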

Origination date and `min_age` accounting

ILM measures min_age from each index’s lifecycle origination time: by default the index creation time, or the value of index.lifecycle.origination_date when it is set (also derivable from the index name when index.lifecycle.parse_origination_date is enabled). Once an index has rolled over, the min_age of later phases is measured from the rollover time instead, so an index that rolled over late also reaches warm late. A backfilled or wrong origination_date makes an index look older or younger than it is, causing premature or delayed transitions. There is no per-node validation that rejects transitions on skew, but keeping cluster clocks aligned with NTP or chronyd keeps min_age accounting consistent across nodes.
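The effect of the origin on apparent age is easy to see with a small calculation. The sketch below is a simplification, not Elasticsearch's implementation: lifecycle_origin is a hypothetical helper, the name pattern (a yyyy.MM.dd date followed by an optional numeric suffix) and the choice to let the parsed name take priority are assumptions, and rollover time is ignored.

```python
import re
from datetime import datetime, timezone

# Assumed name shape: anything, a dash, yyyy.MM.dd, then an optional -<digits>.
NAME_DATE = re.compile(r"^.*-(\d{4}\.\d{2}\.\d{2})(?:-\d+)?$")

def lifecycle_origin(index_name, creation_ms, origination_ms=-1, parse_name=False):
    """Pick the timestamp that min_age would be measured from (simplified)."""
    if parse_name:
        match = NAME_DATE.match(index_name)
        if match is None:
            raise ValueError(f"{index_name} carries no yyyy.MM.dd date")
        parsed = datetime.strptime(match.group(1), "%Y.%m.%d")
        return parsed.replace(tzinfo=timezone.utc)
    chosen = origination_ms if origination_ms >= 0 else creation_ms
    return datetime.fromtimestamp(chosen / 1000, tz=timezone.utc)

now = datetime(2024, 5, 25, tzinfo=timezone.utc)
created = 1716508800000  # hypothetical: index backfilled on 2024-05-24

by_creation = now - lifecycle_origin("logs-app-prod-2024.05.15", created)
by_name = now - lifecycle_origin("logs-app-prod-2024.05.15", created, parse_name=True)
print(by_creation.days, by_name.days)  # 1 10
```

The same index is one day old by creation time and ten days old by its name, which is the difference between staying in hot and being overdue for warm under a seven-day min_age.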

Step 3: Apply the Narrowest Safe Fix

Do not bypass the state machine blindly. Apply targeted, auditable corrections and log every one.

Realign the write alias

Remove conflicting write indices and reassign the alias deterministically so exactly one index is writable:

POST _aliases
{
  "actions": [
    { "remove": { "index": "logs-app-prod-2024.05.14", "alias": "logs-app-prod" } },
    { "remove": { "index": "logs-app-prod-2024.05.15", "alias": "logs-app-prod" } },
    { "add": { "index": "logs-app-prod-2024.05.15", "alias": "logs-app-prod", "is_write_index": true } }
  ]
}
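For automation, the same request body can be generated from the list of indices that currently hold the alias, so the remove and add actions always travel together in one call. A minimal sketch; realign_actions is a hypothetical helper and the holder list would normally come from an alias lookup that is not shown here.

```python
import json

def realign_actions(holders, alias, write_index):
    """Build an _aliases body: detach every holder, then attach one write index."""
    actions = [
        {"remove": {"index": name, "alias": alias}}
        for name in sorted(set(holders))
    ]
    actions.append(
        {"add": {"index": write_index, "alias": alias, "is_write_index": True}}
    )
    return {"actions": actions}

body = realign_actions(
    holders=["logs-app-prod-2024.05.15", "logs-app-prod-2024.05.14"],
    alias="logs-app-prod",
    write_index="logs-app-prod-2024.05.15",
)
print(json.dumps(body, indent=2))
```

Sorting and de-duplicating the holders keeps the payload deterministic, which makes repeated runs easy to diff in an audit log.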

Retry the failed step

Once the underlying condition is resolved, re-run the failed step. retry resumes the same step and still respects min_age — it does not skip ahead:

POST logs-app-prod-2024.05.15/_ilm/retry

If shards remain unassigned after a node failure, restore allocation with a targeted reroute before retrying ILM:

POST _cluster/reroute
{
  "commands": [
    {
      "allocate_replica": {
        "index": "logs-app-prod-2024.05.15",
        "shard": 2,
        "node": "data-node-03"
      }
    }
  ]
}

Where a tier legitimately fills up, an ordered fallback routing strategy for data retention keeps shards allocated instead of leaving the transition indefinitely deferred.

Step 4: Automate Recovery with elasticsearch-py v8

Manual intervention does not scale across a fleet of daily indices. The idempotent pattern below detects blocked steps, corrects the alias, retries ILM, and verifies advancement — using the v8 client surface. It reuses the same scoped identity you deploy policies with, and pairs naturally with the scheduled orchestration in automating phase transitions with Python.

import logging
import os
from elasticsearch import Elasticsearch, ApiError

logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s")
logger = logging.getLogger("ilm_recovery")

def recover_blocked_ilm(client: Elasticsearch, index_pattern: str, alias_name: str, write_index: str):
    """Detect alias-collision stalls, realign the write alias, and retry. Safe to re-run."""
    try:
        explain = client.ilm.explain_lifecycle(index=index_pattern)

        # Only act on indices in the ERROR step whose step_info names an
        # alias/argument failure; _ilm/retry rejects indices that are not in ERROR.
        blocked = [
            idx for idx, state in explain["indices"].items()
            if state.get("step") == "ERROR"
            and (state.get("step_info") or {}).get("type") == "illegal_argument_exception"
        ]
        if not blocked:
            logger.info("No blocked ILM steps for pattern: %s", index_pattern)
            return

        # Realign once for the whole pattern: remove the alias from every
        # matching index, then attach it to the single write index.
        logger.warning("Correcting alias collision for %s", ", ".join(blocked))
        client.indices.update_aliases(actions=[
            {"remove": {"index": index_pattern, "alias": alias_name}},
            {"add": {"index": write_index, "alias": alias_name, "is_write_index": True}},
        ])

        for idx in blocked:
            # Re-run the failed step now the precondition holds (respects min_age).
            client.ilm.retry(index=idx)
            logger.info("Triggered ILM retry for %s", idx)

        # Verify every previously blocked index. ILM works asynchronously, so a
        # block that is still visible here may clear on the next scheduled run.
        verify = client.ilm.explain_lifecycle(index=",".join(blocked))
        for idx in blocked:
            step_info = verify["indices"].get(idx, {}).get("step_info")
            if not step_info:
                logger.info("ILM transition restored for %s.", idx)
            else:
                logger.error("Still blocked: %s %s", idx, step_info)

    except ApiError as exc:
        logger.error("Elasticsearch API failure during recovery: %s %s", exc.meta.status, exc.body)
        raise

if __name__ == "__main__":
    es = Elasticsearch(
        hosts=[os.getenv("ES_HOST", "https://localhost:9200")],
        api_key=os.getenv("ES_API_KEY"),  # scoped manage_ilm + manage on the indices
        verify_certs=True,
    )
    recover_blocked_ilm(es, "logs-app-prod-*", "logs-app-prod", "logs-app-prod-2024.05.15")

Schedule it via cron or a Kubernetes CronJob with exponential backoff so a persistent failure does not thrash the API. update_aliases applies the remove/add atomically, so the alias is never briefly ownerless.
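If the scheduler has no backoff of its own, a thin wrapper inside the job can provide it. The sketch below is one possible shape, not a prescribed implementation: with_backoff is a hypothetical helper, it catches every Exception for brevity (a real job would likely narrow this to the client's error types), and it omits jitter.

```python
import time

def with_backoff(action, attempts=5, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call action() until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(min(max_delay, base_delay * 2 ** attempt))

# Demonstration with a stand-in action that fails twice, then succeeds.
calls = {"count": 0}
delays = []

def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result = with_backoff(flaky, sleep=delays.append)  # record waits instead of sleeping
print(result, delays)  # ok [1.0, 2.0]
```

In the job itself the action would be a call such as lambda: recover_blocked_ilm(es, pattern, alias, write_index), with the default sleep left in place.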

Verification

Confirm the index cleared its block and is advancing on schedule. First re-read the lifecycle state:

GET logs-app-prod-2024.05.15/_ilm/explain?filter_path=indices.*.phase,indices.*.step,indices.*.step_info

A restored index reports step_info absent (null) and a normal step:

{
  "indices": {
    "logs-app-prod-2024.05.15": {
      "phase": "hot",
      "step": "check-rollover-ready"
    }
  }
}

Then confirm the operator itself is running, not paused:

GET _ilm/status

An "operation_mode": "RUNNING" response proves ILM is actively polling — a STOPPED or STOPPING mode explains a deployment-wide freeze that no per-index retry will fix. For ongoing signal rather than a spot check, wire these calls into monitoring ILM execution and error states.
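Because ILM advances asynchronously, a single explain call right after a fix can still show the old block. A bounded polling loop is one way to verify, sketched below under assumptions: wait_until_unblocked is a hypothetical helper, fetch_state is any callable that returns one entry of explain["indices"], and the canned states stand in for successive API responses.

```python
import time

def wait_until_unblocked(fetch_state, attempts=6, delay=10.0, sleep=time.sleep):
    """Poll one explain entry until it is neither in ERROR nor carrying step_info."""
    state = {}
    for attempt in range(attempts):
        state = fetch_state()
        if state.get("step") != "ERROR" and not state.get("step_info"):
            return True, state
        if attempt < attempts - 1:
            sleep(delay)  # wait before the next look, but not after the last one
    return False, state

# Canned states standing in for successive explain calls.
canned = iter([
    {"phase": "hot", "step": "ERROR", "step_info": {"type": "illegal_argument_exception"}},
    {"phase": "hot", "step": "check-rollover-ready"},
])
cleared, last = wait_until_unblocked(lambda: next(canned), sleep=lambda seconds: None)
print(cleared, last["step"])  # True check-rollover-ready
```

With the v8 client, fetch_state could be lambda: es.ilm.explain_lifecycle(index=name)["indices"][name], where es and name are the client and index from Step 4.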

Gotchas and Edge Cases

retry respects min_age; move does not. Use POST <index>/_ilm/retry to resume a step that failed into ERROR after you fixed the cause. Reserve POST _ilm/move/<index> for deliberately jumping to a different step, and only after a verified fix — a wrong target step can strand the index.
A check-rollover-ready step is usually not a bug. With step_info: null, the index is simply waiting for one of its OR-evaluated rollover conditions to trip. Do not “fix” it by lowering thresholds unless the write tier is genuinely saturated.
A globally STOPPED operator masks every per-index diagnosis. If _ilm/status is not RUNNING, run POST _ilm/start (someone likely paused ILM for maintenance) before chasing individual step_info payloads.
Master-election churn interrupts ILM execution. ILM runs on the elected master and keeps its per-index state in the cluster state, so frequent leader changes (for example from a bad discovery.seed_hosts / cluster.initial_master_nodes config) can make transitions appear to stall intermittently; stabilise the master nodes first, since no alias fix survives a flapping cluster.
flood_stage watermark makes indices read-only. Once disk crosses cluster.routing.allocation.disk.watermark.flood_stage, Elasticsearch applies the index.blocks.read_only_allow_delete block to affected indices and shrink/rollover cannot proceed. Free disk first: Elasticsearch 7.4 and later release the block automatically once usage falls back below the high watermark, so clear it manually only if it persists, then retry.

FAQ

My index passed its min_age hours ago but still hasn't transitioned — why?

min_age being satisfied only makes an index eligible; the transition still waits on the next indices.lifecycle.poll_interval tick (10 minutes by default) and on allocation readiness. Run GET <index>/_ilm/explain: if step_info is null and the step is a normal waiting step, it will move on the next poll. If step_info is populated, that message names the real blocker.

What's the difference between _ilm/retry and _ilm/move?

_ilm/retry re-runs the current step for an index sitting in the ERROR state after you have corrected its root cause; it resumes normal progression and still honours min_age. _ilm/move deliberately relocates an index from one explicit step to another, bypassing normal progression. move can strand an index if the target step is wrong, so use it only for verified manual interventions.

Does _ilm/retry skip the min_age wait?

No. retry only re-attempts the failed step; it does not fast-forward the lifecycle clock. If you need an index to advance before its min_age elapses, that is what _ilm/move is for — and it should be reserved for corrective, one-off actions, not routine acceleration.

Every index across the deployment stopped transitioning at once. Where do I look first?

A deployment-wide freeze is almost never a per-index problem. Check GET _ilm/status — if operation_mode is STOPPED or STOPPING, ILM was paused (often for maintenance); run POST _ilm/start. If it is RUNNING, check GET _cluster/health and GET _cat/allocation?v for a red status, unassigned shards, or a breached flood_stage disk watermark that has frozen allocation everywhere at once.

Related

Building custom ILM policies via API: the deployment-time checks that prevent most of these stalls before they happen.
Automating phase transitions with Python: schedule the idempotent recovery pattern from Step 4 across a fleet.
Monitoring ILM execution and error states: turn one-off explain checks into continuous alerting on stuck steps.
Configuring index rollover conditions: the OR-evaluated max_age / max_docs / max_primary_shard_size triggers behind check-rollover-ready.
