Troubleshooting ILM Phase Transition Delays

Index Lifecycle Management (ILM) stalls are deterministic state machine failures, not random anomalies. When phase transitions halt, the cluster enters a degraded compliance posture, risking unbounded storage growth, retention policy violations, and search degradation. For search engineers, log analytics teams, and DevOps operators, resolving these delays requires strict adherence to diagnostic baselines, exact state interrogation, and idempotent recovery workflows. This guide provides the definitive operational protocol for Troubleshooting ILM Phase Transition Delays.

flowchart TD
  A["explain_lifecycle"] --> B{"step value"}
  B -->|"ERROR: alias collision"| C["Realign write alias"]
  B -->|"allocation wait"| D["Check tiers and watermarks"]
  B -->|"check-rollover-ready"| E["Normal: waiting for an OR condition"]
  C --> R["retry(index)"]
  D --> R

Phase 1: Establish Diagnostic Baseline

ILM evaluates policies on a background thread governed by indices.lifecycle.poll_interval (default 10m). Transitions only occur when all preconditions for the current step are satisfied. Before attempting remediation, capture the cluster allocation state and isolate the exact blocking step.

Execute a synchronous health check to confirm routing stability:

GET _cluster/health?wait_for_status=yellow&timeout=5s

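A compliant response returns "status": "green" or "status": "yellow" with "timed_out": false. If the status is red, primary shards are unassigned: resolve the allocation bottleneck before proceeding, because ILM will not advance indices with pending primary assignments. A yellow status means only replicas are unassigned, which is why unassigned_shards can be greater than zero in an otherwise usable cluster; investigate those shards if they belong to the stalled index.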
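The health gate above can be expressed as a small pure function over the _cluster/health response body. This is a minimal sketch: the helper name health_findings and the wording of its messages are illustrative assumptions, and only the response fields status, timed_out, and unassigned_shards come from the health API.

```python
def health_findings(health: dict) -> list[str]:
    """Return human-readable findings for a _cluster/health body (hypothetical helper).

    An empty list means nothing in the response blocks ILM troubleshooting.
    """
    findings = []
    if health.get("timed_out"):
        # wait_for_status=yellow was not reached before the timeout expired
        findings.append("timed out waiting for yellow")
    status = health.get("status")
    if status not in ("green", "yellow"):
        findings.append(f"status is {status}")
    unassigned = health.get("unassigned_shards", 0)
    if unassigned > 0:
        findings.append(f"{unassigned} unassigned shards")
    return findings
```

With the official client, the input would be the body returned by client.cluster.health(wait_for_status="yellow", timeout="5s").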

Next, extract the precise lifecycle state:

GET logs-app-prod-2024.05.15/_ilm/explain?human

A healthy, transitioning index returns:

{
  "indices": {
    "logs-app-prod-2024.05.15": {
      "index": "logs-app-prod-2024.05.15",
      "managed": true,
      "policy": "logs-retention-policy",
      "lifecycle_date_millis": 1715731200000,
      "phase": "hot",
      "action": "rollover",
      "step": "check-rollover-ready",
      "step_info": null
    }
  }
}

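A delayed transition surfaces explicit failure metadata in step_info. When a step fails outright, the index is parked in the ERROR step and failed_step names the step that failed; steps that are merely waiting (for example on shard allocation) stay in place and report their progress in step_info. This payload is the single source of truth for remediation.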
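As a rough illustration of reading that payload, the sketch below scans an explain response and lists the indices whose step has failed. The function name find_stalled and the shape of the returned records are assumptions made for this example; the fields it reads (managed, phase, step, failed_step, step_info) are those reported by the explain API.

```python
def find_stalled(explain: dict) -> list[dict]:
    """List managed indices whose current ILM step has failed (hypothetical helper)."""
    stalled = []
    for name, state in explain.get("indices", {}).items():
        info = state.get("step_info") or {}
        # A failure is either the ERROR step or an exception type in step_info.
        failed = state.get("step") == "ERROR" or "type" in info
        if state.get("managed") and failed:
            stalled.append(
                {
                    "index": name,
                    "phase": state.get("phase"),
                    # failed_step is present once ILM has moved the index to ERROR
                    "step": state.get("failed_step") or state.get("step"),
                    "reason": info.get("reason"),
                }
            )
    return stalled
```

Indices that are healthy or simply waiting on a condition are left out, so the result can feed directly into a remediation queue.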

Phase 2: Isolate Step-Level Failure Vectors

Parse step_info to map the exact blocking condition. The following vectors account for most ILM stalls:

1. Rollover Alias Collision

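The check-rollover-ready step requires index.lifecycle.rollover_alias to point at the index being evaluated, and that index to be the alias's write index. Elasticsearch rejects alias updates that would leave one alias with two write indices, so in practice the collision appears as the alias pointing at a different index, or as another index holding is_write_index: true. ILM stays blocked until the alias is corrected. The exact message varies by version; a typical payload is:

"step_info": {
  "type": "illegal_argument_exception",
  "reason": "index [logs-app-prod-2024.05.15] is not the write index for alias [logs-app-prod]"
}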

Rollover conditions have no precedence: when max_age, max_docs, and max_primary_shard_size are all defined, they are evaluated as a logical OR, and rollover fires as soon as any one of them is satisfied. Exceeding max_primary_shard_size therefore triggers a rollover immediately, regardless of max_age. Calibrate thresholds with these OR semantics in mind, as covered in Building Custom ILM Policies via API.
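The OR evaluation can be modeled in a few lines. This is a simplified sketch, not Elasticsearch's implementation: the function name satisfied_conditions is hypothetical, and it assumes thresholds and observed values have already been normalized to plain numbers (seconds, document counts, bytes).

```python
def satisfied_conditions(conditions: dict, observed: dict) -> list[str]:
    """Return the rollover conditions that are currently met (simplified model).

    Rollover is due as soon as the returned list is non-empty: the conditions
    are combined with OR, so a single match is enough.
    """
    return [
        name
        for name, limit in conditions.items()
        # Thresholds are treated as inclusive in this sketch.
        if name in observed and observed[name] >= limit
    ]


conditions = {
    "max_age": 7 * 86400,                      # 7 days, in seconds
    "max_docs": 100_000_000,
    "max_primary_shard_size": 50 * 1024**3,    # 50 GiB, in bytes
}
observed = {
    "max_age": 2 * 86400,                      # index is only 2 days old
    "max_docs": 12_000_000,
    "max_primary_shard_size": 51 * 1024**3,    # largest primary is 51 GiB
}
due = satisfied_conditions(conditions, observed)  # only the shard-size condition matches
```

Here the index is well under its age and document limits, yet rollover is still due because the shard-size condition alone is met.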

2. Shard Stability & Disk Watermarks

Allocation-sensitive steps such as wait-for-active-shards (rollover) and check-shrink-allocation (shrink) require primary shards to be STARTED and replica allocation to respect cluster.routing.allocation.disk.watermark.low. Unbalanced shard distribution or pending relocations will pause transitions. Verify with:

GET _cat/shards/logs-app-prod-2024.05.15?v&h=index,shard,prirep,state,node

Any INITIALIZING or RELOCATING states indicate active rebalancing. ILM will defer until the cluster stabilizes.
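To automate the check above, the same _cat/shards request can be issued with format=json and the rows filtered for anything that is not STARTED. The sketch below is illustrative only: the helper name unstable_shards and the grouping into primaries and replicas are assumptions, while the column names match the h= list used in the request.

```python
def unstable_shards(rows: list[dict]) -> dict:
    """Group shards that are not STARTED by primary/replica role (hypothetical helper).

    `rows` is the JSON form of _cat/shards, one dict per shard copy.
    """
    report = {"primaries": [], "replicas": []}
    for row in rows:
        if row.get("state") != "STARTED":
            # prirep is "p" for a primary and "r" for a replica
            key = "primaries" if row.get("prirep") == "p" else "replicas"
            report[key].append(row)
    return report
```

A non-empty primaries list is the stronger signal, since allocation-sensitive steps cannot complete while primaries are still moving or initializing.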

3. Origination Date & min_age Accounting

ILM measures min_age from each index’s lifecycle origination time: by default the index creation time (or, once an index has rolled over, the rollover time), or the value of the optional index.lifecycle.origination_date setting (also derivable from the index name when index.lifecycle.parse_origination_date is enabled). The effective value is reported as lifecycle_date_millis in the explain output. A misconfigured or backfilled origination_date makes indices appear older or younger than they are, causing premature or delayed phase transitions. There is no per-node “timestamp validation” that rejects transitions, but keeping cluster clocks in sync with RFC 5905 compliant NTP implementations or chronyd (e.g. makestep 1.0 3) keeps min_age accounting consistent across the cluster.
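The age arithmetic is easy to reproduce when checking whether an index should already have moved. The sketch below assumes the origination time is taken from lifecycle_date_millis in the explain output and that min_age has been converted to a timedelta; the function name phase_due is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def phase_due(lifecycle_date_millis: int, min_age: timedelta, now: datetime) -> bool:
    """Return True once the index is at least `min_age` old (illustrative only)."""
    origin = datetime.fromtimestamp(lifecycle_date_millis / 1000, tz=timezone.utc)
    return now - origin >= min_age


# lifecycle_date_millis from the explain example: 2024-05-15T00:00:00Z
origin_ms = 1715731200000
before = phase_due(origin_ms, timedelta(days=7), datetime(2024, 5, 21, 23, 59, tzinfo=timezone.utc))
after = phase_due(origin_ms, timedelta(days=7), datetime(2024, 5, 22, 0, 0, tzinfo=timezone.utc))
```

If this returns True but the index has not moved after at least one poll interval, the delay lies elsewhere (a failed step or an allocation wait) rather than in min_age.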

Phase 3: Execute Safe Manual Interventions

Do not bypass ILM state machines blindly. Apply targeted, auditable corrections.

Alias Realignment

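Remove the alias from the conflicting indices and reassign it to a single write index. All actions in one _aliases request are applied atomically, so the alias is never left without a target. Note that logs-app-prod-2024.05.14 drops out of the alias entirely in this example; re-add it with is_write_index: false if searches through the alias must still cover it.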

POST _aliases
{
  "actions": [
    { "remove": { "index": "logs-app-prod-2024.05.14", "alias": "logs-app-prod" } },
    { "remove": { "index": "logs-app-prod-2024.05.15", "alias": "logs-app-prod" } },
    { "add": { "index": "logs-app-prod-2024.05.15", "alias": "logs-app-prod", "is_write_index": true } }
  ]
}
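For more than a couple of indices, the request body is easier to generate than to write by hand. The following sketch builds the same structure as the request above; the function name realign_actions and its parameters are illustrative assumptions.

```python
def realign_actions(alias: str, write_index: str, holders: list[str]) -> dict:
    """Build an _aliases body that strips `alias` from `holders` and then
    re-adds it to `write_index` as the sole write index (hypothetical helper)."""
    actions = [{"remove": {"index": index, "alias": alias}} for index in holders]
    actions.append(
        {"add": {"index": write_index, "alias": alias, "is_write_index": True}}
    )
    return {"actions": actions}


body = realign_actions(
    "logs-app-prod",
    "logs-app-prod-2024.05.15",
    ["logs-app-prod-2024.05.14", "logs-app-prod-2024.05.15"],
)
```

With the v8 Python client, the resulting list can be passed as client.indices.update_aliases(actions=body["actions"]).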

Step Retry & Manual Reroute

Once the underlying condition is resolved, re-run the failed step for the affected index:

POST logs-app-prod-2024.05.15/_ilm/retry

If shards remain unassigned due to node failure, execute a safe manual reroute to restore allocation before retrying ILM:

POST _cluster/reroute
{
  "commands": [
    {
      "allocate_replica": {
        "index": "logs-app-prod-2024.05.15",
        "shard": 2,
        "node": "data-node-03"
      }
    }
  ]
}

All interventions must be logged for compliance audits. Lifecycle synchronization across multi-node environments requires strict state tracking, as documented in ILM Policy Design & Lifecycle Synchronization.
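One lightweight way to meet the logging requirement is to emit a structured record for every intervention before it is sent. The sketch below is an assumption about what such a record could contain, not a prescribed audit format; the function name audit_line and the field names are hypothetical.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def audit_line(operator: str, method: str, path: str,
               body: Optional[dict], when: datetime) -> str:
    """Serialize one intervention as a JSON line (illustrative record layout)."""
    record = {
        "ts": when.astimezone(timezone.utc).isoformat(),
        "operator": operator,
        "method": method,
        "path": path,
        "body": body,
    }
    # sort_keys keeps the output stable, which simplifies diffing audit files
    return json.dumps(record, sort_keys=True)


line = audit_line(
    "oncall-operator",  # hypothetical operator id
    "POST",
    "logs-app-prod-2024.05.15/_ilm/retry",
    None,
    datetime(2024, 5, 15, 12, 0, tzinfo=timezone.utc),
)
```

Appending these lines to write-once storage gives a record that can be replayed or reviewed later.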

Phase 4: Deploy Idempotent Python v8+ Recovery

Manual intervention is insufficient for fleet-scale operations. Implement an automated, idempotent recovery pattern using the official elasticsearch Python client v8+. The following script detects blocked steps, applies corrective alias routing, triggers ILM retry, and verifies advancement.

import logging

from elasticsearch import ApiError, Elasticsearch

logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s")
logger = logging.getLogger("ilm_recovery")

def recover_blocked_ilm(client: Elasticsearch, index_pattern: str, alias_name: str) -> None:
    """Repoint the write alias at the newest blocked index, then retry ILM."""
    try:
        explain = client.ilm.explain_lifecycle(index=index_pattern)
        # Blocked = parked in the ERROR step with an illegal_argument_exception.
        # Sorting relies on date-suffixed names ordering chronologically.
        blocked = sorted(
            idx
            for idx, state in explain["indices"].items()
            if state.get("step") == "ERROR"
            and (state.get("step_info") or {}).get("type") == "illegal_argument_exception"
        )
        if not blocked:
            logger.info("No blocked ILM steps detected for pattern: %s", index_pattern)
            return

        write_index = blocked[-1]
        # Indices under the pattern that currently hold the alias.
        holders = client.indices.get_alias(index=index_pattern, name=alias_name)
        actions = [
            {"add": {"index": idx, "alias": alias_name, "is_write_index": False}}
            for idx in holders.body
            if idx != write_index
        ]
        actions.append(
            {"add": {"index": write_index, "alias": alias_name, "is_write_index": True}}
        )
        # One _aliases request, so demotion and promotion are applied atomically.
        logger.warning("Setting %s as the write index for alias %s", write_index, alias_name)
        client.indices.update_aliases(actions=actions)

        # Re-run the failed step on every blocked index.
        for idx in blocked:
            client.ilm.retry(index=idx)
            logger.info("Triggered ILM retry for %s", idx)

        # Verify advancement. ILM may need a poll interval to move on, so a
        # lingering step_info is a prompt to re-check, not proof of failure.
        verify = client.ilm.explain_lifecycle(index=",".join(blocked))
        still_blocked = {
            idx: state["step_info"]
            for idx, state in verify["indices"].items()
            if state.get("step_info")
        }
        if not still_blocked:
            logger.info("ILM transition restored successfully.")
        else:
            logger.error("ILM still blocked: %s", still_blocked)

    except ApiError as e:
        logger.error("Elasticsearch API failure during ILM recovery: %s", e.info)
    except Exception as e:
        logger.error("Unexpected recovery failure: %s", str(e))

if __name__ == "__main__":
    es = Elasticsearch(
        "https://es-cluster.internal:9200",
        basic_auth=("elastic", "REDACTED"),
        ca_certs="/etc/elasticsearch/certs/http_ca.crt",
        verify_certs=True
    )
    recover_blocked_ilm(es, "logs-app-prod-*", "logs-app-prod")

This pattern uses the official elasticsearch-py v8 client interface and ensures atomic alias updates before triggering lifecycle retries. Schedule execution via cron or Kubernetes CronJob with exponential backoff to prevent API thrashing.
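The exponential backoff mentioned above can be kept separate from the recovery logic. The wrapper below is a generic sketch: the name with_backoff, its defaults, and the injectable sleep parameter are assumptions for illustration. It only helps around callables that raise on failure, so it suits individual client calls rather than a function that logs and swallows its own exceptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(
    fn: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 2.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn` until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Waits of 2s, 4s, 8s, ... capped at max_delay
            sleep(min(max_delay, base_delay * 2 ** attempt))
    raise ValueError("attempts must be at least 1")
```

A call such as with_backoff(lambda: client.ilm.retry(index=idx)) would then space out repeated retries instead of issuing them back to back.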

Phase 5: Escalation Paths & Compliance Verification

If phase transitions remain stalled after alias correction, shard stabilization, and automated retries, escalate immediately. Persistent delays indicate deeper infrastructure failures:

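  • Master Election Storms: Verify discovery.seed_hosts, and confirm that cluster.initial_master_nodes is set only while bootstrapping a brand-new cluster. ILM runs on the elected master, so frequent leadership changes interrupt policy execution.
  • Hardware Degradation: Check GET _cat/allocation?v for nodes whose disk usage is approaching cluster.routing.allocation.disk.watermark.flood_stage. That API reports disk usage rather than I/O latency, so investigate slow or failing drives with host-level tooling.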
  • Policy Corruption: Validate JSON syntax and phase dependencies. Malformed phases objects cause silent evaluation drops.

Upon successful restoration, run a compliance verification sweep:

GET _ilm/status
GET logs-app-prod-*/_ilm/explain?filter_path=indices.*.phase,indices.*.step,indices.*.step_info

Confirm that _ilm/status reports "operation_mode": "RUNNING", that all indices report step_info: null, and that they advance through hot → warm → cold → delete within defined SLAs. Maintain immutable audit logs of all _aliases, _cluster/reroute, and _ilm/retry invocations for regulatory review. ILM state machines are unforgiving of manual overrides; enforce strict change control and automated validation to prevent recurrence.
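The sweep can be condensed into a single report for dashboards or CI checks. The sketch below takes the two response bodies from the requests above; the function name sweep_report and the report layout are assumptions, while operation_mode, phase, step, and step_info are fields of the respective APIs.

```python
from collections import Counter


def sweep_report(status: dict, explain: dict) -> dict:
    """Summarize an ILM verification sweep (hypothetical helper).

    `status` is the body of GET _ilm/status and `explain` the body of the
    explain request for the index pattern.
    """
    indices = explain.get("indices", {})
    return {
        "ilm_running": status.get("operation_mode") == "RUNNING",
        "phase_counts": dict(Counter(s.get("phase") for s in indices.values())),
        # An index counts as blocked if it sits in ERROR or still carries step_info
        "blocked": sorted(
            name
            for name, s in indices.items()
            if s.get("step") == "ERROR" or s.get("step_info")
        ),
    }
```

A clean sweep has ilm_running set to True and an empty blocked list; anything else should reopen the investigation.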