Automating Phase Transitions with Python

Index Lifecycle Management moves an index through hot, warm, cold, and delete states on a schedule that Elasticsearch evaluates on a fixed timer — and that timer is the problem. Native ILM wakes on indices.lifecycle.poll_interval (ten minutes by default) and only then re-checks whether a phase’s conditions are met, so during a peak ingestion window an index that crossed its rollover threshold can sit in hot for far longer than its policy implies, and a stalled step gives you no signal until you go looking for it. Automating phase transitions with Python replaces that passive wait with an active, state-aware loop: you read each index’s real lifecycle state from _ilm/explain, decide against your own thresholds whether it should advance, and force the transition the moment its preconditions hold. This turns a lifecycle from a set of timers into a programmable workflow, and it is where the intent captured in ILM policy design and lifecycle synchronization becomes measurable cluster behaviour rather than a best-effort schedule.

The operational value is decoupling policy definition from execution timing. A Python orchestrator can weigh index age, segment count, and disk-watermark pressure together, trigger _ilm/explain diagnostics, force a specific step, or fan out a reindex pipeline when it detects schema drift on a warm-tier migration. Because every action is idempotent and version-controlled, this pattern also eliminates the race conditions that appear when several teams edit lifecycle settings by hand: instead of ad-hoc API calls that clobber each other, transitions flow through one auditable code path.

Prerequisites

Confirm the following before you point an orchestrator at a production cluster:

Elasticsearch 8.x cluster with the _ilm/explain, _ilm/move, and _ilm/retry endpoints reachable from the automation host.
elasticsearch-py v8+ installed (elasticsearch>=8.0,<9.0). The calls below (ilm.explain_lifecycle, ilm.move_to_step, ilm.retry) are written against the v8 client's keyword-argument API; older clients reach the same endpoints but expect a raw body for some of them.
An ILM policy already attached to your indices through an index template, with explicit rollover, shrink, forcemerge, and allocate actions and a bootstrapped write alias.
Data-tier node roles (data_hot, data_warm, data_cold) assigned so allocation targets exist; a transition into a tier with no matching node stalls, no matter how the orchestrator forces it.
An API key scoped by RBAC to manage_ilm plus manage on the managed index pattern, so the automation token can move and retry steps but cannot rewrite policies it does not own.
A concurrency budget: know how many simultaneous shrink/forcemerge operations your disk I/O and JVM heap can absorb before you let the loop trigger them in parallel.

Architecture: An Orchestration Loop Over the Native State Machine

The orchestrator does not replace ILM — it sits above the native state machine and drives it deterministically. Each cycle reads the authoritative phase, action, and step for every managed index from _ilm/explain, filters that set down to the indices whose custom thresholds are met, and only then issues a move_to_step or opens a reindex. Verification and alias rerouting close the loop before the next pass begins. The flow below is the control loop that the rest of this page implements; the broader lifecycle synchronization sequence treats this whole block as its “advance” state.

The orchestration loop: read live state, gate candidates on thresholds and physics, force the transition, verify, then repeat on the next pass.

The critical design rule is that forcing a step is only ever safe once its preconditions are independently verified. ILM's own steps guard themselves (a shrink will not start until shard counts and allocation allow it), but move_to_step bypasses the timer, not the physics. If you force an index into warm/shrink while the target shard count is not a factor of the source index.number_of_shards, or into a tier whose nodes are over their disk high-watermark, the step fails or stalls instead of completing. The orchestrator therefore treats every candidate as a gate: age or segment threshold met and disk headroom present and shard geometry compatible, before it acts.

Native ILM can only act on a poll_interval tick, so a threshold crossed mid-interval waits out the remainder; the orchestrator reads state continuously and forces the transition at the moment the threshold is met.
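The gate described above can be sketched as a pure function that takes already-collected facts and returns a decision plus a reason. The Candidate fields, the default thresholds, and the 90% watermark are illustrative assumptions, not values taken from any policy; the point is only that every check must pass before a forced move.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical snapshot of the facts the gate needs for one index."""
    index: str
    age_days: float
    segment_count: int
    source_shards: int
    target_shards: int
    tier_disk_percent: float  # highest disk usage among target-tier nodes


def gate(c: Candidate, max_age_days: float = 3, max_segments: int = 10,
         high_watermark_percent: float = 90.0) -> tuple[bool, str]:
    """Return (allowed, reason); every check must pass before a forced move."""
    threshold_met = c.age_days >= max_age_days or c.segment_count > max_segments
    if not threshold_met:
        return False, "threshold_not_met"
    if c.tier_disk_percent >= high_watermark_percent:
        return False, "no_disk_headroom"
    # Shrink needs the target shard count to be a factor of the source count.
    if c.target_shards < 1 or c.source_shards % c.target_shards != 0:
        return False, "incompatible_shard_geometry"
    return True, "ready"
```

Returning a reason string alongside the boolean keeps rejected candidates visible in the audit trail rather than silently skipped.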

Production Client Initialization and Configuration

Deterministic transitions start from a client that fails predictably. Production initialization must enforce TLS verification, authenticate with a rotatable API key rather than basic credentials, and bound every request with retries and a timeout so a slow node cannot wedge the loop. The following factory returns a hardened v8 client:

from elasticsearch import ApiError, Elasticsearch
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("ilm-orchestrator")


def get_production_client(nodes: list[str], api_key: str, ca_path: str) -> Elasticsearch:
    """Build a v8 client hardened for unattended automation."""
    return Elasticsearch(
        nodes,
        api_key=api_key,               # rotatable key scoped to manage_ilm; never basic auth
        ca_certs=ca_path,              # pin the cluster CA
        verify_certs=True,             # refuse to talk to an unverified endpoint
        retry_on_timeout=True,         # transient timeouts retry instead of raising
        max_retries=3,                 # bounded so a dead node fails fast
        request_timeout=30,            # per-request ceiling; keep the loop responsive
        sniff_on_start=True,           # discover the full node list at boot
        sniff_on_node_failure=True,    # re-sniff when a node drops out (v8 name for sniff_on_connection_fail)
    )


es = get_production_client(
    nodes=["https://es-node-01:9200", "https://es-node-02:9200"],
    api_key="YOUR_API_KEY",
    ca_path="/etc/elasticsearch/certs/ca.crt",
)

The configuration layer also defines the reindex templates the orchestrator falls back on. When an index migrates to warm or cold and a field type or analyzer has changed since it was created, an in-place transition cannot apply the new mapping: you need a zero-downtime _reindex into a freshly templated target, followed by an alias swap. Tie those targets to the correct tier by setting index.routing.allocation.include._tier_preference in the target's index template (ILM's migrate action manages the same setting on policy-driven indices, while the allocate action routes by custom node attributes instead), so a migration never floods the hot tier with shards that belong on cold nodes. Where and how those shards land is governed by the hot-warm-cold architecture; misrouted shard allocation is the most common hidden cause of what looks like a stuck transition but is really an UNASSIGNED target shard.
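One way to pin a reindex target to a tier is to carry the tier preference in the target's index template. The sketch below only builds the keyword arguments; the pattern name, the priority value, and the example mapping are placeholders chosen for illustration.

```python
def build_target_template(index_pattern: str, tier_preference: str, mappings: dict,
                          priority: int = 200) -> dict:
    """Return keyword arguments for indices.put_index_template (name excluded)."""
    return {
        "index_patterns": [index_pattern],
        "priority": priority,  # assumed high enough to win over overlapping templates
        "template": {
            "settings": {
                "index.routing.allocation.include._tier_preference": tier_preference,
            },
            "mappings": mappings,
        },
    }


template_kwargs = build_target_template(
    "logs-reindexed-*",                      # hypothetical target naming scheme
    "data_warm,data_hot",                    # prefer warm, fall back to hot
    {"properties": {"status": {"type": "keyword"}}},
)
# With a live client:
# es.indices.put_index_template(name="logs-reindexed", **template_kwargs)
```

Listing a fallback tier after the preferred one lets the target allocate somewhere if the warm tier is temporarily unavailable, at the cost of briefly landing on hot nodes.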

Configuration Reference: The Explain Response

Everything the orchestrator decides on comes from the _ilm/explain response, so it is worth being precise about its shape. Each entry under indices is a flat object — the fields you branch on are strings, not nested objects:

GET /logs-*/_ilm/explain
{
  "indices": {
    "logs-000042": {
      "index": "logs-000042",
      "managed": true,
      "policy": "custom-observability-policy",
      "phase": "hot",
      "action": "rollover",
      "step": "check-rollover-ready",
      "lifecycle_date_millis": 1719763200000,
      "phase_time_millis": 1719763200000,
      "step_info": {}
    }
  }
}

The two fields the loop leans on are step, where the literal value "ERROR" means the index is wedged and must be retried, not advanced, and lifecycle_date_millis, from which you derive age. step_info is empty on a healthy index and carries a type / reason pair when a step has failed (the step that failed is reported separately as failed_step); that object is your first diagnostic read whenever a transition refuses to complete.
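As a rough illustration of reading that shape, the helper below reduces one explain entry to the values the loop branches on. The function name and the returned keys are invented for this sketch, and it tolerates the failure details sitting either directly in step_info or under an error key, since the exact nesting is worth confirming against your own cluster's response.

```python
import time
from typing import Optional

MS_PER_DAY = 86_400_000


def summarize_entry(entry: dict, now_ms: Optional[float] = None) -> dict:
    """Reduce one _ilm/explain entry to the fields the loop branches on."""
    now_ms = time.time() * 1000 if now_ms is None else now_ms
    origin_ms = entry.get("lifecycle_date_millis")
    info = entry.get("step_info") or {}
    failure = info.get("error", info)  # accept either nesting (assumption)
    wedged = entry.get("step") == "ERROR"
    return {
        "index": entry.get("index"),
        "position": (entry.get("phase"), entry.get("action"), entry.get("step")),
        "age_days": None if origin_ms is None else (now_ms - origin_ms) / MS_PER_DAY,
        "wedged": wedged,
        "error_type": failure.get("type") if wedged else None,
        "error_reason": failure.get("reason") if wedged else None,
    }
```

Passing now_ms explicitly makes the age calculation reproducible in tests; in the running loop it defaults to the wall clock.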

Step-by-Step Implementation

1. Evaluate transition candidates against custom thresholds

Rather than waiting for the poll interval, query _ilm/explain directly, parse each index’s phase and step, and test it against thresholds you control — age for rollover-ready hot indices, segment count for warm indices due to shrink. Skip anything already terminal or already in ERROR (a broken step is a retry target, not a transition candidate):

import time


def evaluate_transition_candidates(
    index_pattern: str, max_age_days: int = 3, max_segments: int = 10
) -> list[dict]:
    """Return indices ready to advance, judged by age, segment count, and ILM state."""
    candidates: list[dict] = []
    try:
        explain_resp = es.ilm.explain_lifecycle(index=index_pattern)
        indices = explain_resp.get("indices", {})

        for idx_name, idx_data in indices.items():
            phase = idx_data.get("phase", "unknown")
            step = idx_data.get("step", "unknown")            # a string in the explain response
            entered_ms = idx_data.get("lifecycle_date_millis", 0)

            # Finished or wedged indices are not transition candidates
            # ("complete" is the terminal action name, not a phase).
            if phase == "delete" or idx_data.get("action") == "complete" or step == "ERROR":
                continue

            if phase == "hot" and step == "check-rollover-ready" and entered_ms:
                age_days = (time.time() * 1000 - entered_ms) / 86_400_000
                if age_days >= max_age_days:
                    candidates.append(
                        {"index": idx_name, "phase": phase, "reason": "age_threshold_exceeded"}
                    )

            elif phase == "warm" and step == "shrink":
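                # _cat/segments returns one row per segment per shard copy;
                # count primaries only so replicas do not inflate the figure.
                segments = es.cat.segments(index=idx_name, format="json")
                if sum(1 for s in segments if s.get("prirep") == "p") > max_segments: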
                    candidates.append(
                        {"index": idx_name, "phase": phase, "reason": "segment_threshold_exceeded"}
                    )

    except ApiError as exc:
        logger.error("Failed to evaluate ILM state for %s: %s", index_pattern, exc.info)

    return candidates

2. Force step advancement with an explicit current step

move_to_step requires you to name the step the index is currently on and the step to move it to; the move is rejected if current_step does not match live state, which is a safety feature — it prevents a stale plan from firing against an index that has since moved on. Use the explicit current_step= / next_step= keyword arguments (the v8 idiom) rather than a raw body:

def execute_phase_transition(
    index: str, current: dict, target_phase: str, action: str, step_name: str
) -> None:
    """Force ILM to advance one index, then reroute its query alias.

    `current` must reflect the live step from _ilm/explain, e.g.
    {"phase": "hot", "action": "rollover", "name": "check-rollover-ready"}.
    """
    try:
        es.ilm.move_to_step(
            index=index,
            current_step=current,
            next_step={"phase": target_phase, "action": action, "name": step_name},
        )
        logger.info("Moved %s -> %s/%s/%s", index, target_phase, action, step_name)

        # Re-register the index on the query alias as a non-write member so
        # searches keep resolving to it after the move.
        es.indices.update_aliases(
            actions=[
                {"remove": {"index": index, "alias": "logs-query"}},
                {"add": {"index": index, "alias": "logs-query", "is_write_index": False}},
            ]
        )
    except ApiError as exc:
        # A move rejected because current_step drifted, or a precondition failed,
        # surfaces here — route it to a review queue rather than retrying blindly.
        logger.error("Transition failed for %s: %s", index, exc.info)

3. Fall back to a reindex when the mapping has drifted

When a warm or cold migration needs a mapping the source index cannot accept, force-advancing the step will only produce an ERROR. Instead, create a target under the updated template, run _reindex with conflicts="proceed" so pre-existing documents are counted rather than fatal, verify the document count, and swap the alias. The full client-side application of policies to those new targets is covered in using the Python Elasticsearch client to apply ILM policies.

def migrate_with_reindex(source: str, dest: str) -> str:
    """Copy into a re-templated target and return the async task id."""
    response = es.reindex(
        source={"index": source, "size": 2000},
        dest={"index": dest, "op_type": "create"},   # skip docs already copied on a re-run
        conflicts="proceed",                          # count collisions, do not abort
        slices="auto",                                # one sub-task per source shard
        requests_per_second=1000,                     # throttle the write side
        wait_for_completion=False,                    # long copies must not block the loop
    )
    return response["task"]
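The remaining steps (wait for the task, verify the document count, swap the alias) might look like the sketch below. It takes the client as a parameter so it can be exercised against a stub, and it assumes the source index is no longer receiving writes, otherwise the counts will not match. The function name and the alias argument are illustrative.

```python
import time


def finalize_reindex(client, task_id: str, source: str, dest: str, alias: str,
                     poll_seconds: float = 10.0, sleep=time.sleep) -> bool:
    """Wait for the reindex task, compare counts, then swap the alias atomically."""
    while True:
        task = client.tasks.get(task_id=task_id)
        if task.get("completed"):
            break
        sleep(poll_seconds)

    if task.get("error") or task.get("response", {}).get("failures"):
        return False  # leave the alias on the source and flag for review

    client.indices.refresh(index=dest)  # make copied documents visible to _count
    src_count = client.count(index=source)["count"]
    dst_count = client.count(index=dest)["count"]
    if dst_count != src_count:
        return False

    # Both actions are applied in one request, so readers never see a gap.
    client.indices.update_aliases(
        actions=[
            {"remove": {"index": source, "alias": alias}},
            {"add": {"index": dest, "alias": alias}},
        ]
    )
    return True
```

Returning False instead of raising keeps the decision about what to do with a mismatched copy (re-run, inspect, or abandon) in the orchestrator rather than in this helper.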

4. Bound concurrency so transitions do not saturate I/O

Never hand the full candidate list to the cluster at once. shrink and forcemerge are I/O- and CPU-heavy, and firing dozens together spikes disk latency until allocation itself starts failing. Process candidates through a bounded window:

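def drive_transitions(
    candidates: list[dict], max_parallel: int = 3, settle_seconds: int = 60
) -> None:
    """Advance candidates in bounded batches to protect cluster I/O."""
    for start in range(0, len(candidates), max_parallel):
        batch = candidates[start : start + max_parallel]
        for c in batch:
            # Re-read live state right before the move so current_step is fresh.
            live = es.ilm.explain_lifecycle(index=c["index"])["indices"][c["index"]]
            current = {"phase": live["phase"], "action": live["action"], "name": live["step"]}
            execute_phase_transition(
                c["index"], current, target_phase="warm", action="shrink", step_name="shrink"
            )
        # move_to_step returns once the request is accepted, not when the work is
        # done, so pause before releasing the next batch (tune settle_seconds).
        if start + max_parallel < len(candidates):
            time.sleep(settle_seconds)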
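Because move_to_step returns as soon as the request is accepted, slicing the list into batches paces submissions rather than capping in-flight work. One way to cap it is a sliding window that only starts a new operation when an earlier one has finished. In this sketch start and is_busy are caller-supplied hooks (for example, a wrapper around the forced move and a check that the index is still in its shrink or forcemerge action); both names are hypothetical.

```python
import time
from typing import Callable, Iterable


def drive_bounded(candidates: Iterable[str], start: Callable[[str], None],
                  is_busy: Callable[[str], bool], max_parallel: int = 3,
                  poll_seconds: float = 15.0, sleep=time.sleep) -> None:
    """Keep at most max_parallel heavy operations in flight, refilling as they finish."""
    pending = list(candidates)
    in_flight: list[str] = []
    while pending or in_flight:
        # Drop operations that have finished since the last pass.
        in_flight = [idx for idx in in_flight if is_busy(idx)]
        # Top the window back up to the ceiling.
        while pending and len(in_flight) < max_parallel:
            idx = pending.pop(0)
            start(idx)
            in_flight.append(idx)
        if in_flight:
            sleep(poll_seconds)
```

The loop exits only when nothing is pending and nothing is still busy, so a caller that needs a hard deadline should wrap is_busy with its own timeout.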

Verification

A move_to_step that returns 200 means the request was accepted, not that the transition completed — always confirm the resulting state before you consider an index advanced. Three cheap reads close the loop:

# 1. Confirm the index reached the intended phase/action and is not wedged in ERROR
GET /logs-000042/_ilm/explain

# 2. Confirm shards actually landed on the target tier after an allocate/shrink
GET /_cat/allocation?v&h=node,node.role,disk.percent,shards

# 3. If a shard is UNASSIGNED after the move, ask the cluster why
GET /_cluster/allocation/explain
{ "index": "logs-000042", "shard": 0, "primary": true }

The _ilm/explain response must show the new phase/action and a step other than ERROR; if step is "ERROR", read step_info.type and step_info.reason for the cause before retrying. The _cat/allocation read confirms the shards followed the tier preference rather than piling back onto the hot nodes, and _cluster/allocation/explain gives the authoritative reason for any shard that refused to move, which is almost always a disk watermark or a missing data-tier node role. For turning these checks into standing alerts instead of manual reads, wire them into monitoring ILM execution and error states.
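The first of those reads can be automated with a small poller. This is a sketch under assumptions: fetch_entry is a caller-supplied function returning the index's current _ilm/explain entry, and "advanced" is taken to mean the index has reached the target phase or a later one, which sidesteps the case where the index has already moved past the specific action you forced.

```python
import time

PHASE_ORDER = ("new", "hot", "warm", "cold", "frozen", "delete")


def transition_status(entry: dict, target_phase: str) -> str:
    """Classify one _ilm/explain entry as 'error', 'advanced', or 'pending'."""
    if entry.get("step") == "ERROR":
        return "error"
    phase = entry.get("phase")
    if phase in PHASE_ORDER and PHASE_ORDER.index(phase) >= PHASE_ORDER.index(target_phase):
        return "advanced"
    return "pending"


def wait_for_phase(fetch_entry, target_phase: str, timeout_s: float = 300.0,
                   poll_s: float = 10.0, sleep=time.sleep, clock=time.monotonic) -> str:
    """Poll until the index advances or errors; return 'timeout' if neither happens."""
    deadline = clock() + timeout_s
    while True:
        status = transition_status(fetch_entry(), target_phase)
        if status != "pending":
            return status
        if clock() >= deadline:
            return "timeout"
        sleep(poll_s)
```

A "timeout" result is a cue to run the allocation checks, not to re-issue the move.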

Threshold Tuning and Performance Guidance

The thresholds that drive the loop are a balance between transition velocity and cluster stability, and the wrong settings degrade a deployment faster than native polling ever would.

Age and segment thresholds. Set max_age_days from data-value decay, not a round number; logs commonly drop to warm at 3–7 days. Keep the segment threshold that gates shrink conservative. A warm index with hundreds of small segments benefits from forcemerge, but forcing shrink to a target shard count that is not a factor of the source number_of_shards produces an immediate illegal_argument_exception.
Concurrency ceiling. The single most important tuning knob is max_parallel. ILM's forcemerge action makes the index read-only and rewrites its segments, which is heavy on disk and CPU; three concurrent merges is a safe starting point for a cluster with dedicated warm nodes, but watch GET _cat/thread_pool/force_merge?v&h=node_name,active,queue and back off if the queue grows.
Disk headroom before every move. Read cluster.routing.allocation.disk.watermark.high and compare it against live tier usage before forcing a tier migration. An orchestrator that ignores watermarks simply converts a slow transition into a fast ERROR.
Heap and shard count. Each shard costs heap; a shrink that collapses many shards into one reduces cluster-wide shard pressure, which is often the real win of the warm phase. Size the target so shards stay in the tens-of-GB range, aligning with the rollover conditions that produced the index in the first place — a poll loop cannot fix shards that were sized wrong at rollover.
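To make the shard-geometry and sizing guidance concrete, the sketch below picks a shrink target that is both a valid factor of the source shard count and keeps each resulting shard under a size cap. The 50 GB default is an assumed stand-in for "tens of GB", and index_size_gb is meant to be the primary store size.

```python
def pick_shrink_target(source_shards: int, index_size_gb: float,
                       max_shard_gb: float = 50.0) -> int:
    """Smallest factor of source_shards that keeps each shard at or under max_shard_gb."""
    if source_shards < 1:
        raise ValueError("source_shards must be at least 1")
    for target in range(1, source_shards + 1):
        # Only factors of the source count are legal shrink targets.
        if source_shards % target == 0 and index_size_gb / target <= max_shard_gb:
            return target
    # Even the current layout exceeds the cap, so shrinking would only make it worse.
    return source_shards
```

For example, an 8-shard index holding 120 GB resolves to 4 target shards (30 GB each), while a 5-shard index of the same size has no smaller legal target and stays at 5.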

Troubleshooting

Forced transitions fail predictably, and each failure maps to an observable signal. Work them systematically rather than re-issuing the move.

| Symptom | Root cause | Resolution |
| --- | --- | --- |
| `step` sits at `ERROR` immediately after a move | Precondition unmet: shard geometry, mapping, or allocation | Read `step_info.type` and `step_info.reason` from `_ilm/explain`; fix the cause, then `es.ilm.retry(index=...)` (`POST /<index>/_ilm/retry`) to re-run the failed step. |
| `move_to_step` returns `illegal_argument_exception` | `current_step` does not match live state, or the index moved since you read it | Re-read `_ilm/explain` immediately before the move and pass the fresh `phase`/`action`/`step` as `current_step`. |
| Shrink fails with `illegal_argument_exception` | The target shard count is not a factor of the source `index.number_of_shards` | Pick a target `number_of_shards` that divides the source count evenly and re-run; verify the source count with `GET /<index>/_settings`. |
| Transition accepted but shards stay `UNASSIGNED` | Target tier over disk high-watermark, or no node carries the target data-tier role | Inspect `GET _cluster/allocation/explain`; free disk or add capacity, then `POST /<index>/_ilm/retry`. |
| `_reindex` aborts with `mapper_parsing_exception` | A source field is incompatible with the new target mapping | Isolate the field via `GET /<target>/_mapping`, apply `coerce`/`ignore_above` or filter the source query, and re-run under `conflicts: proceed`. |
| Loop blocks for minutes on a large reindex | Copy run with `wait_for_completion=true` | Submit with `wait_for_completion=False` and poll `es.tasks.get(task_id=...)` so the orchestrator thread stays free. |

For failures that recur across many indices rather than one — a systematic policy or template defect — capture and route them programmatically as described in handling ILM step execution failures programmatically, and for transitions that are slow rather than broken, work through troubleshooting ILM phase transition delays.
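A minimal sketch of the "fix the cause, then retry" rule: retry only the wedged indices whose failure type you have marked as fixed, and queue everything else for review. The function name, the fixed_causes argument, and the assumed location of the failure type (step_info.type) are all illustrative.

```python
def retry_wedged(client, index_pattern: str, fixed_causes: set) -> dict:
    """Retry ERROR-state indices whose cause is known fixed; queue the rest for review."""
    outcome = {"retried": [], "review": []}
    entries = client.ilm.explain_lifecycle(index=index_pattern).get("indices", {})
    for name, entry in entries.items():
        if entry.get("step") != "ERROR":
            continue  # healthy indices are left to ILM
        # Assumed location of the failure type in the explain entry.
        cause = (entry.get("step_info") or {}).get("type", "unknown")
        if cause in fixed_causes:
            client.ilm.retry(index=name)
            outcome["retried"].append(name)
        else:
            outcome["review"].append((name, cause))
    return outcome
```

Keeping the allow-list explicit prevents the loop from retrying the same unfixed failure on every pass.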

Frequently Asked Questions

Does forcing a step with move_to_step bypass a phase's min_age?

Yes — _ilm/move jumps the index straight to the named step regardless of min_age, which is exactly why it is powerful and dangerous. It does not bypass the step's physical preconditions: a forced shrink still needs compatible shard geometry, and a forced allocate still needs a node in the target tier with disk headroom. Use it only after those preconditions are independently verified, and prefer letting native ILM advance on schedule when timing is not the problem.

What is the difference between ilm.move_to_step and ilm.retry?

retry re-runs the step an index is already stuck on after you have fixed its cause; it applies only to an index whose step is ERROR and does not change which step it is on. move_to_step deliberately relocates an index from one step to a different one, which you use to skip ahead after correcting a policy. Reach for retry to recover a wedged index and move_to_step only to override the schedule on a healthy one.

Why query _ilm/explain instead of just lowering indices.lifecycle.poll_interval?

Lowering the poll interval makes native ILM check more often but still applies the same fixed conditions and gives you no custom logic, no per-index decisioning, and no place to gate on disk headroom or concurrency. A very low interval also adds cluster-state overhead. Reading _ilm/explain from an orchestrator lets you combine age, segment count, and watermark pressure into one decision, bound how many heavy operations run at once, and produce an audit trail — none of which a shorter timer can do.

How many phase transitions can I safely run in parallel?

It is bounded by I/O, not by an ILM setting. shrink and forcemerge saturate disk and CPU on the target nodes, so start around three concurrent heavy operations per dedicated warm/cold node and watch GET _cat/thread_pool/force_merge and disk latency. If the force-merge queue grows or disk latency climbs, lower the ceiling — a slower, stable loop always beats a fast one that triggers allocation failures.

The orchestrator forced a transition but the index will not leave hot. What now?

Read _ilm/explain and check two things: whether step is ERROR (then step_info.type and step_info.reason name the cause) and whether the target tier can actually accept the shards. Run GET _cluster/allocation/explain; the usual culprit is a disk high-watermark breach or a missing data_warm/data_cold node role, both of which halt the move to prevent data loss. Fix the underlying condition, then es.ilm.retry(index=...); re-forcing the move without fixing the cause just reproduces the same ERROR.

Related

Using the Python Elasticsearch client to apply ILM policies: the client-side patterns for attaching and re-applying policies to the targets this orchestrator creates.
Building custom ILM policies via API: the policy payloads and template binding that the transitions here act on.
Monitoring ILM execution and error states: turning _ilm/explain reads into standing alerts on stalled steps.
Handling ILM step execution failures programmatically: capturing and routing recurring step errors from automation.
Troubleshooting ILM phase transition delays: cluster-level tuning when transitions are slow rather than broken.
