ILM Policy Design & Lifecycle Synchronization

Production Elasticsearch clusters degrade rapidly when lifecycle management relies on implicit defaults or ad-hoc manual intervention. ILM Policy Design & Lifecycle Synchronization is the discipline of treating every lifecycle definition as a versioned, idempotent contract that stays byte-for-byte consistent across every environment it targets. This domain covers how to author deterministic phase state machines, bind them to index templates, route shards across tiers, and continuously reconcile the policy you declared with the policy the deployment is actually running. It matters in production because a single unsynchronized policy edit can silently delete a week of log data, exhaust hot-tier disk during a traffic spike, or strand indices in a phase they can never leave. Engineers must treat ILM as a distributed control plane, governed by the same rigor as any other piece of infrastructure-as-code.

This page is the anchor for three deeper areas: building custom ILM policies via the REST API, automating phase transitions with Python, and monitoring ILM execution and error states. It sits alongside the broader ILM architecture and fundamentals and automated reindexing pipelines that lifecycle policies must stay coordinated with.

Synchronization as a Control Loop

Policy synchronization is a closed feedback loop, not a one-shot deploy. You define the desired state in version control, apply it idempotently, verify what the deployment actually adopted, and reconcile any drift back to the source of truth. The diagram below shows that loop end to end.

Synchronization is a closed loop: declare, validate, apply idempotently, reconcile, and converge any drift back to the source of truth.

The critical property is idempotency: applying the same policy definition twice must be a no-op, and applying a changed definition must converge the deployment to the new state without ever leaving indices in a half-migrated condition. Everything that follows — tier routing, phase mechanics, RBAC, automation, and observability — exists to keep this loop closed and auditable.

Core Architecture & Tier Routing

ILM operates as a declarative state machine bound to index templates and evaluated by a single cluster-level coordinator that wakes on the indices.lifecycle.poll_interval (10 minutes by default). Understanding the topology it drives is the prerequisite for any synchronization strategy.

Node roles and allocation awareness

Tiered lifecycle management assumes a hot-warm-cold architecture in which nodes advertise their tier through the node.roles list (data_hot, data_warm, data_cold, data_frozen) or, in older topologies, custom attributes such as node.attr.data: hot. When a policy enters a new phase, the coordinator rewrites the target index’s allocation settings — either the modern index.routing.allocation.include._tier_preference or the legacy index.routing.allocation.require.data — and the shard allocator relocates shards to nodes that satisfy the constraint. If no eligible node has capacity, the shard stays put and the phase stalls in a WAITING state rather than failing loudly.

Allocation awareness and disk watermarks are therefore first-class inputs to lifecycle design. The allocator honors cluster.routing.allocation.disk.watermark.low, .high, and .flood_stage; when a target tier crosses the high watermark, relocation halts and the index cannot complete its transition. Designing a policy without accounting for the free space on the destination tier is the single most common cause of stuck warm and cold migrations. Where a tier can legitimately run out of capacity, a fallback routing strategy for data retention using an ordered _tier_preference (for example data_warm,data_hot) keeps shards allocated instead of leaving them unassigned.
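The capacity check described above can be approximated before a policy ships. The following is a minimal sketch, not the allocator's actual algorithm: `NodeDisk` and `nodes_with_headroom` are hypothetical names, the disk figures are assumed to come from something like `_cat/allocation`, and 90% is used as the high watermark because that is the Elasticsearch default.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeDisk:
    """Disk figures for one node in the destination tier (hypothetical helper type)."""
    name: str
    total_bytes: int
    used_bytes: int


def nodes_with_headroom(nodes: List[NodeDisk], shard_bytes: int,
                        high_watermark_pct: float = 90.0) -> List[str]:
    """Return nodes that would stay below the high watermark after receiving the shard."""
    eligible = []
    for node in nodes:
        projected_pct = 100.0 * (node.used_bytes + shard_bytes) / node.total_bytes
        if projected_pct < high_watermark_pct:
            eligible.append(node.name)
    return eligible


warm_tier = [
    NodeDisk("warm-1", total_bytes=1000, used_bytes=850),
    NodeDisk("warm-2", total_bytes=1000, used_bytes=500),
]
# A 100-byte shard would push warm-1 to 95%, so only warm-2 qualifies.
print(nodes_with_headroom(warm_tier, shard_bytes=100))
```

An empty result for the largest expected shard is a signal that the transition would stall, and that the tier needs capacity or an ordered `_tier_preference` fallback.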

Binding a policy to a template

Every policy must be attached through an index template so that new backing indices inherit it automatically. Set index.lifecycle.name to attach the policy. For alias-based rollover, also set index.lifecycle.rollover_alias to decouple write traffic from the physical index behind it; data streams manage their own write index and do not need this setting.

PUT _index_template/logs-app-template
{
  "index_patterns": ["logs-app-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "logs-app-policy",
      "index.lifecycle.rollover_alias": "logs-app",
      "index.number_of_shards": 3,
      "index.number_of_replicas": 1
    }
  }
}

Choosing max_primary_shard_size without accounting for segment count, mapping overhead, or query patterns tends to produce uneven disk pressure and throttled ingestion. A reasonable target is 30–50 GB per primary shard for healthy Lucene merge behavior; size the rollover trigger to land inside that band under peak ingestion velocity. The exact payload structure required for deterministic policy attachment and version-controlled rollout across staging and production is detailed in building custom ILM policies via API.
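The sizing arithmetic can be worked through with a small helper. This is a rough sketch under two stated assumptions: documents are spread evenly across primaries, and the ingest rate is expressed as on-disk primary bytes rather than raw payload size. The function name and the 12.5 GB/hour figure are illustrative only.

```python
def hours_until_size_rollover(ingest_gb_per_hour: float, primary_shards: int,
                              max_primary_shard_gb: float) -> float:
    """Estimate hours until the size condition trips, assuming evenly balanced primaries."""
    if ingest_gb_per_hour <= 0:
        raise ValueError("ingest_gb_per_hour must be positive")
    # Each primary fills at ingest / primary_shards, so the index as a whole
    # reaches the threshold after primaries * threshold GB have been written.
    return (primary_shards * max_primary_shard_gb) / ingest_gb_per_hour


# 3 primaries capped at 50 GB each, ingesting 12.5 GB/hour at peak (assumed figures).
peak_hours = hours_until_size_rollover(12.5, primary_shards=3, max_primary_shard_gb=50)
print(peak_hours)
```

If the estimate is well under the age ceiling at peak, size drives rollover and shards land near the threshold; if it is far above, the age condition fires first and shards end up smaller than the target band.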

The coordinator rewrites tier preference and the allocator relocates shards on each min_age; a destination tier that has crossed disk.watermark.high cannot accept the shard, stalling the transition.

Phase Transition Mechanics

Transitions between hot, warm, cold, frozen, and delete are not unconditional; a phase begins only when its min_age has elapsed and cluster allocation permits the required shard movement. For an index that has rolled over, min_age is measured from the rollover time; otherwise it is measured from index creation. Each phase runs its actions in a fixed order, and the coordinator will not begin the next action until the current one reports complete.

Hot phase: rollover and priority

The hot phase is where an index actively receives writes. Its defining action is rollover, which caps the size of any single backing index. Rollover conditions are evaluated as a logical OR, so the first threshold to trip fires the rollover; pair a size ceiling with an age ceiling to bound both large and low-volume indices.

"hot": {
  "min_age": "0ms",
  "actions": {
    "rollover": {
      "max_primary_shard_size": "50gb",
      "max_age": "1d"
    },
    "set_priority": { "priority": 100 }
  }
}

Avoid overlapping conditions that can trigger competing rollovers during a burst; a single size threshold plus a single age threshold is almost always the correct design.
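The OR semantics of the two thresholds can be illustrated with a small predicate. This is only a sketch of the decision rule, not how Elasticsearch evaluates it internally; `rollover_reasons` is a hypothetical name and the defaults mirror the 50 GB / 1 day values in the hot phase shown earlier.

```python
from typing import List


def rollover_reasons(largest_primary_gb: float, index_age_hours: float,
                     max_primary_shard_gb: float = 50.0,
                     max_age_hours: float = 24.0) -> List[str]:
    """Return the conditions that are met; any non-empty result means rollover is due."""
    reasons = []
    if largest_primary_gb >= max_primary_shard_gb:
        reasons.append("max_primary_shard_size")
    if index_age_hours >= max_age_hours:
        reasons.append("max_age")
    return reasons


# A quiet index: far below the size ceiling, but a day old, so age fires the rollover.
print(rollover_reasons(largest_primary_gb=4.0, index_age_hours=25.0))
# A burst: the size ceiling trips long before the index is a day old.
print(rollover_reasons(largest_primary_gb=51.0, index_age_hours=3.0))
```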

Warm phase: force merge and relocation

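Once an index rolls over, it becomes eligible for the warm phase after its min_age. Warm typically relocates the index to warm-tier nodes and then force-merges each shard down toward a single segment to reclaim heap and speed up queries. ILM runs a phase's actions in its own fixed order regardless of how they are listed in the JSON, so the relocation precedes the merge.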

"warm": {
  "min_age": "2d",
  "actions": {
    "forcemerge": { "max_num_segments": 1 },
    "allocate": { "number_of_replicas": 1 },
    "set_priority": { "priority": 50 }
  }
}

Force merge is CPU- and IO-intensive and irreversible for the segments it rewrites — never run it against an index still receiving writes, which is why it is scoped to post-rollover phases only.

Cold and delete phases

The cold phase makes an index read-only and, where storage economics demand it, mounts it as a searchable snapshot to move primary data off local disk. The delete phase removes the index — and optionally its snapshot — once the retention window closes.

"cold": {
  "min_age": "30d",
  "actions": {
    "searchable_snapshot": { "snapshot_repository": "archive-repo" },
    "set_priority": { "priority": 0 }
  }
},
"delete": {
  "min_age": "90d",
  "actions": { "delete": { "delete_searchable_snapshot": true } }
}

Phase transitions stall when cluster health degrades, allocation filters mismatch node roles, or disk watermark thresholds are breached. Injecting deterministic wait loops and retry logic to prevent stuck states during high-throughput ingestion windows is demonstrated in automating phase transitions with Python.

Cross-Domain Synchronization Workflows

In multi-tenant or multi-environment deployments, policy drift causes catastrophic data loss or storage exhaustion. The remedy is to treat ILM definitions as infrastructure-as-code: the canonical policy lives in a repository, and no cluster is ever edited by hand.

Single source of truth. Store each policy as a versioned JSON document. A change is a pull request, reviewed and merged, never a live console edit.
Schema validation before apply. Validate structure and phase ordering in CI before the payload reaches any cluster. There is no server-side _ilm/validate endpoint, so the guardrail must live in your pipeline.
Ordered rollout. Apply to a staging cluster, run an explain_lifecycle reconciliation to confirm adoption, then promote the identical artifact to production. The artifact never changes between environments — only its target does.
Reconciliation over blind re-apply. Compare the live policy body against the desired body and only issue a write when they differ, so a redeploy of unchanged policy is a genuine no-op.
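The pre-apply validation step could look something like the following. This is a minimal sketch of a CI guardrail, not a complete schema check: `validate_policy` and `parse_min_age_ms` are hypothetical helpers, only a subset of Elasticsearch time units is accepted, and the only rules enforced are known phase names and non-decreasing min_age across phases.

```python
import re
from typing import List

PHASE_ORDER = ["hot", "warm", "cold", "frozen", "delete"]
_UNIT_MS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}


def parse_min_age_ms(value: str) -> int:
    """Convert a duration such as '30d' or '0ms' to milliseconds (subset of ES units)."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)", value)
    if not match:
        raise ValueError(f"unsupported min_age: {value!r}")
    return int(match.group(1)) * _UNIT_MS[match.group(2)]


def validate_policy(body: dict) -> List[str]:
    """Return a list of problems; an empty list means the payload passed these checks."""
    phases = body.get("policy", {}).get("phases", {})
    if not phases:
        return ["policy.phases is missing or empty"]
    errors = [f"unknown phase: {name}" for name in phases if name not in PHASE_ORDER]
    previous_age = -1
    for name in PHASE_ORDER:
        if name not in phases:
            continue
        try:
            age = parse_min_age_ms(phases[name].get("min_age", "0ms"))
        except ValueError as exc:
            errors.append(f"{name}: {exc}")
            continue
        if age < previous_age:
            errors.append(f"{name}: min_age is earlier than a preceding phase")
        previous_age = age
    return errors


good = {"policy": {"phases": {
    "hot": {"min_age": "0ms"}, "warm": {"min_age": "2d"},
    "cold": {"min_age": "30d"}, "delete": {"min_age": "90d"}}}}
bad = {"policy": {"phases": {
    "hot": {"min_age": "0ms"}, "warm": {"min_age": "30d"}, "cold": {"min_age": "2d"}}}}
print(validate_policy(good))
print(validate_policy(bad))
```

A pipeline would fail the build whenever the returned list is non-empty, before any cluster sees the payload.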

When data is replicated across clusters with cross-cluster replication, ILM state must be reconciled deliberately. Follower indices are read-only and their lifecycle is managed on the leader: follower clusters should inherit the lifecycle configuration but must not act on the leader’s physical index state until an index is promoted. Lifecycle synchronization also has to stay coordinated with any automated reindexing pipeline that rewrites indices in place — a reindex that recreates an index without re-attaching its policy silently drops that index out of lifecycle management.
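One way to catch an index that has dropped out of lifecycle management after a reindex is to scan the explain output for unmanaged or mis-attached indices. The sketch below assumes the input is the parsed body of an `_ilm/explain` response, whose per-index entries carry `managed` and `policy` fields; `find_lifecycle_drift` is a hypothetical name.

```python
from typing import Dict


def find_lifecycle_drift(explain_response: dict, expected_policy: str) -> Dict[str, str]:
    """Map each drifted index to a short reason; an empty dict means no drift was found."""
    drifted = {}
    for name, state in explain_response.get("indices", {}).items():
        if not state.get("managed", False):
            drifted[name] = "not managed by ILM"
        elif state.get("policy") != expected_policy:
            drifted[name] = f"attached to {state.get('policy')!r}"
    return drifted


# Illustrative response: the reindexed target was created without index.lifecycle.name.
sample = {"indices": {
    "logs-app-000001": {"managed": True, "policy": "logs-app-policy"},
    "logs-app-reindexed": {"managed": False},
    "logs-app-000002": {"managed": True, "policy": "old-policy"},
}}
print(find_lifecycle_drift(sample, "logs-app-policy"))
```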

Security & Governance

Lifecycle automation is powerful enough to delete production data, so who can change a policy — and how that change is recorded — is a design concern, not an afterthought.

Role separation and RBAC

Separate the ability to author a policy from the ability to apply it, and both from the ability to manage the indices a policy governs. Managing lifecycle policies requires the manage_ilm privilege at cluster scope; applying a policy to indices and executing rollover requires index-level manage on the target pattern. A CI/CD service account should hold exactly the privileges it needs and nothing more.

POST _security/role/ilm-deployer
{
  "cluster": ["manage_ilm"],
  "indices": [
    {
      "names": ["logs-app-*"],
      "privileges": ["manage", "view_index_metadata"]
    }
  ]
}

Application tokens that merely write documents should never carry manage_ilm. The full pattern for scoping author-versus-apply boundaries is covered in securing ILM policies with RBAC.
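A periodic audit can flag roles that hold lifecycle-write privileges without being on an approved list. This sketch assumes its input has the shape returned by `GET _security/role` (role name mapped to a definition with a `cluster` list) and treats the broader `manage` and `all` cluster privileges as implying ILM management, which is an assumption worth confirming against your version's privilege reference. The function and role names are illustrative.

```python
from typing import Dict, List, Set

# Cluster privileges assumed to allow creating or deleting ILM policies.
ILM_WRITE_PRIVILEGES = {"manage_ilm", "manage", "all"}


def roles_granting_ilm_write(roles: Dict[str, dict], approved: Set[str]) -> List[str]:
    """Return unapproved role names whose cluster privileges permit ILM changes."""
    offenders = []
    for role_name, definition in roles.items():
        granted = ILM_WRITE_PRIVILEGES.intersection(definition.get("cluster", []))
        if granted and role_name not in approved:
            offenders.append(role_name)
    return sorted(offenders)


roles = {
    "ilm-deployer": {"cluster": ["manage_ilm"]},
    "app-writer": {"cluster": ["monitor", "manage_ilm"]},  # should not hold manage_ilm
    "dashboard-reader": {"cluster": []},
}
print(roles_granting_ilm_write(roles, approved={"ilm-deployer"}))
```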

Ownership, audit, and CI/CD

Every policy needs a named owner — a team, not a person — recorded alongside the definition in version control. Enable Elasticsearch audit logging so that put_lifecycle, delete_lifecycle, and move_to_step calls are attributable to a principal and a timestamp; correlate those events with the corresponding pull request to close the loop between what was declared and what was executed. The deployment itself belongs in a pipeline: validate, apply to staging, reconcile, promote, and record the run. This makes lifecycle changes reviewable, reversible, and forensically traceable when a retention incident is investigated after the fact.

Production Automation with Python v8+

Modern automation relies on the official Python v8+ client, which provides type-hinted APIs, an async variant (AsyncElasticsearch), and connection pooling. Synchronizing local state with remote cluster state requires careful handling of rate limits and transient network failures. A robust reconciliation loop compares the desired policy against the live one, applies only real differences, and calls ilm.explain_lifecycle with exponential backoff on transient errors. Complex orchestration, such as dynamic alias swapping, concurrent policy updates across hundreds of indices, and transactional rollback on failure, builds on the same idempotent primitives shown below.

import time
import logging
from elasticsearch import Elasticsearch
from elasticsearch.exceptions import ApiError, ConnectionError, NotFoundError

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def apply_ilm_policy_idempotent(client: Elasticsearch, policy_name: str, policy_body: dict) -> bool:
    """Deploy or update an ILM policy only if the definition has changed."""
    try:
        existing = client.ilm.get_lifecycle(name=policy_name)
        current_policy = existing[policy_name].get("policy", {})
        if current_policy == policy_body.get("policy", {}):
            logger.info("Policy %s is already synchronized.", policy_name)
            return True
    except NotFoundError:
        # First-time deployment: no existing policy to compare against.
        pass

    client.ilm.put_lifecycle(name=policy_name, policy=policy_body["policy"])
    logger.info("Policy %s deployed successfully.", policy_name)
    return True

def verify_ilm_transition(client: Elasticsearch, index_pattern: str, max_retries: int = 5) -> dict:
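    """Fetch ILM explain state, retrying transient API/connection errors with exponential backoff.

    Returns the per-index state from the first successful call and raises
    RuntimeError if any matched index is in the ERROR step.
    """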
    backoff = 1.0
    for attempt in range(max_retries):
        try:
            response = client.ilm.explain_lifecycle(index=index_pattern)
            indices = response.get("indices", {})
            if not indices:
                logger.warning("No indices matched pattern: %s", index_pattern)
                return {}

            # Check for stuck or error states. In the explain response, `step` is a
            # string and the error detail lives in the separate `step_info` object.
            for idx_name, state in indices.items():
                if state.get("step") == "ERROR":
                    raise RuntimeError(f"ILM stuck on {idx_name}: {state.get('step_info')}")
            return indices
        except (ApiError, ConnectionError) as e:
            logger.warning("ILM explain failed (attempt %d/%d): %s", attempt + 1, max_retries, e)
            time.sleep(backoff)
            backoff = min(backoff * 2, 30.0)
    raise TimeoutError("ILM state verification timed out.")

Two details make this production-safe. First, a missing policy raises NotFoundError (a subclass of ApiError) rather than requiring a manual status_code check, so first-time deployment falls through cleanly to the write. Second, the reconciliation only writes when the live policy body differs from the desired one, preserving idempotency across repeated pipeline runs. When a step genuinely wedges, recover it with the per-index retry — client.ilm.retry(index=idx_name) — after resolving the underlying allocation or watermark problem; there is no retry_lifecycle method, and a blind retry against an unresolved cause simply loops back to ERROR.

Monitoring & Observability

ILM execution is asynchronous and subject to cluster resource contention. Blindly trusting the background daemon leads to silent failures, so proactive state tracking is mandatory. The core signal is the explain API, correlated with allocation diagnostics.

Endpoint	Purpose
`GET <index>/_ilm/explain`	Current phase, action, step, and `step_info` for each index — the primary drift and error signal.
`GET _cat/allocation?v`	Per-node disk usage and shard counts; reveals which tier is near a watermark.
`GET _cluster/allocation/explain`	Why a specific shard is unassigned or cannot relocate — the root-cause tool for a stalled transition.
`GET _ilm/status`	Whether the ILM coordinator is `RUNNING`, `STOPPING`, or `STOPPED` cluster-wide.

Alert on three conditions: any index whose step equals ERROR, any index whose time-in-step exceeds a phase-appropriate SLA (a proxy for a silent stall), and any target tier crossing its high disk watermark. When policies do fail, automated fallback mechanisms must be available: temporary suspension via POST _ilm/stop, a manual step override through POST _ilm/move/<index> (with current_step and next_step in the request body), or graceful degradation to a retention-only policy. Continuous observability into step execution, error codes, and retry counts is detailed in monitoring ILM execution and error states.
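The first two alert conditions can be derived directly from the explain output. The sketch below assumes each per-index entry exposes `managed`, `step`, and `step_time_millis` (the epoch time at which the current step was entered); `ilm_alerts` is a hypothetical name, and a single SLA is used for brevity where a real check would vary it by phase. Indices resting in a phase's terminal `complete` step are skipped, since they are simply waiting for the next phase's min_age.

```python
import time
from typing import Dict, List, Optional, Tuple


def ilm_alerts(explain_indices: Dict[str, dict], max_step_age_ms: int,
               now_ms: Optional[int] = None) -> List[Tuple[str, str]]:
    """Return (index, reason) pairs for indices in ERROR or stuck in a step too long."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    alerts = []
    for name, state in sorted(explain_indices.items()):
        if not state.get("managed", False):
            continue
        step = state.get("step")
        if step == "ERROR":
            alerts.append((name, "error"))
            continue
        if step == "complete":
            continue  # idle between phases, not stalled
        entered_ms = state.get("step_time_millis")
        if entered_ms is not None and now_ms - entered_ms > max_step_age_ms:
            alerts.append((name, "stalled"))
    return alerts


# Illustrative states evaluated at a fixed clock so the result is reproducible.
states = {
    "idx-a": {"managed": True, "step": "ERROR", "step_time_millis": 9_000},
    "idx-b": {"managed": True, "step": "check-migration", "step_time_millis": 1_000},
    "idx-c": {"managed": True, "step": "check-migration", "step_time_millis": 8_000},
    "idx-d": {"managed": False},
    "idx-e": {"managed": True, "step": "complete", "step_time_millis": 0},
}
print(ilm_alerts(states, max_step_age_ms=5_000, now_ms=10_000))
```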

Explain-driven signals feed one gate; any failing condition pages on-call and drives remediation, while an all-clear result loops back into continuous polling.

Common Failure Modes

Symptom	Root cause	Remediation
Index stuck in `WAITING` on a warm/cold transition	Target tier lacks capacity or crossed `disk.watermark.high`	Free space or add nodes to the tier and verify with `_cluster/allocation/explain`; ILM resumes once the shard can allocate. Use `ilm.retry` only if the step has moved to `ERROR`.
Rollover never fires	No `rollover_alias` set, or the alias `is_write_index` is missing	Bind `index.lifecycle.rollover_alias` in the template and bootstrap the first index with `is_write_index: true`.
Two clusters running different policy versions	Live console edit bypassed version control	Re-apply the canonical artifact from CI; enforce apply-only-from-pipeline via RBAC.
Step wedged in `ERROR` and self-retry loops	Underlying allocation or mapping issue unresolved	Diagnose via `_ilm/explain` `step_info`, fix the cause, then `POST <index>/_ilm/retry`.
Index silently drops out of lifecycle after reindex	Reindexed target created without `index.lifecycle.name`	Attach the policy in the destination template before reindexing; confirm with `_ilm/explain`.
Force merge saturates the deployment during ingestion	`forcemerge` scheduled while the index still takes writes	Keep force merge in post-rollover phases only; never in `hot`.

Related

Building custom ILM policies via API — version-controlled policy payloads and rollout.
Automating phase transitions with Python — active orchestration and stuck-state recovery.
Monitoring ILM execution and error states — explain-driven alerting and remediation.
ILM architecture and fundamentals — node topology, allocation awareness, and phase-state basics.
Automated reindexing pipelines and workflows — keeping reindexed indices in lifecycle sync.
