Building Custom ILM Policies via API

Default lifecycle templates and the Kibana policy editor rarely survive contact with a high-throughput ingestion pipeline. Clicking through a UI produces a policy that lives only in cluster state, drifts silently between staging and production, and leaves no reviewable diff when someone changes a rollover threshold at 2 a.m. The operational challenge this page solves is turning an ILM policy into version-controlled, programmable infrastructure: a declarative JSON contract you apply idempotently through the REST API, bind to an index template, and verify against live cluster state before it ever touches production data.

Building policies through the API decouples policy definition from index creation. You can adjust rollover thresholds, force-merge targets, and cold-tier allocation rules without interrupting active indexing, and every change flows through the same review and rollback machinery as the rest of your infrastructure-as-code. This page is the programmatic authoring layer inside the broader discipline of ILM policy design and lifecycle synchronization; it pairs with automating phase transitions with Python for scheduled orchestration and monitoring ILM execution and error states for the observability that confirms a deployed policy actually runs.

Prerequisites

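Elasticsearch 8.x cluster reachable over HTTPS. The policy on this page routes shards with allocate require rules on a custom node attribute, so the nodes each phase targets need node.attr.data set to hot, warm, or cold. The data_hot, data_warm, and data_cold node roles drive tier-preference routing instead and do not satisfy these require rules on their own.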
elasticsearch-py v8.0+ installed — this page uses the v8 client surface (ilm.put_lifecycle, indices.put_index_template, keyword-argument request bodies, typed ApiError), not the legacy body= pattern.
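A scoped automation identity holding cluster manage_ilm and manage_index_templates, provisioned per securing ILM policies with RBAC. ILM executes a policy's actions with the roles held by the user who last updated that policy, so this identity also needs the index privileges those actions exercise.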
An API key or service-account credential exported as an environment variable; never embed admin credentials in the deployment script.
Index patterns and a rollover alias naming convention finalised (for example observability-logs-* writing through the observability-logs alias), so the template and bootstrap index agree.
Access to GET <index>/_ilm/explain and GET _ilm/policy/<name> to verify attachment and diff deployed against committed configuration.

Architecture: From JSON Contract to Managed Index

A custom policy is not applied to indices directly. Three distinct API objects have to line up before the coordinator will manage anything: the policy (the phase state machine), the index template (the binding that injects index.lifecycle.name and the rollover alias into every new backing index), and the bootstrap index plus write alias (the first physical index the policy acts on). Miss any one and the policy exists but manages nothing — a policy with no template attaches to nothing, and a template with no bootstrapped write alias has nothing to roll over.

The sequence below is the deployment contract this page implements end to end: author the policy, bind it through a template, bootstrap the first write index, then confirm attachment through explain.

Because the coordinator wakes only every indices.lifecycle.poll_interval (10 minutes by default), none of these steps take effect instantly. Applying the policy, then immediately querying _ilm/explain and seeing "managed": true confirms the binding is correct even though the first phase evaluation may be minutes away. That polling delay is why API-driven deployment must be idempotent: re-running the same manifest is a no-op, so CI/CD can re-apply on every merge without racing the coordinator.
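The three-object dependency described above can be checked before deployment. The following is a minimal sketch, not part of any client library: find_chain_gaps is a hypothetical helper, and its inputs are assumed to be plain dicts shaped like the request bodies used on this page (flat setting keys) plus a GET _alias/<alias> style response.

```python
def find_chain_gaps(policy_names, template_bodies, alias_response, policy_name, alias):
    """Return human-readable gaps in the policy -> template -> write-alias chain.

    Hypothetical helper. Inputs are plain dicts and sets, so it can run in CI
    without a cluster connection.
    """
    gaps = []

    # 1. The policy itself must exist.
    if policy_name not in policy_names:
        gaps.append(f"policy '{policy_name}' does not exist")

    # 2. Some template must name both the policy and the rollover alias.
    bound = False
    for body in template_bodies:
        settings = body.get("template", {}).get("settings", {})
        if (settings.get("index.lifecycle.name") == policy_name
                and settings.get("index.lifecycle.rollover_alias") == alias):
            bound = True
            break
    if not bound:
        gaps.append(f"no index template binds '{policy_name}' to alias '{alias}'")

    # 3. Some index behind the alias must be flagged as the write index.
    # alias_response mirrors {index: {"aliases": {alias: {"is_write_index": bool}}}}.
    writers = [
        index for index, body in alias_response.items()
        if body.get("aliases", {}).get(alias, {}).get("is_write_index") is True
    ]
    if not writers:
        gaps.append(f"alias '{alias}' has no bootstrapped write index")

    return gaps
```

An empty list means all three objects line up; each string names one missing link, which makes the result easy to surface in a CI log.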

Configuration Reference

A production-grade policy needs explicit phase definitions, deterministic action ordering, and strict allocation constraints. The payload below drives a four-phase observability lifecycle; the notes that follow it cover the fields that carry the most operational risk. The tier routing here assumes the hot-warm-cold architecture that governs where each phase’s shards land.

PUT _ilm/policy/custom-observability-policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "1d"
          },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "3d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 },
          "allocate": {
            "number_of_replicas": 1,
            "require": { "data": "warm" }
          },
          "set_priority": { "priority": 50 }
        }
      },
      "cold": {
        "min_age": "14d",
        "actions": {
          "allocate": { "require": { "data": "cold" } },
          "set_priority": { "priority": 0 }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}

Three field-level decisions carry most of the operational risk:

max_primary_shard_size should track your JVM heap and merge capacity. Primaries much beyond 50 GB inflate segment-merge overhead and degrade range-query latency during concurrent search. Size the trigger so shards land in the 30–50 GB band under peak ingestion; getting this right is the same discipline covered in depth under configuring index rollover conditions.
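forcemerge is CPU- and I/O-intensive, and ILM makes the index read-only before merging. Keep it in the warm phase, where writes have stopped. ILM runs the actions of a phase in a fixed internal order, so in warm the merge follows shrink regardless of key order in the JSON. Running it against hot data competes with ingestion for the same resources.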
min_age is measured from rollover, not index creation. A warm min_age of 3d means three days after the index rolled out of hot — not three days after it was first created. Anchoring these ages to real data-value decay, rather than arbitrary defaults, is what keeps retention aligned with the business SLA.
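The min_age arithmetic can be made concrete with a short calculation. This is an illustrative sketch: parse_min_age and phase_entry_times are hypothetical helpers, only the ms, s, m, h, and d suffixes are handled, and the result is the earliest moment a phase becomes eligible, since the actual transition still waits for the next poll and for earlier actions to finish.

```python
from datetime import datetime, timedelta, timezone

# Suffix -> timedelta keyword. "ms" is listed before "s" so "0ms" parses correctly.
_UNITS = (("ms", "milliseconds"), ("s", "seconds"), ("m", "minutes"),
          ("h", "hours"), ("d", "days"))


def parse_min_age(value: str) -> timedelta:
    """Convert a time value such as '3d' or '0ms' into a timedelta."""
    for suffix, keyword in _UNITS:
        number = value[: -len(suffix)]
        if value.endswith(suffix) and number.isdigit():
            return timedelta(**{keyword: int(number)})
    raise ValueError(f"unsupported min_age value: {value!r}")


def phase_entry_times(rolled_over_at: datetime, min_ages: dict) -> dict:
    """Earliest eligibility time per phase, counted from the rollover timestamp."""
    return {phase: rolled_over_at + parse_min_age(age) for phase, age in min_ages.items()}


if __name__ == "__main__":
    rolled = datetime(2024, 1, 1, tzinfo=timezone.utc)
    for phase, when in phase_entry_times(
        rolled, {"warm": "3d", "cold": "14d", "delete": "30d"}
    ).items():
        print(f"{phase:>6}: {when.isoformat()}")
```

With the policy above, an index that rolls over on 1 January becomes eligible for warm on 4 January, cold on 15 January, and deletion on 31 January, whatever its creation date was.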

Binding the policy through an index template

Policies never auto-attach. An index template injects the lifecycle name and rollover alias into every future index matching the pattern, turning the policy into a contract for the whole observability-logs-* namespace.

PUT _index_template/observability-template
{
  "index_patterns": ["observability-logs-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "custom-observability-policy",
      "index.lifecycle.rollover_alias": "observability-logs",
      "number_of_shards": 2,
      "number_of_replicas": 1
    }
  }
}
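Because the binding depends entirely on the index name matching the template pattern, it can help to test candidate names before creating anything. The sketch below approximates pattern matching with Python's fnmatch, which is an assumption: Elasticsearch index patterns use only * as a wildcard, whereas fnmatch also gives ? and [ ] special meaning, so the comparison holds for simple patterns like the one on this page.

```python
from fnmatch import fnmatchcase


def matches_template(index_name: str, index_patterns: list) -> bool:
    """Approximate check that an index name would match a template's patterns."""
    return any(fnmatchcase(index_name, pattern) for pattern in index_patterns)


if __name__ == "__main__":
    patterns = ["observability-logs-*"]
    for name in ("observability-logs-000001", "observability-logs", "observability-metrics-000001"):
        print(name, matches_template(name, patterns))
```

Note that the bare alias name observability-logs does not match observability-logs-*, which is why the bootstrap index carries the numeric suffix.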

Step-by-Step Implementation

Follow the objects in dependency order — policy, template, bootstrap, verify. The first two calls are the JSON above; steps 3 and 4 complete the chain.

1. Apply the policy

PUT _ilm/policy/custom-observability-policy with the payload from the reference section. Re-applying an identical policy is a no-op, so this is safe to run on every deploy.

2. Bind the template

PUT _index_template/observability-template with the template body above. This must exist before the first backing index is created, or that index will be created unmanaged.

3. Bootstrap the first write index

The template governs future indices, but the very first one has to be created explicitly with the write alias so rollover has something to act on:

PUT observability-logs-000001
{
  "aliases": {
    "observability-logs": { "is_write_index": true }
  }
}

Set is_write_index: true explicitly. Rollover needs exactly one unambiguous write target behind the alias; once the alias spans more than one index and none is flagged, the rollover action fails with illegal_argument_exception because ILM cannot tell which physical index should receive writes.
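The numeric suffix on the bootstrap index matters because rollover derives the next index name from it: a name ending in a hyphen and a number is incremented and zero-padded to six digits. A small sketch of that naming rule follows; next_rollover_name is a hypothetical helper that mirrors the convention for illustration and is not a call into Elasticsearch.

```python
import re


def next_rollover_name(index_name: str) -> str:
    """Predict the index name rollover would generate from a numbered index name."""
    match = re.fullmatch(r"(.*-)(\d+)", index_name)
    if match is None:
        raise ValueError(f"{index_name!r} does not end in '-<number>'")
    prefix, counter = match.groups()
    # Increment the counter and zero-pad it to six digits.
    return f"{prefix}{int(counter) + 1:06d}"


if __name__ == "__main__":
    print(next_rollover_name("observability-logs-000001"))
```

A bootstrap index named without the trailing number raises ValueError here, which corresponds to the case where rollover cannot derive a new name on its own.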

4. Deploy the whole chain from Python

For CI/CD, the entire sequence collapses into one idempotent, retry-aware script. The v8 client methods map one-to-one onto the REST calls above.

import os
import logging
from elasticsearch import Elasticsearch, ApiError

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger(__name__)

def deploy_ilm_infrastructure() -> None:
    """Apply policy, bind template, bootstrap the write alias — idempotently."""
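    client = Elasticsearch(
        hosts=[os.getenv("ES_HOST", "https://localhost:9200")],
        # Scoped manage_ilm + manage_index_templates key; KeyError if the variable is unset.
        api_key=os.environ["ES_API_KEY"],
        verify_certs=True,
        max_retries=3,          # retry transient transport failures
        retry_on_timeout=True,  # treat request timeouts as retryable
    )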

    policy_name = "custom-observability-policy"
    template_name = "observability-template"
    alias = "observability-logs"

    policy = {
        "phases": {
            "hot": {"min_age": "0ms", "actions": {
                "rollover": {"max_primary_shard_size": "50gb", "max_age": "1d"},
                "set_priority": {"priority": 100}}},
            "warm": {"min_age": "3d", "actions": {
                "shrink": {"number_of_shards": 1},
                "forcemerge": {"max_num_segments": 1},
                "allocate": {"number_of_replicas": 1, "require": {"data": "warm"}},
                "set_priority": {"priority": 50}}},
            "cold": {"min_age": "14d", "actions": {
                "allocate": {"require": {"data": "cold"}},
                "set_priority": {"priority": 0}}},
            "delete": {"min_age": "30d", "actions": {"delete": {}}},
        }
    }

    try:
        # 1. Idempotent policy create/update (cluster manage_ilm).
        client.ilm.put_lifecycle(name=policy_name, policy=policy)
        logger.info("Policy '%s' applied.", policy_name)

        # 2. Bind the template (cluster manage_index_templates).
        client.indices.put_index_template(
            name=template_name,
            index_patterns=[f"{alias}-*"],
            template={
                "settings": {
                    "index.lifecycle.name": policy_name,
                    "index.lifecycle.rollover_alias": alias,
                    "number_of_shards": 2,
                    "number_of_replicas": 1,
                }
            },
        )
        logger.info("Index template '%s' bound to policy.", template_name)

        # 3. Bootstrap the first write index only if the alias does not yet exist.
        if not client.indices.exists_alias(name=alias):
            client.indices.create(
                index=f"{alias}-000001",
                aliases={alias: {"is_write_index": True}},
            )
            logger.info("Bootstrap index '%s-000001' created with write alias.", alias)

    except ApiError as exc:
        logger.error("ILM deployment failed: %s %s", exc.meta.status, exc.body)
        raise

if __name__ == "__main__":
    deploy_ilm_infrastructure()

This pattern reuses the scoped identity from the RBAC model and feeds directly into the scheduled orchestration described in automating phase transitions with Python, which keeps state deterministic across cluster restarts and rolling upgrades.

Verification

Applying a policy tells you nothing about whether it attached. Confirm the binding explicitly before trusting it.

First, read the deployed policy back and diff it against the committed source — this is how you catch drift between what version control says and what the deployment runs:

GET _ilm/policy/custom-observability-policy
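The diff itself can be a few lines of Python. This sketch assumes the deployed policy arrives as the parsed GET _ilm/policy/<name> response, keyed by policy name with the definition under a policy field; policy_drift is a hypothetical helper. It compares phases by strict equality, so any defaults the cluster adds to the stored definition would show up as drift and may need normalising first.

```python
def policy_drift(committed_policy: dict, deployed_response: dict, policy_name: str) -> list:
    """List per-phase differences between the committed policy and the deployed one."""
    entry = deployed_response.get(policy_name)
    if entry is None:
        return [f"policy '{policy_name}' is not deployed"]

    committed = committed_policy.get("phases", {})
    deployed = entry.get("policy", {}).get("phases", {})

    drift = []
    for phase in sorted(set(committed) | set(deployed)):
        if phase not in deployed:
            drift.append(f"{phase}: missing from the cluster")
        elif phase not in committed:
            drift.append(f"{phase}: not in version control")
        elif committed[phase] != deployed[phase]:
            drift.append(f"{phase}: definitions differ")
    return drift
```

An empty list means the deployed phases match the committed source; a CI job could fail the build on any non-empty result.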

Next, ask the write index whether it is actually managed and which phase it sits in:

GET observability-logs-000001/_ilm/explain

{
  "indices": {
    "observability-logs-000001": {
      "managed": true,
      "policy": "custom-observability-policy",
      "phase": "hot",
      "action": "rollover",
      "step": "check-rollover-ready"
    }
  }
}

"managed": true with the expected policy, phase, and a non-ERROR step proves the whole chain — policy, template, bootstrap — is wired correctly. A step of ERROR means the binding is fine but a phase action failed; jump to Troubleshooting. Finally, confirm shards are landing where the policy intends once data ages into warm or cold:

GET _cat/shards/observability-logs-*?v&h=index,shard,prirep,state,node
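The explain checks described above lend themselves to automation. The sketch below classifies each index in a parsed _ilm/explain response using only the fields shown in the sample output; classify_explain and its status labels are hypothetical names chosen for illustration.

```python
def classify_explain(explain_response: dict, expected_policy: str) -> dict:
    """Map each index in an _ilm/explain response to a coarse health label."""
    results = {}
    for name, info in explain_response.get("indices", {}).items():
        if not info.get("managed"):
            results[name] = "unmanaged"       # the template binding never applied
        elif info.get("policy") != expected_policy:
            results[name] = "wrong-policy"    # managed, but by a different policy
        elif info.get("step") == "ERROR":
            results[name] = "error"           # binding is fine, a phase action failed
        else:
            results[name] = "ok"
    return results
```

A deployment gate might accept only "ok" for every index and print the step_info of anything labelled "error".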

Threshold Tuning & Performance Guidance

Static thresholds fail under variable ingestion. Tune against empirical cluster metrics, not defaults:

Shard size versus query latency. max_primary_shard_size should rarely exceed 50 GB. Larger primaries increase merge overhead and degrade range queries. Watch indices.segments.memory and indices.fielddata.memory to validate that your sizing leaves heap headroom.
Phase-transition timing. min_age must reflect data-value decay. Log pipelines typically fall to warm at 3–7 days; compliance archives hold cold for 30–90 days. Anchor min_age to the retention SLA rather than a copied default.
Force-merge cost. max_num_segments: 1 gives the best query performance but is the most expensive to compute. For high-write indices, max_num_segments: 5 balances merge cost against search speed — and never place forcemerge in the hot phase.
Allocation filtering. The require: { data: warm } and require: { data: cold } targets must match node.attr.data values set on real nodes; a node role such as data_warm does not satisfy an attribute filter. If no eligible node has capacity, the index waits in its allocation check step rather than failing loudly. Where a tier can legitimately fill up, an ordered fallback routing strategy for data retention keeps shards allocated instead of leaving them unassigned.
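The shard-size guidance above reduces to simple arithmetic once an ingestion rate is known. The following sketch estimates which rollover trigger fires first; rollover_forecast is a hypothetical helper, and it assumes a constant ingestion rate spread evenly across primaries, which real traffic only approximates.

```python
def rollover_forecast(daily_primary_gb: float, primary_shards: int,
                      max_shard_gb: float = 50.0, max_age_days: float = 1.0) -> dict:
    """Estimate which rollover condition is met first and the resulting shard size."""
    if daily_primary_gb <= 0 or primary_shards <= 0:
        raise ValueError("ingestion rate and shard count must be positive")

    per_shard_per_day = daily_primary_gb / primary_shards
    days_to_size = max_shard_gb / per_shard_per_day

    if days_to_size < max_age_days:
        # The size limit is reached before the age limit.
        return {"trigger": "max_primary_shard_size", "days": days_to_size,
                "shard_gb": max_shard_gb}
    # The age limit is reached first; shards roll over below the size cap.
    return {"trigger": "max_age", "days": max_age_days,
            "shard_gb": per_shard_per_day * max_age_days}


if __name__ == "__main__":
    print(rollover_forecast(daily_primary_gb=60, primary_shards=2))
    print(rollover_forecast(daily_primary_gb=400, primary_shards=2))
```

At 60 GB a day over two primaries the one-day age trigger fires first with roughly 30 GB shards, inside the target band. At 400 GB a day the size trigger fires after about six hours, which suggests adding primaries or accepting more frequent rollovers.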

Troubleshooting

ILM failures rarely crash; they surface as stuck phases, allocation mismatches, or alias conflicts. Work the diagnostic flow in order.

Phase stuck with step: ERROR

Symptom: _ilm/explain shows the index parked with "step": "ERROR" and a step_info object describing the failure. Resolution:

Read the failure reason: GET observability-logs-000001/_ilm/explain and inspect step_info.type and step_info.reason — commonly a missing node attribute, insufficient disk, or a security_exception.
Fix the root cause (add the node role, free disk, or grant the missing privilege via RBAC).
Retry the failed step — this applies only to indices in the ERROR step and does not bypass min_age:

POST observability-logs-000001/_ilm/retry

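Index stalls in hot past its min_age

Symptom: the index will not move to warm even though its age exceeds the warm min_age. Resolution: first confirm the index has actually rolled over. min_age counts from rollover, so the current write index stays in hot until a rollover condition is met, however old it is. If it has rolled over, the destination tier most likely has no eligible node. Check node attributes with GET _cat/nodeattrs, then confirm roles and disk headroom: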

GET _cat/nodes?v&h=name,node.role,disk.used_percent

If the target role is missing, assign it in elasticsearch.yml and restart the node, or apply transient allocation settings through the _cluster/settings API. Persistent transition delays driven by disk watermarks or rebalancing contention are covered in troubleshooting ILM phase transition delays.

Forcing a transition after a corrected policy

To deliberately move an index from one step to another, for example to skip ahead after fixing a policy, use the move API. current_step must match the step the index is actually in, or the request is rejected. Name only the target phase in next_step so ILM starts at that phase's first step; jumping into the middle of an action such as shrink skips the preparatory steps it depends on. Use it sparingly and only after a verified fix:

POST _ilm/move/observability-logs-000001
{
  "current_step": { "phase": "hot", "action": "rollover", "name": "check-rollover-ready" },
  "next_step": { "phase": "warm" }
}

Rollover rejected by an alias conflict

Symptom: rollover fails because the write alias points to multiple indices or lacks is_write_index. Resolution: reset the alias so exactly one index is the write target:

POST _aliases
{
  "actions": [
    { "remove": { "index": "observability-logs-*", "alias": "observability-logs" } },
    { "add": { "index": "observability-logs-000001", "alias": "observability-logs", "is_write_index": true } }
  ]
}

FAQ

Why build ILM policies through the API instead of the Kibana editor?

Because a UI-authored policy lives only in cluster state — it has no reviewable diff, drifts between environments, and cannot be re-applied deterministically by CI/CD. Defining the policy as JSON and applying it with PUT _ilm/policy/... (or ilm.put_lifecycle) makes it version-controlled infrastructure: the same manifest applies idempotently to staging and production, every change is a pull request, and you can diff the deployed policy against source with GET _ilm/policy/<name>.

Why doesn't my policy manage any indices after I apply it?

Applying a policy attaches it to nothing on its own. Indices adopt a policy through the index.lifecycle.name setting, which an index template injects into every new backing index. If the template does not exist, or was created after the first index, that index is unmanaged. Confirm the binding with GET <index>/_ilm/explain and check for "managed": true; if it is false, verify the template pattern matches the index name and re-create the index or apply the setting explicitly.

What does the is_write_index flag on the bootstrap alias do?

It tells ILM which physical index behind the rollover alias currently receives writes. The rollover action creates a new index, then flips the write flag to it. Without an explicit write index, rollover has no unambiguous target once the alias spans several indices, and the transition is rejected with illegal_argument_exception. Every rollover-managed alias should have exactly one index flagged as the write index at any time.

Is re-running the deployment script safe on an existing policy?

Yes. ilm.put_lifecycle and put_index_template are idempotent — applying an identical definition is a no-op, and applying a changed one converges the deployment to the new state. The only non-idempotent step is bootstrapping the first index, which the script guards with an exists_alias check so it never tries to re-create an existing write alias. This is what lets CI/CD re-apply the full manifest on every merge without racing the coordinator.

When should I use _ilm/retry versus _ilm/move?

Use _ilm/retry to re-run a step that failed into the ERROR state after you have fixed its root cause — it resumes the same step and still respects min_age. Use _ilm/move only to deliberately jump an index from one step to a different one, for example to skip ahead after correcting a policy. move bypasses normal progression and can strand an index if the target step is wrong, so reserve it for verified manual interventions.

Related

Automating phase transitions with Python: schedule and orchestrate the policies you author here.
Monitoring ILM execution and error states: confirm a deployed policy actually runs and catch stuck steps.
Troubleshooting ILM phase transition delays: cluster-level tuning for stalls driven by watermarks and rebalancing.
Configuring index rollover conditions: size the max_primary_shard_size and max_age triggers this policy depends on.
Securing ILM policies with RBAC: the scoped manage_ilm identity that applies these policies safely.
