Building Custom ILM Policies via API

Index Lifecycle Management (ILM) is the operational backbone of production Elasticsearch data retention, yet default templates rarely survive contact with high-throughput ingestion pipelines or complex search workloads. Building custom ILM policies through the API shifts lifecycle control from static, UI-driven configuration to version-controlled, programmable infrastructure. This model lets teams align lifecycle phases with data value decay curves, storage tiering economics, and query performance SLAs. Policies defined declaratively through the REST API integrate directly into CI/CD pipelines, so engineering teams can enforce consistent ILM Policy Design & Lifecycle Synchronization across development, staging, and production clusters without manual drift.

The operational advantage lies in decoupling policy definition from index creation. By treating ILM as an API-driven contract, teams can dynamically adjust rollover thresholds, force-merge targets, and cold-tier allocation rules without interrupting active indexing. This approach establishes a clear audit trail for compliance and enables rapid rollback when mapping migrations or shard rebalancing introduce latency. For search engineers and log analytics teams, API-managed policies eliminate configuration drift and provide deterministic state transitions.

Deployment flow: put_lifecycle (policy) → put_index_template (bind policy + rollover alias) → bootstrap index-000001 (is_write_index) → explain_lifecycle (verify attachment)

Policy Payload & API Deployment

A production-grade ILM policy requires explicit phase definitions, deterministic action sequencing, and strict shard allocation constraints. The following payload demonstrates a hardened policy structure deployed via PUT _ilm/policy/custom-observability-policy:

curl -X PUT "localhost:9200/_ilm/policy/custom-observability-policy" \
-H "Content-Type: application/json" \
-d '{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "1d"
          },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "3d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 },
          "allocate": {
            "number_of_replicas": 1,
            "require": { "data": "warm" }
          },
          "set_priority": { "priority": 50 }
        }
      },
      "cold": {
        "min_age": "14d",
        "actions": {
          "allocate": {
            "require": { "data": "cold" }
          },
          "set_priority": { "priority": 0 }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}'

Key operational considerations:

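  • max_primary_shard_size caps the largest primary shard before rollover and should align with your cluster’s JVM heap and segment merge capacity. Shards well beyond 50GB take longer to recover and relocate after a node failure and often degrade query latency during concurrent searches.
  • forcemerge is CPU- and I/O-intensive and should run only against indices that no longer receive writes. Placing it in the warm phase, after rollover has moved writes to a new index, satisfies that condition.
  • set_priority dictates the order in which indices are recovered after a node restart; higher values recover first. Hot indices must retain the highest priority to preserve ingestion continuity.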
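
The considerations above can be checked mechanically before a policy is submitted. The following is a minimal sketch, not part of any Elasticsearch client: `lint_policy` and `to_millis` are hypothetical helper names, and the checks cover only phase ordering and the phases in which `rollover` and `forcemerge` are permitted.

```python
import re

# Elasticsearch time units mapped to milliseconds (sub-millisecond units omitted).
_UNIT_MS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}
_PHASE_ORDER = ["hot", "warm", "cold", "frozen", "delete"]


def to_millis(value: str) -> int:
    """Convert a time value such as '3d' or '0ms' to milliseconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)", value)
    if not match:
        raise ValueError(f"unsupported time value: {value!r}")
    return int(match.group(1)) * _UNIT_MS[match.group(2)]


def lint_policy(body: dict) -> list[str]:
    """Return human-readable findings; an empty list means nothing was flagged."""
    problems = []
    phases = body["policy"]["phases"]
    previous_age = -1
    for name in _PHASE_ORDER:
        phase = phases.get(name)
        if phase is None:
            continue
        age = to_millis(phase.get("min_age", "0ms"))
        if age < previous_age:
            problems.append(f"{name}: min_age is earlier than the preceding phase")
        previous_age = age
        actions = phase.get("actions", {})
        if "forcemerge" in actions and name not in ("hot", "warm"):
            problems.append(f"{name}: forcemerge is only allowed in hot and warm")
        if "rollover" in actions and name != "hot":
            problems.append(f"{name}: rollover is only allowed in hot")
    return problems
```

Running `lint_policy` on the payload shown earlier should return an empty list. A check like this is cheap enough to run in a pre-commit hook, before the cluster ever sees the request.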

Template Binding & Rollover Configuration

ILM policies do not auto-attach to indices. You must bind them via an index template that explicitly declares the lifecycle name and rollover alias. The template acts as a contract for all future indices matching the pattern.

curl -X PUT "localhost:9200/_index_template/observability-template" \
-H "Content-Type: application/json" \
-d '{
  "index_patterns": ["observability-logs-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "custom-observability-policy",
      "index.lifecycle.rollover_alias": "observability-logs",
      "number_of_shards": 2,
      "number_of_replicas": 1
    }
  }
}'

After template deployment, bootstrap the initial index and alias:

curl -X PUT "localhost:9200/observability-logs-000001" \
-H "Content-Type: application/json" \
-d '{
  "aliases": {
    "observability-logs": { "is_write_index": true }
  }
}'

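Setting is_write_index: true designates the bootstrap index as the target for writes through the alias. If the rollover alias does not point to the managed index, or the index is not the alias's write index, the rollover step fails and the index enters the ERROR step with an illegal_argument_exception.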
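
The `-000001` suffix matters because rollover derives the next index name from it: the trailing number is incremented and zero-padded to six digits. A small sketch of that naming rule, useful for predicting names in tests, is shown here. `next_rollover_name` is a hypothetical helper, and date-math index names are deliberately out of scope.

```python
import re


def next_rollover_name(index_name: str) -> str:
    """Predict the index name produced by rolling over `index_name`."""
    match = re.fullmatch(r"(.*-)(\d+)", index_name)
    if not match:
        raise ValueError(f"{index_name!r} does not end in -<number>")
    prefix, counter = match.groups()
    # The new counter is always six digits wide, whatever the old width was.
    return f"{prefix}{int(counter) + 1:06d}"


print(next_rollover_name("observability-logs-000001"))
```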

Python v8+ Client Orchestration

Production workflows require idempotent, retry-aware automation. The official elasticsearch Python client v8+ provides native ILM orchestration methods that map directly to the REST API. The following script deploys the policy, binds the template, and verifies state synchronization.

import os
import logging
from elasticsearch import Elasticsearch, exceptions

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")
logger = logging.getLogger(__name__)

def deploy_ilm_infrastructure():
    # Initialize v8 client with API key authentication
    client = Elasticsearch(
        hosts=[os.getenv("ES_HOST", "https://localhost:9200")],
        api_key=os.getenv("ES_API_KEY"),
        verify_certs=True
    )

    policy_name = "custom-observability-policy"
    template_name = "observability-template"
    alias = "observability-logs"

    policy_payload = {
        "policy": {
            "phases": {
                "hot": {"min_age": "0ms", "actions": {"rollover": {"max_primary_shard_size": "50gb", "max_age": "1d"}, "set_priority": {"priority": 100}}},
                "warm": {"min_age": "3d", "actions": {"shrink": {"number_of_shards": 1}, "forcemerge": {"max_num_segments": 1}, "allocate": {"number_of_replicas": 1, "require": {"data": "warm"}}, "set_priority": {"priority": 50}}},
                "cold": {"min_age": "14d", "actions": {"allocate": {"require": {"data": "cold"}}, "set_priority": {"priority": 0}}},
                "delete": {"min_age": "30d", "actions": {"delete": {}}}
            }
        }
    }

    try:
        # Idempotent policy creation/update
        # v8 clients accept request fields as keyword arguments
        client.ilm.put_lifecycle(name=policy_name, policy=policy_payload["policy"])
        logger.info(f"Policy '{policy_name}' deployed/updated successfully.")

        # Template binding
        template_payload = {
            "index_patterns": [f"{alias}-*"],
            "template": {
                "settings": {
                    "index.lifecycle.name": policy_name,
                    "index.lifecycle.rollover_alias": alias,
                    "number_of_shards": 2,
                    "number_of_replicas": 1
                }
            }
        }
        client.indices.put_index_template(name=template_name, **template_payload)
        logger.info(f"Index template '{template_name}' bound to policy.")

        # Bootstrap initial index if alias doesn't exist
        if not client.indices.exists_alias(name=alias):
            client.indices.create(
                index=f"{alias}-000001",
                aliases={alias: {"is_write_index": True}}
            )
            logger.info(f"Bootstrap index '{alias}-000001' created with write alias.")

    except exceptions.ApiError as e:
        logger.error(f"ILM deployment failed: {e.meta.status} - {e.error}")
        raise

if __name__ == "__main__":
    deploy_ilm_infrastructure()

This orchestration pattern aligns with Automating Phase Transitions with Python workflows, ensuring deterministic state application across cluster restarts and rolling upgrades.
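
The script above relies on the client's own transport-level retry behaviour. Where explicit control is wanted, a generic backoff wrapper is one option. The following is a sketch under stated assumptions: `call_with_backoff` is a hypothetical helper, and the default `retry_on` tuple uses Python's built-in exception types, so a real deployment would pass the client's connection and timeout exception classes instead.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    operation: Callable[[], T],
    *,
    attempts: int = 4,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying on transient errors with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))  # 0.5s, 1s, 2s, ...
    raise RuntimeError("unreachable")
```

Retrying is safe here because the policy and template PUT requests are idempotent: repeating them leaves the cluster in the same state. A call might look like `call_with_backoff(lambda: client.ilm.put_lifecycle(...))`, with `retry_on` set to the transport errors your client version raises.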

Threshold Tuning for Production Workloads

Static thresholds fail under variable ingestion loads. Tune ILM parameters based on empirical cluster metrics:

  1. Shard Size vs. Query Latency: max_primary_shard_size should rarely exceed 50GB. Larger shards increase segment merge overhead, slow recovery and relocation, and can degrade range query performance. Monitor the indices.fielddata and indices.segments sections of the node stats API to validate sizing.
  2. Phase Transition Timing: min_age must account for data value decay. Log pipelines often drop to warm at 3–7 days, while compliance archives require cold tiering at 30–90 days. Align min_age with business retention SLAs, not arbitrary defaults.
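  3. Force-Merge Impact: forcemerge spikes CPU and I/O and should not run while an index is still receiving writes. The action is permitted only in the hot and warm phases, so it cannot be deferred to cold. Use max_num_segments: 1 in the warm phase, after rollover has stopped writes to the index. Where merging large shards to a single segment takes too long, consider max_num_segments: 5 to balance merge cost and query performance.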
  4. Allocation Filtering: Node attributes (data: warm, data: cold) must be explicitly configured in elasticsearch.yml (node.attr.data: warm). ILM will stall if allocation targets lack matching nodes.
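
The interaction between shard size and rollover age in item 1 reduces to simple arithmetic. The sketch below estimates which rollover condition fires first. `rollover_estimate` is a hypothetical helper, and it assumes documents are spread evenly across primaries and that `daily_primary_gb` is the on-disk size of primary data written per day.

```python
def rollover_estimate(
    daily_primary_gb: float,
    primary_shards: int,
    max_primary_shard_gb: float = 50.0,
    max_age_days: float = 1.0,
) -> dict:
    """Estimate the first rollover trigger and the resulting primary shard size."""
    # Days until the largest primary would reach the size threshold.
    days_to_fill = (primary_shards * max_primary_shard_gb) / daily_primary_gb
    if days_to_fill < max_age_days:
        trigger, interval = "max_primary_shard_size", days_to_fill
    else:
        trigger, interval = "max_age", max_age_days
    return {
        "trigger": trigger,
        "interval_days": interval,
        "primary_shard_gb": daily_primary_gb * interval / primary_shards,
    }


# 40 GB/day over 2 primaries: the 1d age limit fires first, at about 20 GB per shard.
print(rollover_estimate(daily_primary_gb=40, primary_shards=2))
```

If the estimate shows max_age always firing with shards far below the size target, the index count grows faster than necessary, and a longer max_age or fewer primaries may be worth testing. ILM evaluates rollover conditions periodically rather than continuously, so real shards can overshoot these figures slightly.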

Validate configuration drift and execution health using Monitoring ILM Execution & Error States before promoting policies to production.

Real-World Debugging & Error Handling

ILM failures rarely manifest as outright crashes; they surface as stuck phases, allocation mismatches, or alias conflicts. Follow this diagnostic flow:

1. Inspect ILM State

curl -X GET "localhost:9200/observability-logs-000001/_ilm/explain"

Look for the phase, action, and step fields. A step of ERROR indicates a hard failure (e.g., missing node attribute, insufficient disk); the accompanying step_info object carries the failure type and reason.
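
When many indices match a pattern, scanning the explain output by eye does not scale. The following sketch filters an explain response down to the indices in the ERROR step. `find_stuck_indices` is a hypothetical helper that operates on the parsed JSON body, so it does not depend on how the response was fetched.

```python
def find_stuck_indices(explain_response: dict) -> list[dict]:
    """Summarize managed indices whose current ILM step is ERROR."""
    stuck = []
    for name, info in explain_response.get("indices", {}).items():
        if not info.get("managed") or info.get("step") != "ERROR":
            continue
        step_info = info.get("step_info", {})
        stuck.append(
            {
                "index": name,
                "phase": info.get("phase"),
                "action": info.get("action"),
                "failed_step": info.get("failed_step"),
                "reason": step_info.get("reason"),
            }
        )
    return stuck
```

Feeding it the body of `GET observability-logs-*/_ilm/explain` yields one summary per failed index, which is a convenient shape for alerting or for a CI gate.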

2. Resolve Allocation Blocks

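If indices remain in hot past min_age, first confirm that rollover has occurred: for rolled-over indices, min_age in later phases is measured from rollover time, not index creation. If an index stalls in an allocate step, verify node attributes and disk usage:

curl -X GET "localhost:9200/_cat/nodeattrs?v&h=node,attr,value"
curl -X GET "localhost:9200/_cat/nodes?v&h=name,node.role,disk.used_percent"

If the data attribute is missing, set node.attr.data in elasticsearch.yml on the target nodes and restart them. Node attributes cannot be set through the cluster settings API; as a stopgap, relax the index-level filter (index.routing.allocation.require.data) through the update index settings API.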
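
The attribute check can be automated by comparing what the policy requires with what the nodes advertise. This sketch assumes the node attribute rows come from `GET _cat/nodeattrs?format=json`, where each row includes `attr` and `value` keys. `unsatisfied_values` is a hypothetical helper, and it inspects only `require` filters inside `allocate` actions.

```python
def required_attribute_values(body: dict, attr: str = "data") -> set:
    """Collect the values a policy's allocate actions require for `attr`."""
    values = set()
    for phase in body["policy"]["phases"].values():
        require = phase.get("actions", {}).get("allocate", {}).get("require", {})
        if attr in require:
            values.add(require[attr])
    return values


def unsatisfied_values(body: dict, nodeattrs_rows: list, attr: str = "data") -> list:
    """Return required attribute values that no node currently advertises."""
    available = {row["value"] for row in nodeattrs_rows if row.get("attr") == attr}
    return sorted(required_attribute_values(body, attr) - available)
```

A non-empty result means at least one phase will stall at its allocation step, so it is worth failing a deployment pipeline on it.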

3. Recover or Force a Transition

To re-run a failed step after fixing its root cause, retry the affected index (this applies only to indices in the ERROR step and does not bypass min_age):

curl -X POST "localhost:9200/observability-logs-000001/_ilm/retry"

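To deliberately force an index from one step to another (for example, to skip ahead after a corrected policy), use the move API with explicit current and next steps. The current_step must match the step the explain API reports for the index, or the request is rejected: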

curl -X POST "localhost:9200/_ilm/move/observability-logs-000001" \
-H "Content-Type: application/json" \
-d '{
  "current_step": { "phase": "hot", "action": "rollover", "name": "check-rollover-ready" },
  "next_step": { "phase": "warm", "action": "shrink", "name": "shrink" }
}'

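Use the move API sparingly and only after verified configuration fixes. Moving directly into the middle of an action, as with the shrink step in this example, skips the steps that normally prepare for it, so confirm the result with the explain API afterwards.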
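
Because the move request must name the index's actual current step, it is safer to derive that part of the payload from the explain output than to type it by hand. A minimal sketch, assuming `explain_entry` is the per-index object from the explain response; `build_move_request` is a hypothetical helper, and the caller still chooses the target step.

```python
def build_move_request(explain_entry: dict, next_step: dict) -> dict:
    """Build a move-to-step body whose current_step mirrors the explain output."""
    if not explain_entry.get("managed"):
        raise ValueError("index is not managed by ILM")
    missing = [key for key in ("phase", "action", "step") if not explain_entry.get(key)]
    if missing:
        raise ValueError(f"explain entry lacks: {', '.join(missing)}")
    return {
        "current_step": {
            "phase": explain_entry["phase"],
            "action": explain_entry["action"],
            "name": explain_entry["step"],  # explain calls it "step"; move calls it "name"
        },
        "next_step": dict(next_step),
    }
```

The returned dictionary can be serialized and posted to `_ilm/move/<index>` as in the curl example above.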

4. Handle Alias Conflicts

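Rollover fails if the rollover alias points to multiple indices with no designated write index, or if the managed index is not the alias's write index. Reset the alias state so that exactly one index, normally the newest in the series, is writable (observability-logs-000001 in this example):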

curl -X POST "localhost:9200/_aliases" \
-H "Content-Type: application/json" \
-d '{
  "actions": [
    { "remove": { "index": "observability-logs-*", "alias": "observability-logs" } },
    { "add": { "index": "observability-logs-000001", "alias": "observability-logs", "is_write_index": true } }
  ]
}'
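
The reset above removes the alias from every older index, which also removes them from searches through the alias. An alternative sketch keeps all indices on the alias and marks only the highest-numbered one as writable. It assumes the input is the parsed body of `GET _alias/observability-logs` and that index names end in a numeric suffix; `build_alias_repair` is a hypothetical helper.

```python
import re


def build_alias_repair(alias_response: dict, alias: str) -> dict:
    """Build an _aliases body that leaves exactly one write index: the newest."""
    numbered = []
    for index, info in alias_response.items():
        match = re.fullmatch(r".*-(\d+)", index)
        if match and alias in info.get("aliases", {}):
            numbered.append((int(match.group(1)), index))
    if not numbered:
        raise ValueError(f"no numbered indices carry alias {alias!r}")
    newest = max(numbered)[1]
    actions = [
        {"add": {"index": index, "alias": alias, "is_write_index": index == newest}}
        for _, index in sorted(numbered)
    ]
    return {"actions": actions}
```

The resulting body is posted to `_aliases` in a single request, so the alias never passes through a state with zero or two write indices.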

Persistent transition bottlenecks often stem from disk watermark thresholds or shard rebalancing contention. Review Troubleshooting ILM Phase Transition Delays for cluster-level tuning strategies.

CI/CD Integration & Audit Compliance

Treat ILM policies as infrastructure code. Store JSON payloads in version control, validate them before deployment (a staging cluster rejects malformed policies on PUT), and run integration tests in staging clusters. Use the GET _ilm/policy/<name> endpoint to diff deployed vs. committed configurations. This practice makes lifecycle states reproducible and supports compliance requirements for data retention auditing.
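
A naive equality check between the committed file and the GET response tends to report false drift, because the response wraps the policy in metadata such as version and can include defaults the cluster filled in. The sketch below compares phase and action names exactly but tolerates extra settings on the deployed side. `policy_drifted` and `is_subset` are hypothetical helpers, and the comparison rules are an assumption to adapt to your own tolerance for drift.

```python
def is_subset(expected, actual) -> bool:
    """True when every key/value in `expected` also appears in `actual`."""
    if isinstance(expected, dict):
        return isinstance(actual, dict) and all(
            key in actual and is_subset(value, actual[key])
            for key, value in expected.items()
        )
    return expected == actual


def policy_drifted(committed_body: dict, get_response: dict, name: str) -> bool:
    """Compare a committed policy body with the GET _ilm/policy/<name> response."""
    deployed = get_response.get(name, {}).get("policy")
    if deployed is None:
        return True  # policy is not deployed at all
    want = committed_body["policy"]["phases"]
    have = deployed.get("phases", {})
    if set(want) != set(have):
        return True  # a phase was added or removed
    for phase in want:
        if set(want[phase].get("actions", {})) != set(have[phase].get("actions", {})):
            return True  # an action was added or removed
    return not is_subset(want, have)
```

A CI job can fail the build when `policy_drifted` returns True, prompting either a redeploy or a commit that captures the out-of-band change.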

For authoritative reference on ILM action semantics and cluster configuration, consult the official Elasticsearch Index Lifecycle Management documentation and the Python Client API Reference.