Securing ILM Policies with RBAC

Index Lifecycle Management runs as an always-on background coordinator with the authority to delete indices, relocate shards across node tiers, and rewrite index settings — all without a human in the loop. That is exactly what makes it dangerous when access is unscoped. A single over-privileged token can drop a production policy, force an index into its delete phase early, or reroute hot shards onto cold nodes, and because the coordinator acts on the token’s effective privileges at execution time, the blast radius is the whole retention pipeline. The specific challenge this page solves is drawing hard execution boundaries around lifecycle automation: separating who may author a policy from who may apply it, isolating lifecycle triggers from general index writes, and binding every automated action to a scoped service account rather than a human’s credentials.

This governance layer sits inside the broader control plane described in Index Lifecycle Management (ILM). Where that guide explains how phase transitions fire, this one explains who is allowed to make them fire — and how to prove, before a deployment ships, that a token holds exactly the privileges it needs and nothing more.

Prerequisites

Elasticsearch 8.x cluster with the security features enabled (xpack.security.enabled: true) and TLS configured on the HTTP layer — API keys and role bindings are unavailable on an unsecured cluster.
elasticsearch-py v8.0+ installed — this page uses the v8 client surface (security.put_role, security.create_api_key, keyword-argument request bodies, typed ApiError), not the legacy body= pattern.
A manage_security account (or the built-in elastic superuser) available only for the one-time role bootstrap — it must never be reused for routine automation.
Index patterns finalised (for example logs-app-*, metrics-infra-*) so index privileges are scoped to the namespaces a given pipeline owns, not granted cluster-wide.
An index template that attaches policies through index.lifecycle.name, so applying a policy is a template operation gated by manage_index_templates rather than an ad-hoc per-index setting change.
Access to GET _security/_authenticate and GET _security/role/<name> to verify effective privileges before and after each change.

Architecture: Separating Authoring from Application

ILM security rests on one principle: the privilege to define a lifecycle policy and the privilege to apply it to live data are different, and should belong to different identities. Policy authoring is a privileged cluster-level act governed by manage_ilm; attaching a policy to indices is a template act governed by manage_index_templates; and the day-to-day writes that feed the managed indices are index-level acts governed by write on a specific pattern. Collapsing these into a single admin role is the root cause of most ILM security incidents, because it lets an application token that only needs to index documents also delete policies and reroute shards.

The coordinator evaluates the acting token’s privileges at the moment a phase step runs, not when the policy was created. That timing is the crux of the model: a policy authored by a well-scoped CI account can still fail at execution if the service account carrying the write alias lacks manage on the target pattern. This is why configuring index rollover conditions and RBAC are inseparable — the rollover step needs manage on the write alias, and the hot-warm-cold architecture migration steps need manage on the pattern being relocated. Get the roles wrong and phases stall silently with a security_exception buried in _ilm/explain.

Three identity classes map cleanly onto three privilege sets:

Authoring identity (CI/CD). Holds manage_ilm and manage_index_templates as cluster privileges. It defines policies and binds them to templates, but its index privileges are read-only — it never writes documents.
Execution identity (the managed pipeline). Holds manage and write scoped to its own index patterns, so ILM can roll, shrink, force-merge, and relocate the indices that identity owns without reaching into any other namespace.
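Observer identity (operators, dashboards). Holds read_ilm and monitor at the cluster level, plus view_index_metadata on the patterns it watches, because _ilm/explain requires an index-level privilege. That is enough to inspect policies and phase progression, with no authority to mutate anything.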
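
The three identity classes can be sketched as plain role bodies, with a small check that no single role mixes authoring and execution privileges. The role names (ilm_author, ilm_executor, ilm_observer) and the helper overlapping_roles are hypothetical. The observer's view_index_metadata grant is an assumption added here so that it can call _ilm/explain.

```python
# Hypothetical role names; privilege sets follow the three identity classes.
ROLES = {
    "ilm_author": {
        "cluster": ["manage_ilm", "manage_index_templates"],
        "indices": [{"names": ["logs-app-*"], "privileges": ["read"]}],
    },
    "ilm_executor": {
        "cluster": [],
        "indices": [{"names": ["logs-app-*"], "privileges": ["manage", "write"]}],
    },
    "ilm_observer": {
        "cluster": ["read_ilm", "monitor"],
        # Assumption: index-level metadata access so the observer can call _ilm/explain.
        "indices": [{"names": ["logs-app-*"], "privileges": ["view_index_metadata"]}],
    },
}

AUTHORING_CLUSTER = {"manage_ilm", "manage_index_templates"}
EXECUTION_INDEX = {"manage", "write"}


def overlapping_roles(roles: dict) -> list:
    """Return the names of roles that hold both authoring and execution privileges."""
    flagged = []
    for name, body in roles.items():
        authors = bool(AUTHORING_CLUSTER & set(body.get("cluster", [])))
        executes = any(
            EXECUTION_INDEX & set(grant.get("privileges", []))
            for grant in body.get("indices", [])
        )
        if authors and executes:
            flagged.append(name)
    return flagged
```

Running overlapping_roles over the role bodies kept in version control gives a cheap review gate. It compares literal privilege names only, so a broader privilege that implies one of these is not detected.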

Configuration Reference

The role below is the execution identity: scoped to two index namespaces, with the exact cluster privileges ILM needs and nothing broader. Each field is annotated with the capability it unlocks.

PUT /_security/role/ilm_automation_role
{
  "cluster": [
    "manage_ilm",
    "manage_index_templates",
    "monitor"
  ],
  "indices": [
    {
      "names": ["logs-app-*", "metrics-infra-*"],
      "privileges": [
        "manage",
        "write",
        "read"
      ]
    }
  ]
}

A few privilege distinctions trip up most first deployments:

manage on an index does not grant cluster-wide ILM control. An index-level manage privilege lets the token roll, shrink, and force-merge its own indices, but creating or editing the policy object itself requires the top-level cluster privilege manage_ilm. A token with only index manage will get 403 on PUT _ilm/policy/....
manage_index_templates is required to attach a policy. Because policies are bound through index.lifecycle.name in an index template, a token that can author a policy but lacks manage_index_templates cannot wire it to any index: the whole PUT _index_template/... call is refused with 403.
read_ilm is a read-only cluster privilege, distinct from manage_ilm. Grant it to operators and dashboards so they can read policies and ILM status without being able to stop a policy or force a transition. Calling _ilm/explain additionally needs the index privilege view_index_metadata on the indices being inspected.
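
These distinctions can be checked mechanically before a role body is ever sent to the cluster. The sketch below is a hypothetical lint_role helper, not part of any client library. It compares literal privilege names, so it will not recognise a broader privilege that happens to imply manage_ilm.

```python
REQUIRED_CLUSTER = {
    "manage_ilm": "PUT _ilm/policy/... will return 403",
    "manage_index_templates": "a policy cannot be attached through an index template",
}
DEPLOYMENT_WIDE = {"*", "_all"}


def lint_role(role: dict) -> list:
    """Return human-readable findings for a role body; an empty list means no findings."""
    findings = []
    cluster = set(role.get("cluster", []))
    for privilege, consequence in REQUIRED_CLUSTER.items():
        if privilege not in cluster:
            findings.append(f"missing cluster privilege {privilege}: {consequence}")
    for grant in role.get("indices", []):
        for pattern in grant.get("names", []):
            if pattern.strip() in DEPLOYMENT_WIDE:
                findings.append(f"index pattern {pattern!r} is not scoped to a namespace")
    return findings
```

Passing the ilm_automation_role body from the configuration reference yields no findings. A role with only index-level manage on * yields three.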

Policies applied through this role should follow a strict, version-controlled schema so that automated deployments do not drift between clusters. A safe baseline structure to start from is the ILM policy JSON template for beginners, which pins the fields ILM validates before it accepts a policy.

Step-by-Step Implementation

Provisioning follows a deterministic order: create the scoped role, bind it to a headless service account, then mint a short-lived API key for CI/CD that further narrows the role at the key level. The Python v8+ script below automates the sequence with typed error handling, and never reuses human credentials for automation.

import os
import logging
from elasticsearch import Elasticsearch, ApiError

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger(__name__)

# v8 client — the admin identity is used ONCE, only for the bootstrap.
admin = Elasticsearch(
    os.getenv("ES_HOST", "https://localhost:9200"),
    basic_auth=(os.getenv("ES_ADMIN_USER"), os.getenv("ES_ADMIN_PASS")),
    ca_certs="/path/to/http_ca.crt",
    verify_certs=True,
)

def provision_ilm_execution_identity() -> None:
    """Create a scoped role, bind a headless service account, mint a CI API key."""
    role_name = "ilm_automation_role"
    user_name = "svc_ilm_automation"

    try:
        # 1. Define cluster + index privileges (least privilege).
        admin.security.put_role(
            name=role_name,
            cluster=["manage_ilm", "manage_index_templates", "monitor"],
            indices=[
                {
                    "names": ["logs-app-*", "metrics-infra-*"],
                    "privileges": ["manage", "write", "read"],
                }
            ],
        )
        logger.info("Role '%s' created.", role_name)

        # 2. Bind the role to a headless service account — never a human user.
        admin.security.put_user(
            username=user_name,
            password=os.getenv("SVC_ILM_PASSWORD"),
            roles=[role_name],
            full_name="ILM Automation Service",
            enabled=True,
        )
        logger.info("Service account '%s' provisioned.", user_name)

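        # 3. Mint a CI/CD API key ON BEHALF OF the service account, so the key's
        #    effective grant is role_descriptors intersected with that account's
        #    role. Calling create_api_key on the admin client would derive the
        #    key from the superuser instead. The expiry lets a leaked key age out.
        api_key = admin.security.grant_api_key(
            grant_type="password",
            username=user_name,
            password=os.getenv("SVC_ILM_PASSWORD"),
            api_key={
                "name": f"{user_name}_ci_key",
                "expiration": "30d",
                "role_descriptors": {
                    role_name: {
                        "cluster": ["manage_ilm", "manage_index_templates", "monitor"],
                        "indices": [
                            {"names": ["logs-app-*"], "privileges": ["manage", "write"]}
                        ],
                    }
                },
            },
        )
        # api_key["encoded"] is the secret CI needs: store it in a secret manager, never log it.
        logger.info("API key '%s' generated (expires in 30d).", api_key["id"])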

    except ApiError as exc:
        logger.error("RBAC provisioning failed: %s %s", exc.meta.status, exc.body)
        raise

if __name__ == "__main__":
    provision_ilm_execution_identity()

Two decisions in that script matter more than the rest. First, role_descriptors on the API key is an intersection, not an addition: the key’s effective grant is whatever appears both in its descriptors and in the privileges of the identity that owns the key. Narrowing the descriptor (dropping metrics-infra-*, dropping read) shrinks the key’s reach without touching the role, while listing a privilege the owner lacks grants nothing. Second, expiration="30d" bounds the damage of a leaked key: align it to your deployment cadence so keys rotate as a matter of routine rather than being long-lived secrets. The same scoped identity is what drives the programmatic policy work in building custom ILM policies via the API and the scheduled transitions in automating phase transitions with Python.
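
Because the intersection is silent, a descriptor that asks for more than its backing role grants fails only at request time. The sketch below is a hypothetical pre-flight check, descriptor_excess, that lists such grants. It is a simplified model: privileges are compared by literal name and index patterns by glob matching, which is sufficient for plain * wildcards but is not how Elasticsearch itself resolves privileges.

```python
from fnmatch import fnmatchcase


def descriptor_excess(descriptor: dict, role: dict) -> list:
    """List grants in an API-key descriptor that the backing role does not cover."""
    excess = []
    role_cluster = set(role.get("cluster", []))
    for privilege in descriptor.get("cluster", []):
        if privilege not in role_cluster:
            excess.append(f"cluster:{privilege}")
    for grant in descriptor.get("indices", []):
        for pattern in grant.get("names", []):
            for privilege in grant.get("privileges", []):
                # Covered when some role grant holds the privilege on a pattern
                # that matches the descriptor's pattern text.
                covered = any(
                    privilege in role_grant.get("privileges", [])
                    and any(fnmatchcase(pattern, rp) for rp in role_grant.get("names", []))
                    for role_grant in role.get("indices", [])
                )
                if not covered:
                    excess.append(f"index:{pattern}:{privilege}")
    return excess
```

An empty result means every entry in the descriptor is backed by the role. Any entry returned is a grant the key would request but never effectively hold.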

Verification

Never assume a role grants what you intended — inspect the effective privileges. First, confirm which identity a token actually resolves to and which roles it carries:

GET _security/_authenticate

The response echoes the authenticated username and its roles array. If an automation call is failing with 403, this is the first check: a token authenticating as the wrong user, or carrying no roles, explains most security_exception failures before you ever look at the policy.
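
The same check can run as a gate in a deployment script. This is a minimal sketch, assuming an elasticsearch-py v8 client whose security.authenticate() call returns the username and roles fields described above. The helper names identity_problems and check_identity are hypothetical.

```python
def identity_problems(auth: dict, expected_user: str, expected_roles: set) -> list:
    """Compare an _authenticate response with the identity a job is meant to run as."""
    problems = []
    if auth.get("username") != expected_user:
        problems.append(
            f"authenticated as {auth.get('username')!r}, expected {expected_user!r}"
        )
    roles = set(auth.get("roles", []))
    if not roles:
        problems.append("token carries no roles")
    elif expected_roles - roles:
        problems.append(f"missing roles: {sorted(expected_roles - roles)}")
    return problems


def check_identity(client, expected_user: str, expected_roles: set) -> list:
    """Call GET _security/_authenticate through a v8 client and evaluate the result."""
    response = client.security.authenticate()
    return identity_problems(response.body, expected_user, expected_roles)
```

An empty list means the token resolves to the intended account with the intended roles. Anything else is worth fixing before looking at the policy itself.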

Next, read the role back and confirm the returned cluster array explicitly includes manage_ilm. Index-level manage shows up only under indices and confers no authority over policies:

GET _security/role/ilm_automation_role

{
  "ilm_automation_role": {
    "cluster": ["manage_ilm", "manage_index_templates", "monitor"],
    "indices": [
      {
        "names": ["logs-app-*", "metrics-infra-*"],
        "privileges": ["manage", "write", "read"]
      }
    ]
  }
}

Finally, prove the privilege is sufficient for the actual step ILM will run. Use the has-privileges API to ask the deployment a yes/no question rather than waiting for a phase to stall:

POST /_security/user/_has_privileges
{
  "cluster": ["manage_ilm"],
  "index": [
    { "names": ["logs-app-*"], "privileges": ["manage"] }
  ]
}

A response of "has_all_requested": true confirms the token can both manage policies and manage the target indices — the two privileges every rollover and tier-migration step depends on. Run this from the service account’s own credentials, not the admin’s, so you are testing the identity that will actually execute.
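
When has_all_requested is false, the per-privilege booleans in the response say exactly what is missing. The sketch below assumes an elasticsearch-py v8 client authenticated as the service account. The helpers missing_privileges and ilm_privilege_gaps are hypothetical names.

```python
def missing_privileges(response: dict) -> list:
    """Flatten a has-privileges response into the grants that came back false."""
    missing = [
        f"cluster:{privilege}"
        for privilege, granted in response.get("cluster", {}).items()
        if not granted
    ]
    for index_name, privileges in response.get("index", {}).items():
        missing += [
            f"index:{index_name}:{privilege}"
            for privilege, granted in privileges.items()
            if not granted
        ]
    return missing


def ilm_privilege_gaps(client, pattern: str) -> list:
    """Ask the cluster whether the calling token can author policies and manage `pattern`."""
    response = client.security.has_privileges(
        cluster=["manage_ilm"],
        index=[{"names": [pattern], "privileges": ["manage"]}],
    )
    return missing_privileges(response.body)
```

A CI step can fail the deployment whenever ilm_privilege_gaps(client, "logs-app-*") returns a non-empty list, and print the list as the reason.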

Threshold Tuning & Operational Guidance

RBAC has no shard-sizing knobs, but it does have operational tuning that governs blast radius and audit cost:

Scope index patterns as narrowly as the pipeline allows. A role granted manage on * can relocate or delete every managed index in the deployment. Prefer per-namespace patterns (logs-app-*) and split noisy tenants into their own roles so one compromised key cannot touch another team’s data.
Set API-key expirations to the deployment cadence, not “never”. A key that rotates every 30 days caps how long a leaked credential stays valid. Long-lived keys accumulate as unaudited standing access; treat a missing expiration as a finding.
Enable audit logging for security events. Set xpack.security.audit.enabled: true and restrict xpack.security.audit.logfile.events.include to access_denied and run_as_denied. These events record exactly which token was refused which privilege, turning a silent stalled phase into a searchable audit trail, and limiting the log to those event types keeps its volume small.
Keep the bootstrap superuser out of automation entirely. The elastic account should provision roles once and then be locked away. Every recurring job should authenticate as a scoped service account, so audit logs attribute actions to a named identity rather than to a shared superuser.
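
The expiration guidance can be enforced with a periodic sweep over existing keys. This is a sketch under stated assumptions: it takes the api_keys list returned by the get-API-key endpoint, where creation and expiration are epoch milliseconds and expiration is absent for keys that never expire. risky_keys is a hypothetical helper name, and the 30-day ceiling mirrors the cadence used above.

```python
from datetime import timedelta


def risky_keys(api_keys: list, max_lifetime: timedelta = timedelta(days=30)) -> list:
    """Return (name, reason) pairs for active keys with no expiry or an overlong lifetime."""
    findings = []
    for key in api_keys:
        if key.get("invalidated"):
            continue  # already revoked, no standing access
        expiration = key.get("expiration")
        if expiration is None:
            findings.append((key["name"], "no expiration"))
            continue
        lifetime = timedelta(milliseconds=expiration - key["creation"])
        if lifetime > max_lifetime:
            findings.append(
                (key["name"], f"lifetime {lifetime.days}d exceeds {max_lifetime.days}d")
            )
    return findings
```

With a v8 client the input would typically come from client.security.get_api_key().body["api_keys"], called by an identity allowed to list keys.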

Troubleshooting

ILM phase stuck in `hot`/`rollover` with a `security_exception`

Symptom: _ilm/explain shows the index parked on a rollover or allocate step, and step_info references a security_exception or the step never advances. Resolution:

Authenticate as the acting token: GET _security/_authenticate — confirm it resolves to the service account, not an expired human login.
Inspect the stalled step: GET logs-app-2024.01.01/_ilm/explain and read step_info for the missing privilege or the referenced index.lifecycle.rollover_alias.
Confirm the token holds manage on the target pattern — without it the coordinator cannot execute POST /<index>/_rollover.
After granting the privilege, re-run the failed step: POST /logs-app-2024.01.01/_ilm/retry.
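
The resolution steps above can be sketched as a small script. It assumes the explain response marks a failed index with step set to ERROR and carries the exception type and reason in step_info, and that the client is an elasticsearch-py v8 instance. security_stalls and retry_stalled are hypothetical helper names.

```python
def security_stalls(explain: dict) -> dict:
    """Map each index stalled on a security_exception to its failed step and reason."""
    stalled = {}
    for index_name, state in explain.get("indices", {}).items():
        info = state.get("step_info") or {}
        if state.get("step") == "ERROR" and info.get("type") == "security_exception":
            stalled[index_name] = f"{state.get('failed_step')}: {info.get('reason')}"
    return stalled


def retry_stalled(client, pattern: str) -> dict:
    """Find security stalls under `pattern` and retry each one.

    Run this only after the missing privilege has been granted; otherwise the
    retried step fails again with the same exception.
    """
    explain = client.ilm.explain_lifecycle(index=pattern).body
    stalled = security_stalls(explain)
    for index_name in stalled:
        client.ilm.retry(index=index_name)
    return stalled
```

The returned mapping doubles as a report of which indices were retried and why they had stalled.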

`403 Forbidden` on policy `PUT`/update from CI/CD

Symptom: the pipeline fails to apply a lifecycle definition despite valid credentials. Resolution:

Read the role back: GET _security/role/ilm_automation_role and verify the cluster array explicitly includes manage_ilm. Index manage alone does not grant policy authority.
Confirm manage_index_templates is present. Without it, PUT _index_template/... is refused with 403, so the policy can never be attached.
If using an API key, check that role_descriptors did not narrow away the privilege you need; the key’s effective grant is the intersection of the key descriptor and the backing role.

Unauthorized shard reallocation fails during a cold transition

Symptom: indices refuse to route to cold nodes despite correct index.routing.allocation.require settings. Resolution:

Check allocation reasoning: GET _cluster/allocation/explain.
Verify node attributes match the policy’s routing rules and that the acting token holds manage on the pattern being relocated — the coordinator evaluates privileges at execution time, not at policy creation.
Grant explicit manage to the execution role and retry the phase. Persistent routing fallbacks are covered under fallback routing for data retention.

FAQ

Why separate the authoring role from the execution role?

Because they need different privileges and carry different risk. Authoring a policy is a privileged cluster act (manage_ilm, manage_index_templates) performed by a CI account that never writes documents; executing the lifecycle is an index act (manage, write) performed by the pipeline that owns the data. Collapsing both into one admin role means an application token that only needs to index logs can also delete policies and reroute shards. Splitting them keeps each identity’s blast radius bounded to what it legitimately does.

Does index-level `manage` let a token edit ILM policies?

No. Index manage lets a token roll, shrink, and force-merge the indices it matches, but the policy object itself is a top-level cluster resource. Creating or updating a policy with PUT _ilm/policy/... requires the top-level cluster privilege manage_ilm, and attaching it via a template requires manage_index_templates. A token with only index manage receives 403 on those calls — verify the distinction with GET _security/role/<name>.

Should automation use API keys or a service account password?

Use an API key derived from a service account for CI/CD. The key can narrow the account’s role further through role_descriptors (the effective grant is the intersection of key and role) and carries an expiration, so a leaked key ages out on its own. The service account password should exist only to mint and rotate keys, never to authenticate day-to-day jobs. Human credentials should never appear in an automation pipeline at all.

How do I confirm a token has enough privilege before a phase runs?

Call the has-privileges API from the token’s own credentials: POST /_security/user/_has_privileges with the cluster and index privileges the step needs. A "has_all_requested": true response proves the token can both manage the policy and manage the target indices before ILM ever tries. This turns a silent stalled phase into a pre-deployment check, and is far faster than waiting for the next poll interval to surface a security_exception.

Why does ILM check privileges at execution time instead of policy creation?

Because the coordinator runs each phase step as the identity that owns the managed index, and that identity may differ from whoever authored the policy. A policy created by a CI account can still fail its rollover if the pipeline’s service account lacks manage on the write alias. Execution-time evaluation is what makes least-privilege enforcement real — but it also means you must verify the execution identity’s privileges, not just the author’s, with GET _security/_authenticate and the has-privileges API.

Related

Configuring index rollover conditions: the rollover step needs manage on the write alias, so its privileges and this role must line up.
Understanding hot-warm-cold architecture: tier-migration steps that require manage on the pattern being relocated.
ILM policy JSON template for beginners: a version-controlled baseline policy safe for a scoped account to apply.
Building custom ILM policies via the API: the programmatic authoring this role is scoped to perform.
Monitoring ILM execution and error states: where a security_exception from a missing privilege first surfaces.
