Setting Up ILM for Multi-Tenant Log Analytics

Give every tenant its own index template, write alias, and lifecycle binding so a single tenant’s ingest spike cannot skew shard distribution, breach a shared size cap early, or stall another tenant’s phase transitions.

A shared cluster serving many log-producing tenants is where a naive one-policy-fits-all lifecycle breaks down first. Divergent ingest volumes, retention SLAs, and query patterns collide: a single monolithic policy caps every tenant at the same shard size, ages every index on the same clock, and lets the busiest tenant dictate rollover timing for everyone. The fix is namespace isolation at the template layer, feeding tenant-scoped rollover conditions into a single reusable policy. This page sits inside the rollover-conditions workflow of Elasticsearch ILM Architecture & Fundamentals: rollover is still the first lifecycle transition, but here it is scoped per tenant so thresholds never collide.

Prerequisites

Elasticsearch 8.x with data-tier node roles assigned (data_hot, data_warm, data_cold) so rolled-over indices migrate cleanly through the hot-warm-cold architecture.
elasticsearch-py v8.0+ — this page uses the v8 client surface (ilm.explain_lifecycle, ilm.retry, typed exceptions), not the legacy body= pattern.
One write alias per tenant (logs-tenant-acme, logs-tenant-globex, …), each bootstrapped onto an index ending in -000001.
manage_ilm and manage_index_templates privileges on the automation account, scoped per Securing ILM Policies with RBAC — never granted to a tenant-facing API key.

Implementation: Tenant-Scoped Template and Composite Policy

The isolation boundary is the index template. Every tenant’s indices match the same logs-tenant-* pattern but carry a tenant_id keyword, a strict mapping to reject field-mapping collisions between tenants, and parse_origination_date so ILM computes index age for phase transitions from the date embedded in the index name rather than from index creation time.

PUT _index_template/logs-tenant-template
{
  "index_patterns": ["logs-tenant-*"],
  "priority": 200,
  "template": {
    "settings": {
      "index.lifecycle.name": "tenant-hot-warm-cold",
      "index.lifecycle.parse_origination_date": true,   // take index age from the yyyy.MM.dd date in the index name
      "index.routing.allocation.include._tier_preference": "data_hot",
      "number_of_shards": 3,
      "number_of_replicas": 0,                            // no hot-tier replicas — protect write throughput
      "index.codec": "best_compression"
    },
    "mappings": {
      "dynamic": "strict",                                // reject unknown fields — stop cross-tenant mapping drift
      "properties": {
        "tenant_id":  { "type": "keyword" },
        "@timestamp": { "type": "date" },
        "message":    { "type": "text" }
      }
    }
  }
}

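parse_origination_date: true is the setting most teams omit and most regret. It tells ILM to parse an origination date from the index name and use it, instead of the creation date, when computing index age for phase transitions, so backfilled week-old logs are aged from when they occurred. Two caveats apply. The index name must match ^.*-{yyyy.MM.dd}-\d+ (for example logs-tenant-acme-2024.05.01-000001); creating an index whose name lacks the date component, such as logs-tenant-acme-000001, fails while the setting is enabled. And rollover’s max_age is always measured from index creation, regardless of the origination date. Hot-tier number_of_replicas stays at 0 to keep write throughput high, at the cost of no redundancy until the warm phase; replicas are added later by the policy’s allocate action, not by the template.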
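The template above does not set index.lifecycle.rollover_alias, which alias-based rollover needs on every managed index. Because the alias differs per tenant, one option is to set it when bootstrapping each tenant's first index. The sketch below only builds the request; the helper name bootstrap_request and the dated naming scheme are assumptions for illustration, not part of the original setup.

```python
from datetime import date


def bootstrap_request(tenant: str, day: date) -> tuple[str, dict]:
    """Build the name and body for a tenant's first write index (hypothetical helper)."""
    alias = f"logs-tenant-{tenant}"
    # Assumed naming scheme: a yyyy.MM.dd component before the counter, so the
    # name is compatible with index.lifecycle.parse_origination_date.
    index_name = f"{alias}-{day:%Y.%m.%d}-000001"
    body = {
        "aliases": {alias: {"is_write_index": True}},
        "settings": {"index.lifecycle.rollover_alias": alias},
    }
    return index_name, body


name, body = bootstrap_request("acme", date(2024, 5, 1))
print(name)  # logs-tenant-acme-2024.05.01-000001
```

With the v8 Python client, the result could be passed along as es.indices.create(index=name, aliases=body["aliases"], settings=body["settings"]).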

The policy itself is shared — one definition, referenced by every tenant template — but its rollover action pairs a size trigger with an age trigger so it adapts to each tenant’s volume. A high-throughput tenant rolls on max_primary_shard_size; a quiet tenant rolls on max_age.
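To see why the OR-combined triggers adapt to volume, the back-of-envelope sketch below estimates which condition a tenant reaches first. It assumes a steady ingest rate, evenly filled primaries, and on-disk size equal to ingested size; first_trigger and its parameters are hypothetical names, with defaults taken from the template and policy on this page.

```python
def first_trigger(daily_gb: float, primaries: int = 3,
                  max_shard_gb: float = 30.0, max_age_h: float = 24.0) -> str:
    """Estimate which rollover condition fires first for one tenant."""
    if daily_gb <= 0:
        return "max_age"  # no growth, so the size condition is never reached
    gb_per_hour_per_shard = daily_gb / 24 / primaries
    hours_to_size_cap = max_shard_gb / gb_per_hour_per_shard
    return "max_primary_shard_size" if hours_to_size_cap < max_age_h else "max_age"


print(first_trigger(daily_gb=2160))  # busy tenant
print(first_trigger(daily_gb=10))    # quiet tenant
```

ILM evaluates these conditions on its poll interval, so real rollovers lag the estimate slightly.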

PUT _ilm/policy/tenant-hot-warm-cold
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_primary_shard_size": "30gb",  // busy tenants roll on size
            "max_age": "24h"                    // quiet tenants roll on age
          },
          "set_priority": { "priority": 100 }   // hot indices recover first after a restart
        }
      },
      "warm": {
        "min_age": "7d",                        // measured from rollover, not creation
        "actions": {
          "allocate": {
            "number_of_replicas": 1,
            "include": { "_tier_preference": "data_warm" }
          },
          "shrink":     { "number_of_shards": 1 },   // collapse to one primary once writes stop
          "forcemerge": { "max_num_segments": 1 },
          "set_priority": { "priority": 50 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "allocate": {
            "number_of_replicas": 0,
            "include": { "_tier_preference": "data_cold" }
          },
          "set_priority": { "priority": 0 }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": { "delete_searchable_snapshot": true }
        }
      }
    }
  }
}

The warm-phase shrink collapses each index to a single primary shard once ingestion has stopped, which reduces scatter/gather overhead on historical queries. ILM runs forcemerge after shrink regardless of the order the actions appear in the JSON, merging each shard down to one segment before the index moves to cold. Priority scaling (100 → 50 → 0) controls recovery order: after a node restart, hot indices are recovered before warm and cold ones.

Verification

Immediately after applying the template, confirm allocation is healthy before any tenant starts ingesting:

GET _cluster/health?wait_for_status=yellow&timeout=10s

{
  "cluster_name": "prod-logs",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 9,
  "active_primary_shards": 142,
  "active_shards": 142,
  "relocating_shards": 0,
  "unassigned_shards": 0,
  "active_shards_percent_as_number": 100.0
}

A green status is expected here: hot indices carry zero replicas by design, so once every primary is assigned nothing is left unassigned. wait_for_status=yellow simply returns as soon as the cluster is yellow or better. A yellow status means at least one replica is unassigned, typically on a warm index waiting for its replica. A red status, or unassigned_shards > 0, points to a tier-tag or routing misconfiguration that must be fixed before ingest starts. Next, confirm a specific tenant’s write index adopted the policy and is polling rollover conditions:

GET logs-tenant-acme-000001/_ilm/explain

"managed": true with "step": "check-rollover-ready" in the hot phase confirms the tenant is wired correctly. If _ilm/explain reports an ERROR step instead, the step_info block names the cause — most often an allocation mismatch:

{
  "indices": {
    "logs-tenant-acme-000001": {
      "managed": true,
      "policy": "tenant-hot-warm-cold",
      "phase": "warm",
      "action": "allocate",
      "step": "ERROR",
      "failed_step": "check-allocation",
      "step_info": {
        "type": "illegal_argument_exception",
        "reason": "cannot allocate shard [logs-tenant-acme-000001][0] on any node matching [include._tier_preference=data_warm]"
      }
    }
  }
}
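A small helper can turn one per-index entry of the explain response into a triage label. This is a sketch: classify_explain is a hypothetical name, and it relies only on the managed, step, and failed_step fields of the explain response.

```python
def classify_explain(entry: dict) -> str:
    """Map one per-index entry from _ilm/explain to a triage label."""
    if not entry.get("managed"):
        return "unmanaged"
    if entry.get("step") == "ERROR":
        # failed_step names the step that raised the error
        return f"error:{entry.get('failed_step', 'unknown')}"
    return "ok"


healthy = {"managed": True, "phase": "hot", "step": "check-rollover-ready"}
failed = {"managed": True, "phase": "warm", "step": "ERROR", "failed_step": "check-allocation"}
print(classify_explain(healthy))
print(classify_explain(failed))
```

Only entries labelled as errors are candidates for _ilm/retry; the retry API rejects indices that are not in the ERROR step.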

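When step_info reports an allocation failure, verify that the warm-tier nodes actually hold the data_warm role before touching anything else. Data tiers are node roles, so they show up in _cat/nodes (data_warm appears as w in the node.role column) rather than in _cat/nodeattrs, which lists only custom attributes:

GET _cat/nodes?v&h=name,node.role&s=name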

If nodes are correctly tagged but a shard is still unassigned, apply a controlled reroute. Never use allocate_stale_primary unless data loss is explicitly acceptable; use allocate_replica to place a stuck replica on a healthy warm node:

POST _cluster/reroute
{
  "commands": [
    { "allocate_replica": { "index": "logs-tenant-acme-000001", "shard": 0, "node": "warm-node-01" } }
  ]
}

Then clear the ILM error state so the policy resumes from the failed step:

POST logs-tenant-acme-000001/_ilm/retry

Automated Per-Tenant Recovery

Manual retries do not scale across dozens of tenants. The Python v8+ agent below scans every logs-tenant-* index, finds ones stalled in a phase beyond a threshold, and retries them with exponential backoff — escalating only what genuinely fails. It uses the v8 client’s typed exceptions and the correct ilm.explain_lifecycle / ilm.retry method names.

import logging
import time
from typing import List
from elasticsearch import Elasticsearch, ApiError, ConnectionError

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("ilm_recovery")


class ILMRecoveryAgent:
    def __init__(self, es_hosts: List[str], api_key: str):
        self.es = Elasticsearch(
            hosts=es_hosts,
            api_key=api_key,
            request_timeout=30,
            max_retries=3,
            retry_on_timeout=True,
        )

    def get_stuck_indices(self, phase: str = "warm", max_stuck_hours: int = 2) -> List[str]:
        """Return tenant indices halted in `phase` past the stall threshold."""
        try:
            explain = self.es.ilm.explain_lifecycle(index="logs-tenant-*")
        except (ConnectionError, ApiError) as exc:
            logger.error("Failed to fetch ILM explain: %s", exc)
            return []

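        stuck, now_ms = [], time.time() * 1000
        for idx, data in explain["indices"].items():
            # _ilm/retry only works on indices parked in the ERROR step; an index
            # that finished its phase sits at step "complete" and is not stuck.
            if data.get("phase") != phase or data.get("step") != "ERROR":
                continue
            step_time = data.get("step_time_millis")
            if step_time is not None and (now_ms - step_time) > (max_stuck_hours * 3_600_000):
                stuck.append(idx)
        return stuck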

    def retry_stuck_phase(self, index: str) -> bool:
        """Validate allocation, then clear the ILM error state for one index."""
        try:
            self.es.cluster.health(index=index, wait_for_status="yellow", timeout="10s")
            self.es.ilm.retry(index=index)  # POST /<index>/_ilm/retry
            logger.info("Triggered ILM retry for %s", index)
            return True
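        except ApiError as exc:
            logger.warning("Retry failed for %s: %s - %s", index, exc.status_code, exc.info)
            return False
        except ConnectionError as exc:
            # Transport failures carry no HTTP status; log and let the caller back off.
            logger.warning("Connection error while retrying %s: %s", index, exc)
            return False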

    def run_recovery_cycle(self) -> None:
        stuck = self.get_stuck_indices()
        if not stuck:
            logger.info("No stuck indices detected.")
            return

        for idx in stuck:
            for attempt in range(3):
                if self.retry_stuck_phase(idx):
                    break
                time.sleep(2 ** attempt)  # exponential backoff after a failed attempt: 1s, 2s, 4s
            else:
                logger.critical("ESCALATION REQUIRED: %s failed recovery after 3 attempts.", idx)


if __name__ == "__main__":
    agent = ILMRecoveryAgent(
        es_hosts=["https://es-node-01:9200", "https://es-node-02:9200"],
        api_key="YOUR_SERVICE_ACCOUNT_API_KEY",
    )
    agent.run_recovery_cycle()

Schedule the agent on a five-minute cron or a Kubernetes CronJob during peak ingestion windows. Because it targets the logs-tenant-* pattern, a single deployment covers every tenant without per-tenant configuration.

Gotchas and Edge Cases

Template priority collisions. Composable index templates are not merged: when several match an index name, only the one with the highest priority applies. If another template matching logs-* has a higher priority than the tenant template, new tenant indices lose the lifecycle binding and are created unmanaged. Keep the tenant template at a distinct, higher priority and confirm the resolved settings with GET logs-tenant-acme-000001/_settings.
dynamic: strict rejects surprise fields. A tenant that suddenly ships a new log field will get strict_dynamic_mapping_exception rather than a silent mapping drift. That is the intended safety behaviour, but it means schema changes must go through a template update — not an ad-hoc document write.
Shared policy, per-tenant clock. Because min_age in the warm and cold phases counts from the rollover event, two tenants sharing tenant-hot-warm-cold transition on different wall-clock days depending on when each rolled over. Do not assume synchronized phase movement across tenants.
Deleting a searchable snapshot is irreversible. delete_searchable_snapshot: true (also the default) only matters once a searchable_snapshot action is added to the policy; the policy above has none. When one is present, the backing snapshot is removed when the index is deleted at 90 days. If a tenant’s retention SLA later extends, that data is already gone, so version the policy before changing retention rather than editing in place.
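The per-tenant clock can be made concrete with a short calculation. The sketch below derives the earliest phase entry times from a rollover timestamp using the min_age values in the policy above; phase_schedule is a hypothetical helper, and it ignores the ILM poll interval and the time the actions themselves take.

```python
from datetime import datetime, timedelta, timezone

MIN_AGE_DAYS = {"warm": 7, "cold": 30, "delete": 90}  # from tenant-hot-warm-cold


def phase_schedule(rolled_over_at: datetime) -> dict[str, datetime]:
    """Earliest time each phase can start, counted from the rollover."""
    return {phase: rolled_over_at + timedelta(days=d) for phase, d in MIN_AGE_DAYS.items()}


acme = phase_schedule(datetime(2024, 5, 1, 3, 0, tzinfo=timezone.utc))
globex = phase_schedule(datetime(2024, 5, 1, 21, 0, tzinfo=timezone.utc))
print(acme["warm"].isoformat())    # 2024-05-08T03:00:00+00:00
print(globex["warm"].isoformat())  # 2024-05-08T21:00:00+00:00
```

Two tenants that rolled over 18 hours apart stay 18 hours apart through every later phase, even though they share one policy.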

Escalation and Compliance

Automated retries absorb transient failures; systemic degradation needs a human. Route on-call decisions through a fixed matrix, and audit every lifecycle change.

Condition	Action	SLA
`unassigned_shards` > 5% of total	Halt ingestion, run `GET _cluster/allocation/explain`	15 min
`step_info` indicates `disk_watermark` breach	Add data nodes or purge expired snapshots	30 min
`retry_failed` indices > 10 across tenants	Engage platform engineering, controlled `_cluster/reroute`	1 hour
Cluster health `red` for > 2 hours	Activate DR failover, isolate affected tenant indices	2 hours
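The matrix can also be encoded so the recovery job reports the same decision an on-call engineer would reach. This is a sketch only: the function name, the argument names, and the order in which rows are checked are assumptions, while the thresholds are the ones in the table. The disk_watermark row is left out because it depends on parsing step_info text.

```python
def escalation(unassigned_pct: float, retry_failed: int, red_hours: float) -> str:
    """Return the action for the first matching row of the escalation matrix."""
    if red_hours > 2:
        return "activate DR failover"
    if unassigned_pct > 5:
        return "halt ingestion and run allocation explain"
    if retry_failed > 10:
        return "engage platform engineering"
    return "no escalation"


print(escalation(unassigned_pct=6.5, retry_failed=0, red_hours=0))
```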

Enable audit logging (xpack.security.audit.enabled: true in elasticsearch.yml) and route lifecycle events to a dedicated compliance index. Restrict manage_ilm to service accounts; a tenant API key should never hold cluster-level ILM privileges, which is exactly the separation the RBAC role matrix enforces.

FAQ

One shared policy or one policy per tenant?

Prefer one shared policy referenced by per-tenant templates. The rollover action’s OR-combined max_primary_shard_size and max_age triggers already adapt to each tenant’s volume, so a single tenant-hot-warm-cold policy handles both busy and quiet tenants. Split into separate policies only when retention SLAs genuinely differ — for example, one tenant contractually requires 30-day deletion while others keep 90 days.

How do I stop one tenant from starving others on the hot tier?

Give each tenant its own write alias and template so rollover thresholds are scoped, and keep hot indices at number_of_replicas: 0 to protect write throughput. set_priority only governs the order in which indices are recovered after a restart; it does not arbitrate thread pools or indexing capacity between tenants. If a single tenant still dominates ingest, isolate it onto dedicated hot nodes using a custom node attribute and index.routing.allocation.require.<attribute> on that tenant’s indices, since _tier_preference accepts only data tier names.

Why is `parse_origination_date` important for multi-tenant backfills?

Tenants frequently backfill historical logs. Without parse_origination_date: true, ILM computes an index’s age for phase transitions from its creation time, so week-old backfilled data is treated as brand new and is retained longer than the tenant’s retention intent. With it enabled, ILM parses the origination date from the index name, which must contain a yyyy.MM.dd component, and ages the index from that date. It does not read document timestamps, and it does not affect the rollover max_age trigger, which always counts from index creation.
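parse_origination_date reads the date from the index name, so it is worth checking names before bootstrapping. The sketch below approximates the documented name pattern (a yyyy.MM.dd date followed by an optional numeric suffix) with a Python regex; it is an illustration, not Elasticsearch's own parser, and origination_date is a hypothetical helper.

```python
import re
from datetime import date
from typing import Optional

# Approximation of the documented pattern ^.*-{yyyy.MM.dd}-\d+ (suffix optional).
NAME_PATTERN = re.compile(r"^.*-(\d{4})\.(\d{2})\.(\d{2})(-\d+)?$")


def origination_date(index_name: str) -> Optional[date]:
    """Return the date embedded in the index name, or None if the name has none."""
    match = NAME_PATTERN.match(index_name)
    if match is None:
        return None
    year, month, day = (int(g) for g in match.groups()[:3])
    return date(year, month, day)


print(origination_date("logs-tenant-acme-2024.05.01-000001"))  # 2024-05-01
print(origination_date("logs-tenant-acme-000001"))             # None
```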

Related

Configuring index rollover conditions: the parent workflow, covering how the OR-combined size and age triggers behave before you scope them per tenant.
Configuring rollover based on max primary shard size: sizing the max_primary_shard_size trigger each tenant rolls on.
Troubleshooting ILM phase transition delays: diagnosing the stalled check-allocation and warm-tier steps this page automates.
Securing ILM policies with RBAC: keeping manage_ilm off tenant-facing API keys.
