Setting Up ILM for Multi-Tenant Log Analytics

Multi-tenant log architectures demand strict data isolation, predictable storage tiering, and deterministic retention. When the defaults covered in Elasticsearch ILM Architecture & Fundamentals are applied uniformly across heterogeneous tenants, mapping collisions, skewed shard distribution, and stalled lifecycles become likely. Production-grade ILM requires tenant-scoped policies, composite rollover triggers, and automated recovery workflows that stay idempotent under policy drift or cluster degradation. This guide covers diagnostic procedures, safe manual intervention, and automated recovery patterns built on the official Python client (elasticsearch-py v8+). Compliance and rapid restoration are non-negotiable.

flowchart TD
  T1["Tenant A logs"] --> TPL["Index template (logs-tenant-*)"]
  T2["Tenant B logs"] --> TPL
  TPL --> POL["ILM policy: tenant-hot-warm-cold"]
  POL --> H["Hot tier"]
  POL --> WM["Warm tier"]
  POL --> CD["Cold tier"]

1. Tenant-Scoped Policy Architecture & Allocation

Monolithic ILM policies fail under multi-tenant load. Divergent ingestion volumes, retention SLAs, and query patterns require explicit tenant binding at the template level. You must enforce _tier_preference routing, disable hot-tier replicas, and enable parse_origination_date wherever index names carry a date, so that lifecycle age is calculated from the data's date rather than the index creation time.

PUT _index_template/logs-tenant-template
{
  "index_patterns": ["logs-tenant-*"],
  "priority": 200,
  "template": {
    "settings": {
      "index.lifecycle.name": "tenant-hot-warm-cold",
      "index.lifecycle.parse_origination_date": true,
      "index.routing.allocation.include._tier_preference": "data_hot",
      "number_of_shards": 3,
      "number_of_replicas": 0,
      "index.codec": "best_compression"
    },
    "mappings": {
      "dynamic": "strict",
      "properties": {
        "tenant_id": { "type": "keyword" },
        "@timestamp": { "type": "date" },
        "message": { "type": "text" }
      }
    }
  }
}

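parse_origination_date: true tells ILM to parse an origination date from the index name and to use it, instead of the index creation time, when calculating index age for phase transitions. This matters for backfilled data, where creation time says nothing about how old the events are. The setting requires index names of the form ^.*-{date_format}-\d+ with a yyyy.MM.dd date (for example logs-tenant-acme-2024.01.01-000001); creating an index whose name does not match, such as logs-tenant-acme-000001, fails, so drop the setting if your indices are not date-stamped. Hot-tier replicas stay at 0 in this design to preserve write throughput, at the cost of redundancy until the warm phase. Replicas are added during the warm transition by ILM actions, not by template defaults.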
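The template above does not define a rollover alias, and a shared template cannot, because the alias differs per tenant. For alias-based rollover (as opposed to data streams), each tenant's first index is usually bootstrapped with a write alias and a matching index.lifecycle.rollover_alias setting. The sketch below only builds that request; the function name, the tenant ID rule, and the alias naming scheme are assumptions made for illustration, not part of the original setup.

```python
import re

# Hypothetical naming rule for this sketch: lowercase letters, digits, hyphens.
TENANT_ID = re.compile(r"^[a-z0-9][a-z0-9-]{0,62}$")


def bootstrap_request(tenant: str) -> dict:
    """Build the arguments for creating a tenant's first rollover index."""
    if not TENANT_ID.match(tenant):
        # Reject wildcards and uppercase so one tenant cannot match another's indices.
        raise ValueError(f"invalid tenant id: {tenant!r}")
    alias = f"logs-tenant-{tenant}"  # assumed alias scheme
    return {
        "index": f"{alias}-000001",
        # The write alias is what the rollover action swaps to the new index.
        "aliases": {alias: {"is_write_index": True}},
        # ILM needs to know which alias to roll over for this index.
        "settings": {"index.lifecycle.rollover_alias": alias},
    }


req = bootstrap_request("acme")
print(req["index"])  # logs-tenant-acme-000001
```

With a configured client, the result could be sent as es.indices.create(**req). If parse_origination_date is enabled, the index name would also need the date component described above.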

Verify cluster allocation readiness immediately after template deployment:

GET _cluster/health?wait_for_status=yellow&timeout=10s

Expected Output:

{
  "cluster_name": "prod-logs",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 9,
  "active_primary_shards": 142,
  "active_shards": 142,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 100.0
}

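Because the request waits for yellow or better, it returns as soon as every primary is assigned. With zero hot-tier replicas and nothing unassigned, a healthy cluster reports green. Yellow means replica shards are still unassigned, which is acceptable only briefly, for example while warm-phase replicas are being allocated. A red status or unassigned primaries point to a routing misconfiguration, and "timed_out": true means the cluster did not reach yellow within 10 seconds.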
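The health check can be reduced to a go/no-go decision in a deployment script. The following is a minimal sketch that classifies a _cluster/health response body; the function name and the three labels are hypothetical, chosen only for this example.

```python
def assess_health(health: dict) -> str:
    """Classify a _cluster/health response as 'ok', 'investigate', or 'block'."""
    status = health.get("status")
    if health.get("timed_out") or status == "red":
        # Primaries are missing or the wait expired: do not proceed.
        return "block"
    if status == "yellow" or health.get("unassigned_shards", 0) > 0:
        # Replicas are unassigned: tolerable briefly, worth a look.
        return "investigate"
    return "ok"


print(assess_health({"status": "green", "timed_out": False, "unassigned_shards": 0}))
```

The dict passed in could be the response of es.cluster.health(wait_for_status="yellow", timeout="10s") from the Python client.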

2. Deterministic Rollover & Conflict Prevention

Single-trigger rollovers (max_age or max_docs) create either oversized primary shards or excessive index proliferation. You must implement composite thresholds that adapt to tenant volume. Reference Configuring Index Rollover Conditions for threshold calibration against your cluster’s JVM heap and segment merge capacity.

PUT _ilm/policy/tenant-hot-warm-cold
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_primary_shard_size": "30gb",
            "max_age": "24h"
          },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "allocate": {
            "number_of_replicas": 1,
            "include": { "_tier_preference": "data_warm" }
          },
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 },
          "set_priority": { "priority": 50 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "allocate": {
            "number_of_replicas": 0,
            "include": { "_tier_preference": "data_cold" }
          },
          "set_priority": { "priority": 0 }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": { "delete_searchable_snapshot": true }
        }
      }
    }
  }
}
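The rollover conditions in the policy above are combined with OR: the index rolls over as soon as either threshold is reached. The sketch below mirrors that logic for capacity planning; it is an approximation of the semantics, not Elasticsearch's implementation, and the function and parameter names are illustrative.

```python
GB = 1024 ** 3  # Elasticsearch byte-size units such as "30gb" are binary


def rollover_due(largest_primary_bytes: int, age_hours: float,
                 max_primary_bytes: int = 30 * GB,
                 max_age_hours: float = 24.0) -> bool:
    """Return True when either composite threshold is met (OR semantics)."""
    return largest_primary_bytes >= max_primary_bytes or age_hours >= max_age_hours


# A high-volume tenant hits the size limit long before the age limit.
print(rollover_due(31 * GB, age_hours=3))   # True
# A sparse tenant rolls over on age alone.
print(rollover_due(2 * GB, age_hours=25))   # True
print(rollover_due(2 * GB, age_hours=3))    # False
```

ILM evaluates rollover conditions on its poll interval (indices.lifecycle.poll_interval, 10 minutes by default), so an index can overshoot a threshold slightly before it actually rolls over.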

The shrink action reduces shard count to 1 in the warm phase, eliminating cross-node scatter/gather overhead for historical queries. ILM runs actions within a phase in a fixed order regardless of their order in the JSON, so forcemerge to 1 segment always executes after shrink, which keeps the cold-tier footprint small. Priority scaling (100 → 50 → 0) sets index.priority, which controls the order in which indices are recovered after a node restart, so hot indices come back first.
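For an index that has rolled over, each phase's min_age is measured from the rollover time, not from index creation. A small worked example makes the resulting retention timeline concrete; the helper name and the sample date are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

# min_age values taken from the tenant-hot-warm-cold policy in this guide.
PHASE_MIN_AGE = {
    "warm": timedelta(days=7),
    "cold": timedelta(days=30),
    "delete": timedelta(days=90),
}


def phase_schedule(rolled_over_at: datetime) -> dict:
    """Earliest time each phase can begin for an index rolled over at the given time."""
    return {phase: rolled_over_at + age for phase, age in PHASE_MIN_AGE.items()}


schedule = phase_schedule(datetime(2024, 1, 1, tzinfo=timezone.utc))
for phase, when in schedule.items():
    print(phase, when.date())
# warm 2024-01-08
# cold 2024-01-31
# delete 2024-03-31
```

These are earliest entry times: the actual transition happens on the next ILM poll after the threshold, and only once the previous phase's actions have completed.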

3. Cluster State Diagnostics & Safe Manual Reroutes

ILM stalls occur when allocation constraints conflict with node availability, or when policy steps fail due to transient network partitions. Diagnose immediately using _ilm/explain.

GET logs-tenant-acme-000001/_ilm/explain

Stalled Output Example:

{
  "indices": {
    "logs-tenant-acme-000001": {
      "index": "logs-tenant-acme-000001",
      "managed": true,
      "policy": "tenant-hot-warm-cold",
      "lifecycle_date_millis": 1704067200000,
      "phase": "warm",
      "phase_time_millis": 1704153600000,
      "action": "allocate",
      "step": "check-allocation",
      "step_time_millis": 1704153600000,
      "step_info": {
        "type": "illegal_argument_exception",
        "reason": "cannot allocate shard [logs-tenant-acme-000001][0] on any node matching [include._tier_preference=data_warm]"
      }
    }
  }
}
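When many tenants are involved, reading explain output by hand becomes slow. The sketch below condenses an explain response into one row per failing index. It assumes the response shape shown above and also handles the case where the step is reported as ERROR with the original step in failed_step; the function name is hypothetical.

```python
def summarize_stalls(explain: dict) -> list:
    """Return one summary row per index whose ILM step reports a failure."""
    rows = []
    for name, info in explain.get("indices", {}).items():
        step_info = info.get("step_info") or {}
        failed = info.get("step") == "ERROR" or "reason" in step_info
        if not failed:
            continue
        rows.append({
            "index": name,
            "phase": info.get("phase"),
            "action": info.get("action"),
            # Prefer failed_step when present; otherwise report the current step.
            "step": info.get("failed_step") or info.get("step"),
            "reason": step_info.get("reason", ""),
        })
    return rows


sample = {
    "indices": {
        "logs-tenant-acme-000001": {
            "phase": "warm",
            "action": "allocate",
            "step": "check-allocation",
            "step_info": {"type": "illegal_argument_exception",
                          "reason": "cannot allocate shard"},
        },
        "logs-tenant-acme-000002": {"phase": "hot", "action": "rollover",
                                    "step": "check-rollover-ready"},
    }
}
for row in summarize_stalls(sample):
    print(row["index"], row["phase"], row["step"], row["reason"])
```

Only the first index is reported; the second is waiting normally on a rollover check and carries no failure reason.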

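When step_info indicates an allocation failure, verify that warm nodes exist and carry the data_warm role (shown as w in the node.role column):

GET _cat/nodes?v&h=name,node.role

If you route on custom node attributes rather than tier roles, check those as well:

GET _cat/nodeattrs?v&h=name,attr,value&s=attr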

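If nodes are correctly tagged but shards remain unassigned, run GET _cluster/allocation/explain to confirm the cause, then execute a safe manual reroute. Both allocate_stale_primary and allocate_empty_primary require "accept_data_loss": true and can destroy data: never use allocate_stale_primary unless data loss is explicitly acceptable, and use allocate_empty_primary only when recreating an index from scratch. Use move for load balancing. To place an unassigned replica on a specific warm node: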

POST _cluster/reroute
{
  "commands": [
    {
      "allocate_replica": {
        "index": "logs-tenant-acme-000001",
        "shard": 0,
        "node": "warm-node-01"
      }
    }
  ]
}

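After the reroute, retry the failed step. _ilm/retry applies only to indices whose current step is ERROR; an index that is merely waiting on a condition is re-evaluated by ILM on its next poll: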

POST logs-tenant-acme-000001/_ilm/retry
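Manual reroutes are easier to review when they are first issued as a dry run. The sketch below builds the arguments for a replica-only reroute with dry_run and explain enabled by default; the function name is an assumption, and restricting it to allocate_replica is a deliberate choice so that the data-loss commands cannot be produced by accident.

```python
def replica_reroute_request(index: str, shard: int, node: str,
                            dry_run: bool = True) -> dict:
    """Build cluster reroute arguments that can only allocate a replica."""
    if shard < 0:
        raise ValueError("shard number must be non-negative")
    return {
        "commands": [
            {"allocate_replica": {"index": index, "shard": shard, "node": node}}
        ],
        # dry_run computes the resulting cluster state without applying it.
        "dry_run": dry_run,
        # explain reports why each command can or cannot be executed.
        "explain": True,
    }


req = replica_reroute_request("logs-tenant-acme-000001", 0, "warm-node-01")
print(req["dry_run"], list(req["commands"][0]))  # True ['allocate_replica']
```

With a configured client this could be sent as es.cluster.reroute(**req), then repeated with dry_run=False once the explanation looks right.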

4. Automated Python v8+ Recovery Patterns

Manual intervention does not scale. The following recovery script, written against the official elasticsearch Python client (v8+), detects indices stuck beyond a time threshold, checks index health, and triggers retries with exponential backoff. Every decision is logged with a timestamp so that recovery actions can be audited.

import logging
import time
from elasticsearch import Elasticsearch, ApiError, ConnectionError
from typing import Dict, List

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("ilm_recovery")

class ILMRecoveryAgent:
    def __init__(self, es_hosts: List[str], api_key: str):
        self.es = Elasticsearch(
            hosts=es_hosts,
            api_key=api_key,
            request_timeout=30,
            max_retries=3,
            retry_on_timeout=True
        )

    def get_stuck_indices(self, phase: str = "warm", max_stuck_hours: int = 2) -> List[str]:
        """Identify indices whose step has been in ERROR beyond the threshold."""
        try:
            # only_errors limits the response to indices in a failed ILM state.
            explain = self.es.ilm.explain_lifecycle(index="logs-tenant-*", only_errors=True)
            stuck = []
            current_ts = time.time() * 1000
            for idx, data in explain["indices"].items():
                # _ilm/retry only works on the ERROR step. An index resting in a
                # phase until the next min_age is healthy, not stuck.
                if data.get("phase") != phase or data.get("step") != "ERROR":
                    continue
                step_time = data.get("step_time_millis", 0)
                if (current_ts - step_time) > (max_stuck_hours * 3_600_000):
                    stuck.append(idx)
            return stuck
        except (ConnectionError, ApiError) as e:
            logger.error(f"Failed to fetch ILM explain: {e}")
            return []

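    def retry_stuck_phase(self, index: str) -> bool:
        """Retry the failed ILM step once the index reports at least yellow health."""
        try:
            health = self.es.cluster.health(index=index, wait_for_status="yellow", timeout="10s")
            if health["timed_out"]:
                # Primaries are still unassigned; a retry would fail again.
                logger.warning(f"{index} did not reach yellow within 10s; skipping retry")
                return False
            self.es.ilm.retry(index=index)
            logger.info(f"Triggered ILM retry for {index}")
            return True
        except ApiError as e:
            logger.warning(f"Retry failed for {index}: {e.meta.status} - {e.body}")
            return False
        except ConnectionError as e:
            # Transport-level failures are not ApiError subclasses.
            logger.warning(f"Connection problem while retrying {index}: {e}")
            return False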

    def run_recovery_cycle(self):
        """Execute automated recovery with backoff."""
        stuck_indices = self.get_stuck_indices()
        if not stuck_indices:
            logger.info("No stuck indices detected.")
            return

        for idx in stuck_indices:
            success = False
            for attempt in range(3):
                time.sleep(2 ** attempt)
                if self.retry_stuck_phase(idx):
                    success = True
                    break
            if not success:
                logger.critical(f"ESCALATION REQUIRED: {idx} failed automated recovery after 3 attempts.")

if __name__ == "__main__":
    agent = ILMRecoveryAgent(
        es_hosts=["https://es-node-01:9200", "https://es-node-02:9200"],
        api_key="YOUR_SERVICE_ACCOUNT_API_KEY"
    )
    agent.run_recovery_cycle()

For advanced async implementations, consult the Elasticsearch Python Client v8 Documentation. Schedule this script via cron or Kubernetes CronJob at 5-minute intervals during peak ingestion windows.
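The script's backoff doubles the delay on every attempt with no upper bound and no randomization. If several scheduled runs overlap, a capped schedule with jitter spreads retries out. The following is a sketch of one common variant (full jitter); the function name and defaults are assumptions and are not part of the script above.

```python
import random
from typing import List, Optional


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 30.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Exponential delays capped at `cap`; when an rng is given, apply full jitter."""
    delays = []
    for n in range(attempts):
        ceiling = min(cap, base * 2 ** n)
        # Full jitter picks a random delay between 0 and the exponential ceiling.
        delays.append(rng.uniform(0, ceiling) if rng else ceiling)
    return delays


print(backoff_delays(3))  # [1.0, 2.0, 4.0]
print(backoff_delays(6))  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

Passing rng=random.Random() yields randomized delays bounded by the same ceilings; the values could replace the fixed 2 ** attempt sleep in run_recovery_cycle.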

5. Escalation Paths & Compliance Enforcement

Automated recovery handles transient failures. Systemic degradation requires immediate escalation. Adhere to the following decision matrix:

| Condition | Action | SLA |
| --- | --- | --- |
| unassigned_shards > 5% of total | Halt ingestion pipelines, verify _cluster/allocation/explain | 15 min |
| ILM step_info indicates disk_watermark breach | Expand data nodes or purge expired snapshots | 30 min |
| retry_failed indices > 10 across tenants | Engage platform engineering, force manual _cluster/reroute | 1 hour |
| Cluster health red for > 2 hours | Activate DR failover, isolate affected tenant indices | 2 hours |
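The decision matrix can be encoded so that alerting produces the same answer every time. The sketch below maps the four conditions to their actions and SLAs; the function name and the input parameters are hypothetical, and how each input is measured is left to your monitoring stack.

```python
from typing import List, Tuple


def escalations(unassigned_pct: float, disk_watermark_breach: bool,
                retry_failed_indices: int, red_hours: float) -> List[Tuple[str, int]]:
    """Return (action, SLA in minutes) for every matrix row that applies, tightest SLA first."""
    hits = []
    if unassigned_pct > 5:
        hits.append(("Halt ingestion pipelines, verify _cluster/allocation/explain", 15))
    if disk_watermark_breach:
        hits.append(("Expand data nodes or purge expired snapshots", 30))
    if retry_failed_indices > 10:
        hits.append(("Engage platform engineering, force manual _cluster/reroute", 60))
    if red_hours > 2:
        hits.append(("Activate DR failover, isolate affected tenant indices", 120))
    return sorted(hits, key=lambda hit: hit[1])


for action, sla in escalations(7.5, False, 12, 0):
    print(f"{sla:>3} min: {action}")
```

An empty list means no row of the matrix applies and automated recovery can continue unattended.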

All ILM modifications must be audited. Enable xpack.security.audit.enabled: true in elasticsearch.yml and ship the resulting audit log to a dedicated compliance index. RBAC policies must restrict the manage_ilm cluster privilege to service accounts only. Never grant manage_ilm, which covers the cluster:admin/ilm/* actions, to tenant-scoped API keys.
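A tenant-scoped API key can be limited through role descriptors so that it never carries cluster privileges. The following sketch builds such a descriptor; the role name, the index pattern, and the choice of read-only privileges are assumptions for illustration and should be adapted to what tenants actually need.

```python
def tenant_role_descriptors(tenant: str) -> dict:
    """Role descriptors for a tenant API key: index read access only, no cluster privileges."""
    return {
        f"tenant-{tenant}-read": {
            # Deliberately empty: no manage_ilm or any other cluster privilege.
            "cluster": [],
            "indices": [
                {
                    "names": [f"logs-tenant-{tenant}-*"],
                    "privileges": ["read", "view_index_metadata"],
                }
            ],
        }
    }


descriptors = tenant_role_descriptors("acme")
print(descriptors["tenant-acme-read"]["indices"][0]["names"])  # ['logs-tenant-acme-*']
```

With a configured client the result could be passed as es.security.create_api_key(name="tenant-acme", role_descriptors=descriptors).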

Setting up ILM for multi-tenant log analytics is not a one-time configuration exercise; it is a continuous operational discipline. Enforce strict template boundaries, validate allocation constraints before deployment, and maintain automated recovery pipelines. Deviation from these protocols leads to index fragmentation, query latency degradation, and compliance violations. Execute precisely. Monitor relentlessly. Restore immediately.