Understanding Hot-Warm-Cold Architecture

A hot-warm-cold deployment exists to solve one economic problem: the newest 5% of your data absorbs 95% of the write load and almost all of the latency-sensitive queries, yet under a flat topology it sits on exactly the same hardware as the archival data that is never written to and rarely read. Paying for NVMe and high heap-per-shard across the entire retention window is the default failure of a single-tier design. The operational challenge this page addresses is how to route each index to storage that matches its access pattern — high-IOPS nodes while it is being written, dense read-optimized nodes while it is still queried, and cheap high-capacity nodes once it is only kept for compliance — and to have Index Lifecycle Management move indices between those tiers automatically, without a human touching allocation settings.

That routing is not a one-time placement decision. It is a continuous reconciliation driven by Index Lifecycle Management (ILM), which evaluates each managed index against a policy and rewrites its allocation requirements as it ages. Get the node attributes, index templates, and phase boundaries aligned and the tiers become invisible infrastructure; get any one of them wrong and indices pile up on the hot tier, shards go UNASSIGNED, and the disk watermark you were trying to avoid arrives faster than it would have on a flat cluster. This page walks the full path: prerequisites, how the tiers plug into the lifecycle state machine, the exact configuration, a step-by-step rollout, verification queries, sizing guidance, and the troubleshooting patterns for the transitions that stall.

Prerequisites

Confirm each of these before applying a tiered policy to a production cluster — a missing node role or an unmanaged template is the most common reason a hot-warm-cold rollout silently fails.

Elasticsearch 8.x running with a dedicated master quorum (three master-eligible nodes) so allocation decisions survive a node loss.
At least one data node per tier you intend to use, each carrying the matching data-tier role (data_hot, data_warm, data_cold) — a phase cannot complete if its target tier has no eligible node.
Disk watermarks reviewed on every tier: cluster.routing.allocation.disk.watermark.low, .high, and .flood_stage sized for the densest tier, not the default 85/90/95%.
An index template that sets index.lifecycle.name and index.lifecycle.rollover_alias, so every new backing index is managed from creation rather than orphaned.
A write alias or data stream in place — tier transitions only begin after the Hot phase rolls over, and rollover requires an alias with exactly one is_write_index.
elasticsearch-py 8.x installed (pip install "elasticsearch>=8,<9"); the v8 client enforces keyword-only arguments used throughout this page.
A service account with manage_ilm and manage_index_templates cluster privileges for the automation, kept separate from human read_ilm access.
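The tier-role prerequisite can be checked mechanically before any policy is applied. The sketch below is a minimal illustration rather than part of the original rollout: it assumes the dictionary shape of a nodes info response (a `nodes` mapping whose entries carry a `roles` list), and `missing_tiers` is a hypothetical helper name.

```python
from typing import Any, Dict, Iterable, List

# Tier roles the example policy on this page relies on.
REQUIRED_TIERS = ("data_hot", "data_warm", "data_cold")


def missing_tiers(nodes_info: Dict[str, Any],
                  required: Iterable[str] = REQUIRED_TIERS) -> List[str]:
    """Return the tier roles that no node in the cluster carries."""
    present = set()
    for node in nodes_info.get("nodes", {}).values():
        present.update(node.get("roles", []))
    if "data" in present:
        # Assumption: a node with the generic data role can serve every tier.
        return []
    return [tier for tier in required if tier not in present]


# Hand-written sample in the assumed response shape (not real cluster output).
sample = {
    "nodes": {
        "a1": {"name": "hot-01", "roles": ["data_hot", "ingest"]},
        "b2": {"name": "warm-01", "roles": ["data_warm"]},
        "c3": {"name": "master-01", "roles": ["master"]},
    }
}

if __name__ == "__main__":
    print(missing_tiers(sample))
```

A non-empty result means at least one phase in the policy has nowhere to land, which is the condition to fix before continuing.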

How the Tiers Fit the Lifecycle State Machine

The tiers are the physical projection of the lifecycle phases. Each phase in the ILM state machine corresponds to a class of hardware, and the transition between phases is, mechanically, the shard allocator relocating an index’s shards from one tier to the next. The policy never names a specific node; it declares the tier requirement for the current phase, and the allocator resolves that against whichever nodes carry the matching role. Where rollover conditions decide when an index leaves the Hot phase, the tier allocation decides where it lands next — the two are configured together and belong to the same rollover conditions design step.

Read the lifecycle as a one-way ratchet driven by min_age, which ILM measures from the moment the index rolls over, not from when it was created. An index that fills slowly can sit in the Hot phase well past its nominal warm min_age simply because it has not rolled yet. Once rollover fires, the clock starts, and each subsequent boundary rewrites the index’s index.routing.allocation.include._tier_preference to point at the next tier down. The allocator then moves the shards; only when relocation completes does ILM run that phase’s remaining actions (shrink, forcemerge). This is why a full or misconfigured tier does not throw an immediate error: it holds the index in a waiting step until capacity appears.
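The rollover-relative clock is easy to get wrong, so here is a small model of it. This is a sketch under stated assumptions: the boundaries mirror the example policy on this page, `eligible_phase` is a hypothetical helper, and it models only which phase an index is old enough for, not the poll interval or the time relocation takes.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional

# Phase boundaries from the example policy, expressed as age since rollover.
BOUNDARIES: Dict[str, timedelta] = {
    "hot": timedelta(0),
    "warm": timedelta(days=7),
    "cold": timedelta(days=30),
    "delete": timedelta(days=90),
}


def eligible_phase(rolled_over_at: Optional[datetime],
                   now: datetime,
                   boundaries: Dict[str, timedelta] = BOUNDARIES) -> str:
    """Return the latest phase whose min_age has elapsed since rollover."""
    if rolled_over_at is None:
        # Not rolled over yet: the min_age clock has not started.
        return "hot"
    age = now - rolled_over_at
    phase = "hot"
    for name, min_age in sorted(boundaries.items(), key=lambda kv: kv[1]):
        if age >= min_age:
            phase = name
    return phase


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # An old index that never rolled over is still hot.
    print(eligible_phase(None, now))
    print(eligible_phase(now - timedelta(days=8), now))
```

Note that creation time is not an input at all, which is the point: only the rollover timestamp moves an index along.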

Node Attribute and Role Configuration

Two mechanisms can pin an index to a tier, and modern deployments should prefer one over the other. The built-in data-tier roles (data_hot, data_warm, data_cold) are the recommended path: they integrate with _tier_preference fallback and searchable snapshots, and ILM’s migrate action targets them automatically. The older custom node attribute approach (node.attr.data: hot plus an explicit allocate.require filter) still works and is shown here because many existing clusters run it, but do not mix the two schemes for the same index — the allocator will try to satisfy both and can deadlock.

Node configuration (elasticsearch.yml):

# --- Hot tier node ---
node.roles: [ data_hot, ingest ]     # write path + latency-sensitive queries
node.attr.data: hot                   # custom attribute (only if using require filters)

# --- Warm tier node ---
node.roles: [ data_warm ]             # read-heavy, no ingest role needed
node.attr.data: warm

# --- Cold tier node ---
node.roles: [ data_cold ]             # dense, read-only archival
node.attr.data: cold

Keep master-eligible roles on separate dedicated nodes; co-locating master with data_hot means a heap spike from a large forcemerge can cost you the deployment’s coordination layer. With roles in place, the tier-preference chain gives you graceful fallback: setting index.routing.allocation.include._tier_preference: data_warm,data_hot tells the allocator to prefer warm nodes and fall back to hot ones when the cluster has no warm-tier nodes. The fallback keys on whether a tier has nodes at all, not on how full it is. If warm nodes exist but are over their disk watermark, the shards stay where they are on the hot tier until space frees up, so a temporarily saturated warm tier still degrades to “stays hot a little longer” instead of an unassigned-shard incident.
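How a preference chain resolves can be modelled in a few lines. The sketch below is an illustration only and assumes the fallback is decided by whether a tier has any nodes in the cluster; disk watermarks are handled by a separate allocation decider and are deliberately not modelled. `resolve_tier` is a hypothetical name.

```python
from typing import Optional, Set


def resolve_tier(preference: str, tiers_with_nodes: Set[str]) -> Optional[str]:
    """Pick the first tier in a comma-separated preference that has nodes."""
    for tier in (part.strip() for part in preference.split(",")):
        if tier and tier in tiers_with_nodes:
            return tier
    # No listed tier exists in the cluster: the shard cannot be placed.
    return None


if __name__ == "__main__":
    chain = "data_warm,data_hot"
    print(resolve_tier(chain, {"data_hot", "data_warm"}))
    print(resolve_tier(chain, {"data_hot"}))
```

Listing the previous tier as the last entry of every chain is what keeps a missing tier from turning into unassigned shards.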

The Tiered ILM Policy

The policy is a single declarative object; each phase boundary is annotated below to show exactly what it controls. This is the shape you would create through the ILM policy design workflow and version in source control rather than editing live.

{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "7d"
          },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "allocate": {
            "number_of_replicas": 1
          },
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 },
          "set_priority": { "priority": 50 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "allocate": {
            "number_of_replicas": 0
          },
          "readonly": {},
          "set_priority": { "priority": 0 }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}

Two ordering facts matter here. First, actions inside a phase run in a fixed internal order — allocation and priority are applied before destructive actions like shrink and forcemerge, regardless of how you arrange the JSON — so an index always finishes relocating to the warm tier before it is shrunk on it. Second, forcemerge on the warm tier is I/O- and heap-intensive; running it right after shrink on nodes that also serve queries is deliberate, because warm nodes have spare CPU that hot nodes do not. Dropping number_of_replicas to 0 in the cold phase halves disk use but removes redundancy, which is only safe when the data is also covered by a snapshot repository.
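Because the policy lives in source control, a few of its invariants can be checked before it is applied. The following is a minimal lint sketch, not an Elasticsearch feature: `lint_policy` and `to_seconds` are hypothetical helpers, the duration parser handles only the `ms`, `s`, `m`, `h`, and `d` units, and the checks cover just the phase-boundary ordering, the shrink factor rule, and the zero-replica caveat discussed on this page.

```python
import re
from typing import Any, Dict, List

_UNITS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400}
PHASE_ORDER = ("hot", "warm", "cold", "frozen", "delete")


def to_seconds(value: str) -> float:
    """Parse a simple duration such as '7d' or '0ms' into seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)", value)
    if not match:
        raise ValueError(f"unsupported duration: {value!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def lint_policy(policy: Dict[str, Any], source_shards: int) -> List[str]:
    """Return human-readable problems found in a policy body ({'phases': ...})."""
    problems: List[str] = []
    previous = -1.0
    for name in PHASE_ORDER:
        phase = policy["phases"].get(name)
        if phase is None:
            continue
        age = to_seconds(phase.get("min_age", "0ms"))
        if age < previous:
            problems.append(f"{name}: min_age is earlier than the preceding phase")
        previous = age
        actions = phase.get("actions", {})
        target = actions.get("shrink", {}).get("number_of_shards")
        if target and source_shards % target != 0:
            problems.append(f"{name}: shrink target {target} does not divide {source_shards}")
        if actions.get("allocate", {}).get("number_of_replicas") == 0:
            problems.append(f"{name}: zero replicas, confirm snapshot coverage")
    return problems


EXAMPLE = {"phases": {
    "hot": {"min_age": "0ms", "actions": {}},
    "warm": {"min_age": "7d", "actions": {"shrink": {"number_of_shards": 1}}},
    "cold": {"min_age": "30d", "actions": {"allocate": {"number_of_replicas": 0}}},
    "delete": {"min_age": "90d", "actions": {"delete": {}}},
}}

if __name__ == "__main__":
    for problem in lint_policy(EXAMPLE, source_shards=3):
        print(problem)
```

Running a check like this in CI catches a bad boundary edit before ILM ever sees it.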

Step-by-Step Implementation

Roll the topology out in an order that never leaves an index unmanaged or a tier without capacity.

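1. Register the policy. Create the lifecycle definition before any template references it. The request below is abbreviated to the hot phase; submit the full four-phase body from the previous section, otherwise indices roll over but never leave the hot tier.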

PUT _ilm/policy/logs-hot-warm-cold
{
  "policy": { "phases": { "hot": { "min_age": "0ms", "actions": {
    "rollover": { "max_primary_shard_size": "50gb", "max_age": "7d" } } } } }
}

2. Bind the policy through an index template. Every index created from the pattern inherits the policy and write alias, so nothing is ever managed by hand.

PUT _index_template/logs-tiered
{
  "index_patterns": ["logs-app-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "logs-hot-warm-cold",
      "index.lifecycle.rollover_alias": "logs-app-write",
      "index.routing.allocation.include._tier_preference": "data_hot",
      "number_of_shards": 3,
      "number_of_replicas": 1
    }
  }
}

3. Bootstrap the first backing index and write alias. ILM cannot roll over an alias that does not yet exist.

PUT logs-app-000001
{ "aliases": { "logs-app-write": { "is_write_index": true } } }

4. Automate the rollout with the Python v8+ client. The script below deploys the policy and template idempotently and, for legacy indices that predate ILM, reindexes them into the managed alias so they join the tiered lifecycle instead of bypassing retention.

import logging
from elasticsearch import Elasticsearch
from elasticsearch.exceptions import ApiError, BadRequestError, ConnectionError

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("hot_warm_cold")

es = Elasticsearch(
    "https://es-node-01:9200",
    api_key=("id", "api_key_string"),   # least-privilege service account, not basic auth
    verify_certs=True,
    ca_certs="/path/to/ca.crt",
    request_timeout=30,
    max_retries=3,
    retry_on_timeout=True,
)

POLICY_NAME = "logs-hot-warm-cold"
TEMPLATE_NAME = "logs-tiered"

POLICY = {
    "phases": {
        "hot": {"min_age": "0ms", "actions": {
            "rollover": {"max_primary_shard_size": "50gb", "max_age": "7d"},
            "set_priority": {"priority": 100}}},
        # allocate carries no require filter here, so ILM's implicit migrate action
        # moves shards by data-tier role and cannot conflict with _tier_preference.
        "warm": {"min_age": "7d", "actions": {
            "allocate": {"number_of_replicas": 1},
            "shrink": {"number_of_shards": 1},
            "forcemerge": {"max_num_segments": 1},
            "set_priority": {"priority": 50}}},
        "cold": {"min_age": "30d", "actions": {
            "allocate": {"number_of_replicas": 0},
            "readonly": {}, "set_priority": {"priority": 0}}},
        "delete": {"min_age": "90d", "actions": {"delete": {}}},
    }
}

def deploy() -> None:
    """Create or replace the policy and template idempotently (PUT semantics)."""
    try:
        es.ilm.put_lifecycle(name=POLICY_NAME, policy=POLICY)
        logger.info("Policy '%s' applied.", POLICY_NAME)
    except BadRequestError as exc:
        logger.error("Policy rejected as invalid: %s", exc.info)
        raise

    es.indices.put_index_template(
        name=TEMPLATE_NAME,
        index_patterns=["logs-app-*"],
        template={
            "settings": {
                "index.lifecycle.name": POLICY_NAME,
                "index.lifecycle.rollover_alias": "logs-app-write",
                "index.routing.allocation.include._tier_preference": "data_hot",
                "number_of_shards": 3,
                "number_of_replicas": 1,
            },
            "mappings": {"properties": {"@timestamp": {"type": "date"}}},
        },
    )
    logger.info("Template '%s' applied.", TEMPLATE_NAME)

def reindex_unmanaged(source_pattern: str, write_alias: str) -> None:
    """Fold pre-ILM indices into the managed alias so they inherit the tiered policy."""
    try:
        resp = es.reindex(
            source={"index": source_pattern},
            dest={"index": write_alias, "op_type": "create"},
            conflicts="proceed",
            wait_for_completion=False,   # returns a task id for long migrations
        )
        logger.info("Reindex task submitted: %s", resp["task"])
    except ApiError as exc:
        logger.error("Reindex failed: %s", exc.error)

if __name__ == "__main__":
    try:
        deploy()
        if not es.indices.exists_alias(name="logs-app-write"):
            es.indices.create(
                index="logs-app-000001",
                aliases={"logs-app-write": {"is_write_index": True}},
            )
        reindex_unmanaged("logs-app-legacy-*", "logs-app-write")
    except ConnectionError as exc:
        logger.error("Cluster unreachable: %s", exc)

Verifying Tier Placement

Never assume a phase completed — confirm the shards actually moved. These three read calls answer the only questions that matter after a transition.

Where is each index in its lifecycle?

GET logs-app-000001/_ilm/explain

Check that phase matches the tier you expect and that step is complete rather than ERROR. If step is ERROR, the step_info object names the underlying cause.
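When the explain call covers many indices, reading the JSON by eye does not scale. The sketch below summarises an explain response into one line per failed index. It assumes the response shape of the ILM explain API (an `indices` mapping with `managed`, `phase`, `action`, `step`, and `step_info` fields); `stalled_indices` is a hypothetical helper and the sample data is hand-written.

```python
from typing import Any, Dict


def stalled_indices(explain: Dict[str, Any]) -> Dict[str, str]:
    """Map index name to a short reason for every managed index in the ERROR step."""
    stalled: Dict[str, str] = {}
    for name, info in explain.get("indices", {}).items():
        if info.get("managed") and info.get("step") == "ERROR":
            reason = info.get("step_info", {}).get("reason", "no reason reported")
            stalled[name] = f"{info.get('phase')}/{info.get('action')}: {reason}"
    return stalled


# Hand-written sample in the assumed response shape (not real cluster output).
sample_explain = {
    "indices": {
        "logs-app-000001": {"managed": True, "phase": "warm",
                            "action": "complete", "step": "complete"},
        "logs-app-000002": {"managed": True, "phase": "warm",
                            "action": "shrink", "step": "ERROR",
                            "step_info": {"type": "illegal_argument_exception",
                                          "reason": "example failure text"}},
        "scratch": {"managed": False},
    }
}

if __name__ == "__main__":
    for index, why in stalled_indices(sample_explain).items():
        print(index, "->", why)
```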

Did the shards land on the right tier?

GET _cat/shards/logs-app-000001?v&h=index,shard,prirep,state,node

Cross-reference the node column against your tier naming — warm-phase shards should sit on data_warm nodes. A shard still on a hot node after the warm min_age means the allocator could not place it.
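The cross-reference can be automated by joining the shard listing with a node-to-roles map. This sketch assumes the shard rows are the JSON form of the `_cat/shards` columns requested above and that you have already built a mapping from node name to its roles; `misplaced_shards` is a hypothetical helper.

```python
from typing import Any, Dict, List, Optional, Tuple

ShardRef = Tuple[str, str, str, Optional[str]]


def misplaced_shards(shards: List[Dict[str, Any]],
                     node_roles: Dict[str, List[str]],
                     expected_role: str) -> List[ShardRef]:
    """Return (index, shard, prirep, node) for shards not on the expected tier.

    Unassigned shards have no node and are reported as misplaced too.
    """
    return [
        (row["index"], row["shard"], row["prirep"], row.get("node"))
        for row in shards
        if expected_role not in node_roles.get(row.get("node"), [])
    ]


# Hand-written sample rows (not real cluster output).
sample_shards = [
    {"index": "logs-app-000001", "shard": "0", "prirep": "p",
     "state": "STARTED", "node": "warm-01"},
    {"index": "logs-app-000001", "shard": "0", "prirep": "r",
     "state": "STARTED", "node": "hot-01"},
    {"index": "logs-app-000001", "shard": "1", "prirep": "p",
     "state": "UNASSIGNED", "node": None},
]
sample_roles = {"warm-01": ["data_warm"], "hot-01": ["data_hot", "ingest"]}

if __name__ == "__main__":
    for ref in misplaced_shards(sample_shards, sample_roles, "data_warm"):
        print(ref)
```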

Why won’t a shard move? When a transition hangs on allocate, this is the definitive answer:

GET _cluster/allocation/explain
{ "index": "logs-app-000001", "shard": 0, "primary": true }

The response’s deciders array explains each rejection — most often a disk_threshold decider on a warm tier that is over its high watermark, or a filter decider when no node carries the required attribute or role.
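The deciders output is verbose, so it helps to reduce it to the rejecting deciders per node. The sketch below assumes the allocation explain response carries a `node_allocation_decisions` list whose entries have `node_name` and a `deciders` list with `decider` and `decision` fields; `blocking_deciders` is a hypothetical helper and the sample is hand-written.

```python
from typing import Any, Dict, List


def blocking_deciders(explain: Dict[str, Any]) -> Dict[str, List[str]]:
    """Map each candidate node to the deciders that rejected the shard."""
    blockers: Dict[str, List[str]] = {}
    for node in explain.get("node_allocation_decisions", []):
        rejected = [d["decider"] for d in node.get("deciders", [])
                    if d.get("decision") == "NO"]
        if rejected:
            blockers[node.get("node_name", "unknown")] = rejected
    return blockers


# Hand-written sample in the assumed response shape (not real cluster output).
sample_allocation = {
    "index": "logs-app-000001",
    "shard": 0,
    "primary": True,
    "node_allocation_decisions": [
        {"node_name": "warm-01", "deciders": [
            {"decider": "disk_threshold", "decision": "NO",
             "explanation": "example text"}]},
        {"node_name": "hot-02", "deciders": [
            {"decider": "filter", "decision": "NO",
             "explanation": "example text"},
            {"decider": "same_shard", "decision": "YES",
             "explanation": "example text"}]},
    ],
}

if __name__ == "__main__":
    print(blocking_deciders(sample_allocation))
```

A tier where every node reports `disk_threshold` needs capacity; a tier where every node reports `filter` needs its roles or attributes fixed.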

Threshold Tuning and Performance Guidance

The single most consequential number is primary shard size. Target 30–50 GB per primary: below ~10 GB you fragment the deployment into thousands of tiny shards that bloat cluster state and slow master-node updates; above ~50 GB, recovery, relocation, and forcemerge all stretch into hours and a single hot shard can dominate query latency. max_primary_shard_size is the rollover condition that holds this line under variable ingest far better than max_docs or max_age alone — see configuring rollover based on max primary shard size for the full sizing matrix against disk IOPS and heap.
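The sizing rule reduces to simple arithmetic. The sketch below is a back-of-the-envelope model, assuming ingest is measured as on-disk primary size and spreads evenly across primaries; both function names are hypothetical and the 50 GB default is the upper end of the range given above.

```python
import math


def primaries_needed(daily_ingest_gb: float, rollover_days: float,
                     target_shard_gb: float = 50.0) -> int:
    """Primaries required so each stays at or under the target when max_age fires."""
    return max(1, math.ceil(daily_ingest_gb * rollover_days / target_shard_gb))


def days_to_rollover(daily_ingest_gb: float, primaries: int,
                     max_primary_shard_gb: float = 50.0) -> float:
    """Days until max_primary_shard_size triggers, given even shard growth."""
    return primaries * max_primary_shard_gb / daily_ingest_gb


if __name__ == "__main__":
    # 20 GB/day with a 7-day max_age is 140 GB per index, so 3 primaries.
    print(primaries_needed(20, 7))
    # With 3 primaries the size condition fires after 7.5 days,
    # so max_age (7d) is the condition that rolls the index first.
    print(days_to_rollover(20, 3))
```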

Size the phase boundaries to your hardware, not to round numbers. The warm tier’s job is to absorb shrink and forcemerge, both of which briefly need heap and disk headroom roughly equal to the index being merged — so a warm node needs free disk of at least the largest single index it will hold, and enough heap that a one-segment forcemerge does not push it toward a circuit-breaker trip. Keep heap at or below ~50% of RAM and never above ~31 GB (the compressed-oops boundary), and let the filesystem cache use the rest, which is what makes dense warm and cold nodes fast for reads despite spinning or cheaper storage. On the cold tier, dropping to zero replicas is the main lever: it halves disk footprint at the cost of redundancy, acceptable only when the data is also in a snapshot repository you can restore from.
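The heap and headroom rules above can be written down as two small checks. This is a sketch of the stated rules of thumb only, with hypothetical function names; the 50% and 31 GB figures come from the paragraph above, not from measurement.

```python
def recommended_heap_gb(ram_gb: float, ceiling_gb: float = 31.0) -> float:
    """Half of RAM, capped at the compressed-oops ceiling."""
    return min(ram_gb * 0.5, ceiling_gb)


def has_merge_headroom(free_disk_gb: float, largest_index_gb: float) -> bool:
    """True when a warm node can hold a second copy of its largest index."""
    return free_disk_gb >= largest_index_gb


if __name__ == "__main__":
    for ram in (32, 64, 128):
        heap = recommended_heap_gb(ram)
        print(f"{ram} GB RAM: heap {heap} GB, filesystem cache {ram - heap} GB")
```

Everything not given to the heap is left to the filesystem cache, which is why adding RAM to a dense node helps reads even after the heap has hit its ceiling.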

Finally, mind the poll interval. ILM evaluates the state machine every indices.lifecycle.poll_interval (default 10m), so a transition can lag its min_age by up to that interval — a feature in production, but shorten it in test clusters to iterate faster. Do not shorten it below what the deployment can service; every poll re-evaluates every managed index.

Troubleshooting Stuck Tier Transitions

Most “ILM is broken” reports are really allocation problems wearing an ILM costume. Diagnose from GET <index>/_ilm/explain first, read step_info, then match the symptom below.

Symptom	Root cause	Remediation
Transition to warm/cold hangs on `allocate`, index stays on hot	Target tier has no node with the required role/attribute, or is over the high disk watermark	Add a node to the tier or free disk; confirm with `GET _cluster/allocation/explain`; verify roles with `GET _cat/nodes?v&h=name,node.role`
`step` is `ERROR` at `shrink`	Not all primaries co-located on one node, or `number_of_shards` is not a factor of the source count	Ensure a copy of every shard is on one node and set `index.blocks.write: true`; pick a target shard count that divides the source; then `POST /<index>/_ilm/retry`
`forcemerge` never finishes, warm-node heap spikes	Warm node undersized for a single-segment merge, or forcemerge overlaps a snapshot window	Raise `max_num_segments` to `2`, add warm heap, or stagger SLM and ILM schedules so heavy I/O does not collide
Shards go `UNASSIGNED` after a phase change	Both a data-tier role and a conflicting custom `require` filter are set on the same index	Standardize on one scheme; clear the stale `index.routing.allocation.*` setting, then retry
Index never leaves Hot despite age	It has not rolled over — write alias missing or `is_write_index` unset	Repair the alias so exactly one index has `is_write_index: true`; `min_age` only starts counting at rollover
Whole tier suddenly read-only	Flood-stage disk watermark breached on that tier	Free or add disk; Elasticsearch releases `index.blocks.read_only_allow_delete` automatically once usage drops below the high watermark, so clear it by hand only if the block persists

When a step sits in ERROR after you have fixed the underlying condition, re-run it for that one index with POST /<index>/_ilm/retry; retry re-runs the failed step, so retrying without fixing the cause simply returns the index to ERROR. For continuous detection of these states across many indices rather than one-off triage, wire in the patterns from monitoring ILM execution and error states. And if a tier failure threatens a retention SLA, fallback routing for data retention covers redirecting reads to snapshot-mounted indices while you recover.
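Retrying across a whole index pattern can be scripted once the root cause is fixed. The sketch below assumes the elasticsearch-py 8.x client used earlier on this page, specifically its `ilm.explain_lifecycle` (with `only_errors`) and `ilm.retry` methods; `retry_errored` is a hypothetical helper, and it takes the client as a parameter so it can be exercised without a cluster.

```python
from typing import Any, List


def retry_errored(client: Any, pattern: str) -> List[str]:
    """Retry the failed ILM step for every index under `pattern` in the ERROR step.

    Call this only after fixing the underlying condition; otherwise each
    index simply returns to ERROR on the next evaluation.
    """
    explain = client.ilm.explain_lifecycle(index=pattern, only_errors=True)
    retried: List[str] = []
    for name, info in explain["indices"].items():
        # Skip entries that are listed for other reasons than a failed step.
        if info.get("step") != "ERROR":
            continue
        client.ilm.retry(index=name)
        retried.append(name)
    return retried
```

With the client from the rollout script, the call would look like `retry_errored(es, "logs-app-*")`, and the returned list is worth logging so the retry is auditable.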

Frequently Asked Questions

Should I use data-tier roles or custom node.attr.data attributes?

Prefer the built-in data-tier roles (data_hot, data_warm, data_cold). They integrate with _tier_preference fallback, ILM's migrate action, and searchable snapshots without hand-written routing rules. Custom attributes with explicit allocate.require filters still work and appear in many older clusters, but do not apply both schemes to the same index — the allocator tries to satisfy both requirements at once and can deadlock the shard.

Do I need all three tiers to benefit from this architecture?

No. Hot-plus-warm is the most common production shape and already captures most of the cost saving, because the write-heavy hot tier is the expensive one. Add a cold tier only when you retain data long enough that dense, replica-free, read-only storage pays off, and a frozen tier with searchable snapshots when you must keep data queryable but rarely touch it. Omit any phase you do not need — the policy simply skips straight to the next defined phase.

Why is my index still on the hot tier long after its warm min_age?

Phase min_age is measured from the index's rollover time, not its creation time, so a slow-filling index that has not rolled over yet has an effective phase age of zero. Check GET <index>/_ilm/explain for age and whether rollover has fired. If the write alias is misconfigured, rollover never happens and no later phase — or tier move — ever begins.

Is it safe to set number_of_replicas to 0 in the cold phase?

Only when the data is also protected by a snapshot repository. Dropping to zero replicas halves the cold tier's disk footprint, which is usually the point, but it removes in-cluster redundancy — a single node loss then means data loss until you restore. Pair a zero-replica cold phase with snapshot lifecycle management so the durability moved off the live cluster still exists somewhere recoverable.
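The "halves" claim is just the copy count at work. A minimal sketch, with a hypothetical function name and illustrative numbers:

```python
def tier_footprint_gb(primary_gb: float, replicas: int) -> float:
    """Disk used by an index: one primary copy plus one copy per replica."""
    return primary_gb * (1 + replicas)


if __name__ == "__main__":
    # Going from one replica to zero removes exactly one of two copies.
    print(tier_footprint_gb(500, 1), tier_footprint_gb(500, 0))
```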

Can I change tier boundaries on a policy that is already managing indices?

Yes. put_lifecycle updates the policy in place and bumps its version. An already-managed index keeps executing its current phase from the phase definition cached when it entered that phase, and picks up the new version when it moves to the next phase. Because live edits are hard to audit, route every boundary change through a version-controlled apply step rather than editing in Kibana Dev Tools.

Related

Configuring Index Rollover Conditions: the Hot-phase trigger that decides when an index leaves the hot tier.
How to Configure Rollover Based on Max Primary Shard Size: the sizing matrix behind stable shards.
Securing ILM Policies with RBAC: least-privilege control over the policies that drive tier moves.
Fallback Routing for Data Retention: surviving a tier failure without breaking retention SLAs.
Monitoring ILM Execution & Error States: detecting stuck tier transitions across many indices.
