Automated Reindexing Pipelines & Workflows

Automated reindexing pipelines rebuild Elasticsearch indices (remapping fields, resharding, and consolidating data streams) by copying into a fresh target and cutting over atomically, so searches are never interrupted and no acknowledged write is lost. In production, reindexing is not an ad-hoc administrative chore but a stateful, idempotent data migration that must synchronize with Index Lifecycle Management (ILM) policies, respect cluster resource boundaries, and guarantee an atomic alias cutover. The architecture and execution patterns below are engineered for tiered hot/warm/cold deployments, high-throughput log analytics ingestion, and search index schema migrations where a botched rollover means either data loss or a deployment-wide latency spike.

This guide sits alongside the two core domains of the site: the ILM architecture and fundamentals that govern how indices age, and ILM policy design and lifecycle synchronization that keeps those policies deterministic across environments. Reindexing is where both meet a live data plane.

Reindex Workflow at a Glance

A safe reindex is a short state machine. The source index is frozen for writes but kept readable, a fresh target is provisioned with write-optimized settings and the correct lifecycle policy attached before any document moves, the copy runs throttled and idempotently, and only then is the write alias swapped in a single cluster-state update. Production settings are restored last, and the pipeline verifies lifecycle alignment before decommissioning the source.

The write alias moves in a single cluster-state update — step 5 is the only irreversible transition.
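
One way to make that ordering explicit in pipeline code is a small enum with a transition check. This is only a sketch: the state names and the `advance` helper are illustrative labels for the sequence described above, not Elasticsearch concepts.

```python
from enum import IntEnum


class ReindexState(IntEnum):
    """Illustrative state names; the numbering mirrors the workflow order."""
    SOURCE_LOCKED = 1
    TARGET_PROVISIONED = 2
    POLICY_ATTACHED = 3
    COPIED = 4
    ALIAS_SWAPPED = 5
    SETTINGS_RESTORED = 6


def advance(current: ReindexState, requested: ReindexState) -> ReindexState:
    """Allow only a move to the next state; repeating the current state is a no-op."""
    if requested == current:
        return current  # idempotent re-run of a step that already completed
    if requested != current + 1:
        raise ValueError(f"illegal transition {current.name} -> {requested.name}")
    return requested
```

A pipeline that persists the current state between runs can use a check like this to refuse an alias swap before the copy has been recorded as complete.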

Core Architecture: Node Roles & Reindex Settings

Reindexing is a coordinating-node operation. The node that receives the _reindex request drives a scroll over the source and issues bulk writes to the target, so its heap, network throughput, and bulk thread pool bound the whole migration. In large clusters, route reindex traffic to dedicated coordinating-only nodes to keep scroll contexts and bulk queues off the data nodes serving production queries. Where the source and target live on different tiers, shard allocation is what physically places the new primaries: the target index’s index.routing.allocation.require.data attribute must resolve to nodes that actually have capacity, or the new shards sit UNASSIGNED while the reindex stalls on write rejections.

The hot-warm-cold architecture also dictates where a reindex belongs in the lifecycle. Reindexing a write-heavy index should target hot-tier nodes with high-IOPS NVMe storage; a one-off consolidation of historical data can target warm nodes directly. The settings that matter most on the target are deliberately non-default during the copy:

```
PUT /events-2026-000002
{
  "settings": {
    "index.number_of_shards": 6,
    "index.number_of_replicas": 0,
    "index.refresh_interval": "-1",
    "index.translog.durability": "async",
    "index.routing.allocation.require.data": "hot",
    "index.lifecycle.name": "events-ilm",
    "index.lifecycle.rollover_alias": "events-write"
  }
}
```

number_of_replicas: 0 removes replication overhead from every bulk write; refresh_interval: -1 suppresses per-batch segment generation; translog.durability: async trades a small durability window for throughput during a migration you can restart anyway. All three are reverted after cutover. Keeping number_of_shards deliberate — rather than inherited — is often the entire reason for the reindex: it is the only way to change primary shard count on an existing index short of a shrink or split.
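
Keeping the copy-time overrides and their production counterparts side by side makes it harder to forget one of the three on the way back. The sketch below is one possible layout; the function names and the default restore values (one replica, a `1s` refresh) are assumptions to adjust for your own baseline.

```python
# The three settings that are deliberately non-default during the copy.
COPY_TIME_OVERRIDES = {
    "index.number_of_replicas": 0,
    "index.refresh_interval": "-1",
    "index.translog.durability": "async",
}


def target_settings(shards: int, policy: str, rollover_alias: str, tier: str = "hot") -> dict:
    """Settings for creating the target: structural values plus copy-time overrides."""
    return {
        "index.number_of_shards": shards,
        "index.routing.allocation.require.data": tier,
        "index.lifecycle.name": policy,
        "index.lifecycle.rollover_alias": rollover_alias,
        **COPY_TIME_OVERRIDES,
    }


def restore_settings(replicas: int = 1, refresh_interval: str = "1s") -> dict:
    """Values to put back after cutover; the defaults here are assumptions."""
    return {
        "index.number_of_replicas": replicas,
        "index.refresh_interval": refresh_interval,
        "index.translog.durability": "request",  # Elasticsearch's default durability
    }
```

Because `restore_settings()` covers exactly the keys in `COPY_TIME_OVERRIDES`, a unit test can assert that the two sets never drift apart.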

Reindex Workflow Mechanics

Phase transitions during a reindex require explicit lifecycle anchoring, because ILM manages only indices that carry index.lifecycle.name: a target created without policy metadata stays unmanaged until that setting is applied. The ordered sequence below is the canonical zero-downtime workflow.

Lock the source. Set index.blocks.write: true to stop ingestion while preserving read availability. With writes blocked, no document can land on the source after the copy has read past it, so the target ends up a complete copy.
```
PUT /events-2026-000001/_settings
{ "index.blocks.write": true }
```
Provision the target. Create it with the write-optimized settings shown above. Match number_of_shards to your target shard-sizing goal (30–50 GB per primary shard is the usual Lucene sweet spot), not to the source.
Attach the policy and alias. Apply index.lifecycle.name and index.lifecycle.rollover_alias to the target before data transfer begins. Set them at creation time: an index without index.lifecycle.name is not managed by ILM at all, so it silently skips your configured phase actions until the setting is applied. The exact policy payload, and how to version it across staging and production, is covered in building custom ILM policies via API, and the rollover conditions you set here determine when the target itself will next roll over.
Reindex idempotently. Run the copy with op_type: "create" so that documents already present on the target from a previous failed attempt are skipped rather than overwritten, and conflicts: "proceed" so a single version conflict does not abort the whole task.

Swap the alias atomically. Execute one _aliases request that removes the write alias from the source and adds it to the target in the same cluster-state update, eliminating any routing gap.

```
POST /_aliases
{
  "actions": [
    { "remove": { "index": "events-2026-000001", "alias": "events-write" } },
    { "add":    { "index": "events-2026-000002", "alias": "events-write", "is_write_index": true } }
  ]
}
```

Restore and verify. Revert number_of_replicas, refresh_interval, and translog.durability, then confirm the target is under lifecycle control with GET /events-2026-000002/_ilm/explain.
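
The steps above can be strung together in a few lines against an elasticsearch-py v8 style client. This is a minimal sketch rather than a complete pipeline: `run_cutover` and its parameters are hypothetical, the `copy` callable stands in for whatever submits and waits for the reindex, and error handling, verification, and rollback are left out.

```python
from typing import Callable


def run_cutover(es, source: str, target: str, alias: str,
                create_settings: dict, restore: dict,
                copy: Callable[[], None]) -> None:
    """Run the workflow steps in order.

    `copy` is a caller-supplied callable that performs the reindex and
    returns only once it has completed.
    """
    # Block writes on the source; reads keep working.
    es.indices.put_settings(index=source, settings={"index.blocks.write": True})
    # Create the target with copy-time settings and lifecycle metadata.
    es.indices.create(index=target, settings=create_settings)
    # Copy documents (throttled, op_type=create) and wait for completion.
    copy()
    # Move the write alias in one atomic request.
    es.indices.update_aliases(actions=[
        {"remove": {"index": source, "alias": alias}},
        {"add": {"index": target, "alias": alias, "is_write_index": True}},
    ])
    # Put production settings back on the target.
    es.indices.put_settings(index=target, settings=restore)
```

As written, a second run fails at the create call if the target already exists, so a resumable pipeline would need to check for the target first or track which steps have already completed.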

Failure to synchronize ILM metadata mid-pipeline produces policy drift: the new index is left unmanaged, ignores your retention rules, and never rolls over. Always validate GET /<target>/_ilm/explain post-cutover to confirm the index reports the expected phase and, where applicable, an armed rollover action.
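
That validation is easy to automate against the explain response body. The helper below is a sketch that assumes the usual response shape (an `indices` map whose entries carry `managed`, `policy`, `phase`, and `step`); the function name and the default expected phase are illustrative.

```python
def check_ilm_state(explain: dict, index: str, expected_policy: str,
                    expected_phase: str = "hot") -> list[str]:
    """Return the problems found in a GET /<index>/_ilm/explain response body."""
    info = explain.get("indices", {}).get(index)
    if info is None:
        return [f"{index} missing from explain response"]
    if not info.get("managed", False):
        return [f"{index} is not managed by ILM"]
    problems = []
    if info.get("policy") != expected_policy:
        problems.append(f"policy is {info.get('policy')!r}, expected {expected_policy!r}")
    if info.get("phase") != expected_phase:
        problems.append(f"phase is {info.get('phase')!r}, expected {expected_phase!r}")
    if info.get("step") == "ERROR":
        problems.append(f"ILM step failed: {info.get('failed_step')!r}")
    return problems
```

With the v8 Python client the input would typically come from `es.ilm.explain_lifecycle(index=target)`; an empty list means the target looks correctly managed.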

Security & Governance

A reindex pipeline is a privileged data-plane actor: it creates indices, mutates aliases that route production writes, and can delete the source. Treat its credentials accordingly. The automation account should hold a narrowly scoped role — create_index, manage and write on the target index pattern, read on the source, and manage_ilm only if the pipeline also attaches policies — rather than a deployment-wide superuser API key. Policy ownership is governed by the same controls described in securing ILM policies with RBAC: the identity that runs reindexes should not also be the identity that can rewrite lifecycle definitions, so that a compromised pipeline token cannot quietly extend or bypass retention.
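
As an illustration of that scoping, the role can be generated from the source and target patterns rather than written by hand. The sketch below builds a body in the shape the create-role API (PUT /_security/role/<name>) accepts; the function name and patterns are hypothetical.

```python
def reindex_role(source_pattern: str, target_pattern: str,
                 attaches_policies: bool = False) -> dict:
    """Build a narrowly scoped role body for the reindex automation account."""
    return {
        # manage_ilm is a cluster privilege; grant it only when needed.
        "cluster": ["manage_ilm"] if attaches_policies else [],
        "indices": [
            {"names": [source_pattern], "privileges": ["read"]},
            {"names": [target_pattern], "privileges": ["create_index", "manage", "write"]},
        ],
    }
```

Note that `read` alone does not cover updating settings or aliases on the source, so if the same identity also sets the write block or removes the alias from the source, the source entry would need `manage` as well, or a separate identity should perform those steps.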

Every alias swap and index deletion must land in an external audit log. Because the cutover is the one irreversible step, wrap it in a CI/CD job that records the source and target index names, the operator or service account, and the pre-swap document counts of both indices. Version the reindex specification itself as infrastructure-as-code so migrations are reviewable and replayable, and gate destructive steps (source deletion) behind a manual approval or an enforced observation window. When compliance requires immutable retention, keep a read-only alias on the decommissioned source rather than deleting it outright, and route archival reads through the fallback routing for data retention patterns.
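
A minimal sketch of such an audit record follows, assuming an elasticsearch-py v8 style client and a caller that ships the returned line to an external log. The function name and field names are illustrative. The target is refreshed first because, with refresh_interval disabled during the copy, its count would otherwise not reflect recently indexed documents.

```python
import json
from datetime import datetime, timezone


def cutover_audit_record(es, source: str, target: str, alias: str, actor: str) -> str:
    """Build a JSON audit line with pre-swap state; call it just before the alias swap."""
    # Make everything copied so far visible to the count below.
    es.indices.refresh(index=target)
    record = {
        "event": "alias_cutover",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "alias": alias,
        "source_index": source,
        "target_index": target,
        "source_docs": es.count(index=source)["count"],
        "target_docs": es.count(index=target)["count"],
    }
    return json.dumps(record, sort_keys=True)
```

A CI/CD job could also compare `source_docs` and `target_docs` from the same record and abort the swap when they differ unexpectedly.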

Production Automation with Python v8+

Reindex automation must survive network partitions, partial task failures, and cluster restarts. Called with wait_for_completion=False, the elasticsearch-py v8 client submits the reindex asynchronously and returns a server-assigned task id. Combined with op_type: "create", this makes the operation idempotent: re-running a failed pipeline never duplicates or overwrites documents. The implementation below submits the task, polls it to completion with bounded backoff, and reattaches to an in-flight task id on retry instead of launching a second copy.

```python
import logging
import time

from elasticsearch import Elasticsearch
from elasticsearch.exceptions import ApiError, ConnectionError

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def execute_idempotent_reindex(
    es: Elasticsearch,
    source_index: str,
    target_index: str,
    batch_size: int = 5000,
    rps: float = 2000.0,
    max_retries: int = 5,
) -> dict:
    """Submit an ILM-safe reindex and poll it to completion.

    Idempotency comes from op_type="create": documents that already exist on
    the target (e.g. from a prior interrupted run) are skipped, not rewritten.
    Elasticsearch assigns the task id; there is no caller-supplied task_id.
    """
    task_id = None
    for attempt in range(max_retries):
        try:
            if task_id is None:
                # Async submit so a large migration never trips the HTTP timeout.
                response = es.reindex(
                    source={"index": source_index, "size": batch_size},
                    dest={"index": target_index, "op_type": "create"},
                    conflicts="proceed",
                    requests_per_second=rps,
                    wait_for_completion=False,
                )
                task_id = response["task"]
                logger.info("Submitted reindex task %s; polling for completion.", task_id)
            # On a retry, task_id is already set, so we reattach to the
            # in-flight task instead of launching a duplicate copy.
            return _poll_task(es, task_id)
        except (ConnectionError, ApiError) as exc:
            backoff = min(2 ** attempt, 30)
            logger.warning("Attempt %d failed: %s. Retrying in %ss.", attempt + 1, exc, backoff)
            time.sleep(backoff)

    raise RuntimeError(f"Reindex pipeline failed after {max_retries} attempts.")


def _poll_task(es: Elasticsearch, task_id: str, interval: int = 10) -> dict:
    """Poll a reindex task with capped backoff and a hard timeout."""
    deadline = time.monotonic() + 3600  # 1 hour ceiling
    backoff = interval
    while time.monotonic() < deadline:
        try:
            task = es.tasks.get(task_id=task_id)
            if task.get("completed"):
                # A completed task can still carry a task-level error or
                # per-document failures; surface them instead of reporting success.
                failures = task.get("response", {}).get("failures", [])
                if task.get("error") or failures:
                    raise RuntimeError(f"Reindex task {task_id} finished with errors.")
                logger.info("Task %s completed.", task_id)
                return task
        except ApiError as exc:
            # Transient API errors are logged and retried until the deadline.
            logger.error("Task polling failed: %s", exc)
        time.sleep(backoff)
        backoff = min(backoff * 2, 60)

    raise TimeoutError(f"Reindex task {task_id} exceeded its timeout threshold.")
```

The op_type: "create" directive is what makes restarts safe. For clusters that require version-aware conflict resolution rather than skip-on-exist, implement a script-based handler as described in resolving document conflicts during reindex. If a running task is too aggressive or too slow, re-throttle it live without cancelling it via es.reindex_rethrottle(task_id=task_id, requests_per_second=500), which issues POST /_reindex/<task_id>/_rethrottle against the running task.

Throttling, Batch Sizing & Conflict Handling

Unthrottled reindexes saturate bulk thread pools, trip circuit breakers, and degrade query latency for active workloads. Calibrate requests_per_second against node count, JVM heap headroom, and disk I/O; when migrating across heterogeneous or cloud-managed tiers, derive it dynamically from GET /_nodes/stats/thread_pool so the pipeline backs off before a node’s write queue fills. Batch sizing pulls in the opposite direction: oversized source.size batches inflate coordinating-node heap and risk a circuit_breaking_exception, while undersized batches multiply scroll round-trips and stretch the migration window. Empirical baselines for tuning scroll duration and batch size against document payload complexity are collected in optimizing reindex thresholds and bulk sizes. For petabyte-scale migrations, partition the source by time range or routing value and drive independent tasks from a queue rather than one monolithic scroll — the coordination patterns for that live in designing batch reindex workflows.
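
As a rough illustration of deriving the rate dynamically, the function below takes a GET /_nodes/stats/thread_pool response body and proposes the next requests_per_second. Everything about the policy is an assumption chosen for the sketch: the halve-or-grow-by-10% rule, the half-full queue threshold, the floor and ceiling, and the `queue_capacity` default, which should be checked against your own nodes. Because `rejected` is a cumulative counter, the caller passes in the total seen on the previous sample.

```python
def next_rps(node_stats: dict, current_rps: float, previous_rejected: int,
             queue_capacity: int = 10000, floor: float = 100.0,
             ceiling: float = 5000.0) -> tuple[float, int]:
    """Return (next requests_per_second, total rejected count seen in this sample)."""
    pools = [node["thread_pool"]["write"] for node in node_stats["nodes"].values()]
    worst_queue = max((p["queue"] for p in pools), default=0)
    total_rejected = sum(p["rejected"] for p in pools)
    # Back off on any new rejection or when a write queue is more than half full.
    overloaded = total_rejected > previous_rejected or worst_queue > queue_capacity / 2
    proposed = current_rps / 2 if overloaded else current_rps * 1.1
    return max(floor, min(ceiling, proposed)), total_rejected
```

The returned rate can then be applied to the running task with the rethrottle call, and the returned rejected total fed back in as `previous_rejected` on the next sample.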

Monitoring & Observability

A reindex is opaque unless you instrument it. The _tasks API is the primary signal: GET /_tasks/<task_id> exposes total, created, updated, deleted, version_conflicts, and requests_per_second, from which a dashboard can compute throughput and ETA and an alert can fire when version_conflicts climbs unexpectedly. Wiring these counters into your CI/CD pipeline — so a failed migration triggers automated rollback rather than silent data divergence — is detailed in tracking reindex progress and performance.
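
The throughput and ETA calculation can be sketched directly from the task body. The helper below assumes the counters sit under `task.status` and the elapsed time under `task.running_time_in_nanos`, and it treats version conflicts and noops as processed documents; that last point is an assumption worth verifying against your own task output.

```python
from typing import Optional


def reindex_progress(task: dict) -> dict:
    """Summarise a GET /_tasks/<task_id> body for a reindex task."""
    info = task["task"]
    status = info["status"]
    counters = ("created", "updated", "deleted", "noops", "version_conflicts")
    processed = sum(status.get(name, 0) for name in counters)
    elapsed_s = info["running_time_in_nanos"] / 1e9
    rate = processed / elapsed_s if elapsed_s > 0 else 0.0
    remaining = max(status["total"] - processed, 0)
    # ETA is undefined until at least one document has been processed.
    eta: Optional[float] = remaining / rate if rate > 0 else None
    percent = 100.0 * processed / status["total"] if status["total"] else 100.0
    return {
        "processed": processed,
        "percent": percent,
        "docs_per_second": rate,
        "eta_seconds": eta,
        "version_conflicts": status.get("version_conflicts", 0),
    }
```

Emitting this dictionary on every poll gives a dashboard the throughput and ETA series, and gives an alert rule a `version_conflicts` value to watch.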

The surrounding cluster tells you whether the reindex is safe to continue:

GET /_cat/tasks?detailed&v — live reindex tasks with their running time and progress.
GET /_cat/thread_pool/write,search?v&h=node_name,name,active,queue,rejected — rejections here mean the reindex is under-throttled and overloading the cluster; lower requests_per_second.
GET /_cat/allocation?v and GET /_cluster/allocation/explain — why the target’s new shards are unassigned or stuck relocating.
GET /<target>/_ilm/explain — post-cutover confirmation that the target is managed and in the expected phase.

Alert on three composite signals: a rising rejected count on the write thread pool, a target index whose shards remain UNASSIGNED after provisioning, and a reindex task whose created counter has flatlined while the write thread-pool queue grows. Each maps to a distinct remediation in the table below.

Post-Cutover Validation & Optimization

Once the atomic alias swap completes, restore production settings and prepare the target for lifecycle-driven aging:

Restore replicas. Set number_of_replicas back to the policy’s value. Elasticsearch begins allocation immediately; watch cluster.routing.allocation.disk.watermark.low so replica shards do not push a data node past its watermark.
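Reset the refresh interval and translog durability. Revert refresh_interval to 1s (or your baseline) to restore near-real-time search visibility, and set index.translog.durability back to request so every acknowledged write is fsynced again.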
Verify lifecycle state. Run GET /<target>/_ilm/explain and confirm the expected phase and armed action. A misaligned policy silently bypasses retention.
Decommission the source. After an observation window, delete the source index or make its index.blocks.write permanent, retaining a read-only alias where audit requirements demand it.

Freshly built indices have cold filesystem and query caches, so the first production queries against them pay elevated latency until segments and filter caches populate. Replaying a representative set of synthetic queries before the alias handles full traffic warms those caches to baseline — the cache warming strategies for new indices reference covers query replay and controlled traffic ramps for exactly this window.

Replaying representative queries at stage 4 pulls the fresh index's cold-cache latency down to baseline before it takes full traffic.
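
A query replay loop can be very small. The sketch below assumes an elasticsearch-py v8 style client and a caller-supplied list of representative query bodies; the function name, the number of rounds, and the use of wall-clock timing are all illustrative choices.

```python
import time


def warm_index(es, index: str, queries: list[dict], rounds: int = 3) -> list[float]:
    """Replay each query `rounds` times; return the mean latency in ms per round."""
    means = []
    for _ in range(rounds):
        started = time.perf_counter()
        for query in queries:
            es.search(index=index, query=query)
        elapsed_ms = (time.perf_counter() - started) * 1000
        means.append(elapsed_ms / max(len(queries), 1))
    return means
```

If the per-round mean stops falling between rounds, the caches are likely as warm as this query set will make them, which is a reasonable signal to start ramping real traffic.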

Common Failure Modes

| Symptom | Root cause | Remediation |
| --- | --- | --- |
| Target index not managed by ILM after cutover | `index.lifecycle.name` was omitted at creation, so the index is unmanaged | Apply the policy with `PUT /<target>/_settings`, then confirm via `GET /<target>/_ilm/explain` |
| Reindex stalls at 0 documents; new shards `UNASSIGNED` | Target `require.data` attribute targets a tier with no free capacity | Inspect `GET /_cluster/allocation/explain`; free disk or relax the allocation filter |
| `circuit_breaking_exception` on the coordinating node | `source.size` batch too large for available heap | Lower `source.size` and re-run; under `op_type: create` the new task skips documents that were already copied |
| Rising `write` thread-pool `rejected` count, query latency spikes | `requests_per_second` set above cluster capacity | Re-throttle live with `reindex_rethrottle`; do not cancel the task |
| `version_conflicts` climbing during the copy | Concurrent writes to the target, or duplicate IDs from a prior run | Ensure the source is write-locked; rely on `conflicts: "proceed"` and script-based conflict handling |
| Writes rejected immediately after alias swap | Alias added without `is_write_index: true` while it still points at another index, leaving no write target | Re-issue the `_aliases` action setting `is_write_index: true` on the target |
| Elevated query latency on the new index for minutes after cutover | Cold filesystem and filter caches on fresh segments | Warm caches with representative query replay before full traffic |

Frequently Asked Questions

Does reindex automatically copy the source index's ILM policy to the target?

No. _reindex copies documents, never settings or lifecycle metadata. The target adopts a policy only if index.lifecycle.name and index.lifecycle.rollover_alias are set on it — ideally at creation, or immediately afterward with a _settings update. Verify with GET /<target>/_ilm/explain; an index reporting "managed": false will ignore your retention rules.

Can I reindex into an index that is actively served by a write alias?

Not safely for a zero-downtime cutover. Provision a fresh, empty target, reindex into it while the write-locked source keeps serving reads through the alias, and move the alias in a single atomic _aliases action once the copy is verified. Writing into a live index invites version conflicts and leaves no clean rollback point.

How do I make a reindex resumable after a network failure?

Submit it asynchronously (wait_for_completion=False) and use op_type: "create" on the destination. If the pipeline dies, re-running it skips already-copied documents instead of duplicating them, so the operation is naturally idempotent. Keep the returned task id to reattach with es.tasks.get(task_id=...) before assuming a restart is needed.

Should I reindex or shrink to move data to the warm tier?

Use shrink when you only need to reduce primary shard count and the mapping is unchanged — it is far cheaper because it hard-links segments rather than re-reading every document. Reindex when you must change the mapping, analyzer, or routing, or consolidate multiple sources. A reindex is a full document copy and should be throttled and scheduled accordingly.

How do I throttle a running reindex without restarting it?

Call es.reindex_rethrottle(task_id=task_id, requests_per_second=<new_rate>), which maps to POST /_reindex/<task_id>/_rethrottle. A rate increase takes effect immediately; a decrease applies once the current batch completes. Set a lower value to relieve an overloaded cluster or -1 to remove throttling entirely. There is no need to cancel and resubmit.

Related

Designing batch reindex workflows — partitioning large migrations across queued, independent tasks.
Optimizing reindex thresholds & bulk sizes — tuning scroll size and requests_per_second against cluster capacity.
Resolving document conflicts during reindex — version-aware and script-based conflict handling.
Tracking reindex progress & performance — reading the _tasks API and wiring alerts and rollback.
Cache warming strategies for new indices — eliminating cold-start latency after cutover.
ILM architecture & fundamentals and ILM policy design & lifecycle synchronization — the lifecycle domains a reindex must stay in step with.
