Tracking Reindex Progress & Performance

A production _reindex is a stateful, resource-intensive data migration that directly affects cluster stability, shard allocation, and Index Lifecycle Management (ILM) policy execution — so tracking its progress is not a passive dashboard exercise but an active control loop. The counters the operation exposes decide when to scale bulk workers, when to throttle against ingestion, and when it is safe to trigger downstream cache warming or an alias cutover. Because a large copy runs asynchronously and returns only a task identifier, the entire migration is observed second-hand: you poll that task, translate its counters into operational thresholds, and act on them. Without precise telemetry a reindex can silently drop documents, saturate the bulk queue, or trip a parent circuit breaker mid-flight. This page is part of the automated reindexing pipelines and workflows topic and focuses on turning raw _tasks output into a signal that drives the pipeline.

Prerequisites

Confirm the following before you submit a task you intend to monitor:

Elasticsearch 8.x cluster with the _reindex, _tasks, and _reindex/<id>/_rethrottle endpoints reachable from the automation host.
elasticsearch-py 8.x installed with the async extra (elasticsearch[async]>=8.0,<9.0), which pulls in aiohttp; the async client (AsyncElasticsearch) drives the polling loop below.
The reindex submitted with wait_for_completion=false so it returns a task id instead of blocking a long-lived HTTP connection.
A service account scoped by RBAC with cluster-level monitor (to read _tasks) plus manage on the target index pattern.
A measured baseline of sustained write throughput under peak query load, so requests_per_second targets reflect real capacity rather than a guess — see optimizing reindex thresholds and bulk sizes.
The destination index provisioned from a template that attaches an ILM policy and a rollover_alias, so a clean completion can hand straight off to lifecycle automation.

Architecture: Telemetry as a Control Loop

Progress tracking sits between the batch reindex workflow that submits the task and the alias cutover that promotes the target. The loop is small: submit asynchronously, poll the task status, compare throttled_millis against elapsed time, rethrottle up or down to stay inside the deployment’s headroom, and repeat until the task reports completed. Only a completion with an empty response.failures array is allowed to advance to the alias swap — a failed or partial copy routes to investigation, never to promotion.

Configuration Reference: Submitting a Trackable Task

Effective tracking begins at submission. The _reindex call must return immediately, expose enough progress granularity to reason about throughput, and define conflict behaviour up front so the completion check is meaningful.

POST _reindex?wait_for_completion=false&requests_per_second=50&slices=auto
{
  "source": {
    "index": "logs-2023.01",
    "size": 5000,                  // bulk page size per scroll batch; drives the `batches` counter
    "query": { "range": { "@timestamp": { "gte": "2023-01-01" } } }
  },
  "dest": {
    "index": "logs-2023.01-v2",
    "op_type": "create"            // refuse to overwrite an existing _id — makes a re-run resumable
  },
  "conflicts": "proceed"           // log version conflicts and continue instead of aborting the task
}

Each field has a monitoring consequence:

wait_for_completion=false returns { "task": "<node>:<id>" } immediately — the identifier you poll for the life of the copy.
slices=auto fans the copy into per-shard sub-tasks; each slice reports its own counters, and the parent task aggregates them. This is essential once single-threaded copying becomes the bottleneck on large indices.
requests_per_second is the throttle ceiling. It is not fixed for the run — it is the starting value the control loop tunes with _rethrottle.
conflicts: "proceed" keeps the task alive through version collisions; route the survivors deliberately, as covered in resolving document conflicts during reindex.
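Because the task identifier is the only handle on the copy, it is worth validating before it is persisted for a later resume. A minimal sketch, assuming the `<node>:<id>` shape described above; `TaskRef` and `parse_task_id` are hypothetical helper names, not part of any client library:

```python
from typing import NamedTuple


class TaskRef(NamedTuple):
    node_id: str       # node that owns the task
    task_number: int   # per-node numeric task id


def parse_task_id(task_id: str) -> TaskRef:
    """Split a '<node>:<id>' task identifier into its two parts."""
    node_id, sep, number = task_id.rpartition(":")
    if not sep or not node_id or not number.isdigit():
        raise ValueError(f"not a '<node>:<id>' task id: {task_id!r}")
    return TaskRef(node_id, int(number))


if __name__ == "__main__":
    ref = parse_task_id("oTUltX4IQMOUUVeiohTt8A:12345")
    print(ref.node_id, ref.task_number)
```

Rejecting a malformed id at submission time keeps a bad value out of whatever state store the pipeline uses to resume polling after a client restart.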

Metric Mapping: Reading the Task Status

Once the task is running, GET _tasks/<task_id> returns a task.status object whose fields map directly to operational thresholds. Treat each counter as a signal, not a number:

Metric	Production interpretation	Action trigger
`total`	Estimated document count from the source query	Baseline for the progress-percent calculation
`created` / `updated`	Successfully written documents	Primary throughput indicator
`deleted`	Documents a reindex script marked via `ctx.op = "delete"`	Track lifecycle pruning during the copy
`batches`	Number of scroll batches pulled from the source so far	High `batches` with flat `created` = bulk queue saturation
`throttled_millis`	Time spent sleeping to honour the `requests_per_second` ceiling	`> 15%` of elapsed time means the ceiling, not the cluster, is pacing the copy
`requests_per_second`	The throttle currently in force on the task	Confirms a `_rethrottle` actually took effect
`failures`	Array of bulk and search failures, reported under `response` once the task completes	Any entry demands investigation before promotion

For UI-based spot checks, monitoring reindex task status with Kibana Dev Tools gives a console view of the same counters. Programmatic polling remains mandatory for anything that feeds an automated lifecycle transition.
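The counters in the table can be turned into derived signals with a small pure function. This is a sketch, assuming the `GET _tasks/<task_id>` response has already been loaded into a dict; `Progress` and `summarize` are illustrative names:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Progress:
    percent: float          # processed / total, as a percentage
    docs_per_second: float  # average write rate since the task started
    throttle_ratio: float   # share of elapsed time spent sleeping on the throttle


def summarize(task_doc: dict) -> Progress:
    """Derive progress signals from a GET _tasks/<task_id> response body."""
    task = task_doc["task"]
    status = task["status"]
    processed = sum(status.get(k, 0) for k in ("created", "updated", "deleted"))
    total = status.get("total", 0)
    # running_time_in_nanos sits on the task object; throttled_millis on status.
    elapsed_s = task.get("running_time_in_nanos", 0) / 1e9
    throttled_s = status.get("throttled_millis", 0) / 1e3
    return Progress(
        percent=processed / total * 100 if total else 0.0,
        docs_per_second=processed / elapsed_s if elapsed_s else 0.0,
        throttle_ratio=throttled_s / elapsed_s if elapsed_s else 0.0,
    )
```

Both rates are averages over the whole run, since the counters are cumulative. A loop that needs an instantaneous rate would difference two consecutive samples instead.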

Step-by-Step: A Self-Tuning Polling Loop in Python v8+

The following async driver submits the task, polls it on a fixed interval, computes real-time throughput, and dynamically adjusts the throttle whenever throttled_millis drifts outside a safe band. It uses only the elasticsearch-py v8+ API surface — es.reindex, es.tasks.get, and es.reindex_rethrottle — and treats both a top-level error and any per-document failure as fatal.

1. Submit the task with wait_for_completion=False and capture the returned id.
2. Poll tasks.get on an interval, deriving processed / total for a progress percentage.
3. Compare throttled_millis to running_time_in_nanos to get a throttle ratio.
4. Rethrottle with reindex_rethrottle whenever that ratio leaves the safe band, staying between a fixed floor and ceiling.
5. Gate the exit on completed and an empty response.failures.

import asyncio
import logging
from elasticsearch import AsyncElasticsearch
from elasticsearch.exceptions import ApiError

logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(message)s")
logger = logging.getLogger(__name__)

async def orchestrate_reindex(es: AsyncElasticsearch, source_idx: str, dest_idx: str, base_rps: int = 100):
    # 1. Initiate the async task; it returns immediately with a task id.
    #    The 8.x client takes the request fields as keyword arguments.
    init_resp = await es.reindex(
        source={"index": source_idx, "size": 5000},
        dest={"index": dest_idx, "op_type": "create"},
        conflicts="proceed",
        requests_per_second=base_rps,
        wait_for_completion=False,
        slices="auto",
    )
    task_id = init_resp["task"]
    logger.info(f"Reindex task initiated: {task_id}")

    current_rps = base_rps
    poll_interval = 2.0

    while True:
        try:
            task = await es.tasks.get(task_id=task_id)
            status = task["task"]["status"]
            completed = task.get("completed", False)

            total = status.get("total", 0)
            processed = status.get("created", 0) + status.get("updated", 0) + status.get("deleted", 0)
            throttled = status.get("throttled_millis", 0)
            batches = status.get("batches", 0)

            progress = (processed / total * 100) if total > 0 else 0
            logger.info(
                f"[{task_id}] {progress:.1f}% | "
                f"Processed: {processed}/{total} | Batches: {batches} | "
                f"Throttled: {throttled}ms | RPS: {current_rps}"
            )

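            # 5. Gate the exit: completed, no top-level error, no per-document failures.
            #    Both conditions are fatal, so raise instead of returning quietly.
            if completed:
                if task.get("error"):
                    raise RuntimeError(f"Reindex task {task_id} aborted: {task['error']}")
                failures = task.get("response", {}).get("failures", [])
                if failures:
                    raise RuntimeError(f"Reindex task {task_id} completed with {len(failures)} failures")
                logger.info("Reindex task completed cleanly.")
                break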

            # 3. Throttle ratio: throttled time as a fraction of elapsed time.
            #    running_time_in_nanos lives on the inner task object, not status.
            elapsed_ms = task["task"]["running_time_in_nanos"] / 1_000_000
            throttle_ratio = throttled / elapsed_ms if elapsed_ms > 0 else 0

            # 4. Rethrottle a running task; there is no tasks.update, use reindex_rethrottle.
            #    throttled_millis is time the task slept to honour requests_per_second, so a
            #    high ratio means the ceiling is pacing the copy and a low ratio means the
            #    cluster is. Lowering the rate on a high ratio would only raise the ratio.
            if throttle_ratio > 0.15 and current_rps < 500:
                current_rps = min(500, int(current_rps * 1.1))
                await es.reindex_rethrottle(task_id=task_id, requests_per_second=current_rps)
                logger.info(f"Throttle-bound: raised RPS to {current_rps}")
            elif throttle_ratio < 0.05 and current_rps > 25:
                current_rps = max(25, int(current_rps * 0.8))
                await es.reindex_rethrottle(task_id=task_id, requests_per_second=current_rps)
                logger.info(f"Cluster-bound: reduced RPS to {current_rps}")

            await asyncio.sleep(poll_interval)

        except ApiError as e:
            if e.status_code == 404:
                logger.warning("Task id not found — verify with es.tasks.list() before assuming loss.")
                break
            raise

# Usage:
# async def main():
#     async with AsyncElasticsearch("https://cluster:9200", api_key="...", verify_certs=True) as es:
#         await orchestrate_reindex(es, "logs-2023.01", "logs-2023.01-v2")
# asyncio.run(main())

Verification Steps

Independent of the polling loop, confirm the copy behaved before promoting anything:

# Human-readable, per-slice detail for a live or finished task.
GET _tasks/<task_id>?detailed=true&human

# Compare document counts between source and target — the authoritative parity check.
GET _cat/indices/logs-2023.01,logs-2023.01-v2?v&h=index,docs.count,store.size

# Confirm the lifecycle policy is bound to the target before handing off.
GET logs-2023.01-v2/_ilm/explain

A clean run shows matching (or intentionally filtered) docs.count, an empty response.failures, and an _ilm/explain output where the target’s managed flag is true and it sits in the expected phase. A mismatch here is the last chance to catch a silent drop before the alias moves.
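The checks above can be collapsed into one gate that returns the reasons a promotion should not go ahead. This is a sketch under stated assumptions: `promotion_blockers` is a hypothetical helper, `expected_docs` is whatever count the reindex query should have produced (the full source `docs.count` only when the copy is unfiltered), and `target_docs` comes from the `_cat/indices` check:

```python
from typing import List


def promotion_blockers(task_doc: dict, expected_docs: int, target_docs: int) -> List[str]:
    """Return every reason the target index must not be promoted yet."""
    blockers = []
    if not task_doc.get("completed", False):
        blockers.append("task not completed")
    if task_doc.get("error"):
        blockers.append("task reported a top-level error")
    failures = task_doc.get("response", {}).get("failures", [])
    if failures:
        blockers.append(f"{len(failures)} entries in response.failures")
    if expected_docs != target_docs:
        blockers.append(f"document count mismatch: expected {expected_docs}, found {target_docs}")
    return blockers


if __name__ == "__main__":
    finished = {"completed": True, "response": {"failures": []}}
    print(promotion_blockers(finished, 1000, 1000) or "safe to promote")
```

Returning a list rather than a boolean means the pipeline can log every failed condition at once instead of discovering them one run at a time.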

Threshold Tuning & Bulk Queue Management

Static requests_per_second values rarely survive production traffic spikes. The size parameter in the source dictates bulk payload: for time-series logs, size: 5000 to 10000 usually optimizes network round-trips without triggering es_rejected_execution_exception on coordinating nodes. Watch the relationship between counters — if batches climbs linearly while created/updated stalls, the bulk queue is saturated. Reduce size by roughly 30% and raise the destination refresh_interval to 30s (or -1) for the copy window.
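The saturation signal described above is a comparison between two consecutive status samples. A minimal sketch, assuming each sample is the `status` dict from one poll; the function names, the `min_batches` guard, and the 500-document floor are illustrative choices rather than recommended values:

```python
def is_bulk_saturated(prev: dict, curr: dict, min_batches: int = 3) -> bool:
    """True when batches kept climbing between two polls but no documents landed."""
    def written(status: dict) -> int:
        return status.get("created", 0) + status.get("updated", 0)

    batch_delta = curr.get("batches", 0) - prev.get("batches", 0)
    return batch_delta >= min_batches and written(curr) == written(prev)


def reduced_page_size(size: int, cut_percent: int = 30, floor: int = 500) -> int:
    """Shrink source.size by cut_percent, never going below floor."""
    return max(floor, size * (100 - cut_percent) // 100)


if __name__ == "__main__":
    before = {"batches": 10, "created": 50_000}
    after = {"batches": 14, "created": 50_000}
    if is_bulk_saturated(before, after):
        print("next run should use size =", reduced_page_size(5000))
```

`source.size` cannot be changed on a running task, so the reduced value applies to the next submission, for example after cancelling and resuming with `op_type: create`.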

When scaling across data nodes, align slice count with primary shard count. Over-slicing (slices > primary_shards * 2) fragments the bulk queue and increases JVM heap churn without improving throughput. Validate thread-pool pressure during peak windows with GET _nodes/stats/thread_pool, and treat a rising write queue depth as the real ceiling regardless of what requests_per_second is set to. Deeper calibration of these knobs lives in optimizing reindex thresholds and bulk sizes.
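Both rules in this paragraph reduce to small checks. The sketch below assumes the `GET _nodes/stats/thread_pool` response has been loaded into a dict with the usual `nodes.<id>.thread_pool.write` layout; `capped_slices` and `write_pressure` are hypothetical helper names:

```python
from typing import Dict, List, Tuple


def capped_slices(requested: int, primary_shards: int) -> int:
    """Keep a manual slice count between 1 and twice the primary shard count."""
    return max(1, min(requested, primary_shards * 2))


def write_pressure(nodes_stats: Dict) -> List[Tuple[str, int, int]]:
    """Return (node name, write queue depth, rejected count), busiest node first."""
    rows = []
    for node_id, node in nodes_stats.get("nodes", {}).items():
        pool = node.get("thread_pool", {}).get("write", {})
        rows.append((node.get("name", node_id), pool.get("queue", 0), pool.get("rejected", 0)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    stats = {
        "nodes": {
            "a1": {"name": "data-1", "thread_pool": {"write": {"queue": 12, "rejected": 0}}},
            "b2": {"name": "data-2", "thread_pool": {"write": {"queue": 180, "rejected": 4}}},
        }
    }
    print(capped_slices(16, 5), write_pressure(stats)[0])
```

Sorting by queue depth puts the node most likely to reject writes at the top, which is the one the throttle decision should be based on.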

Troubleshooting

Stalled or zombie tasks

If processed plateaus for more than ~10 minutes, inspect for shard allocation failures or mapping conflicts, then cancel cleanly to release bulk queue resources if the task is unrecoverable:

GET _tasks/<task_id>?detailed=true&human
POST _tasks/<task_id>/_cancel
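Detecting the plateau itself is simple bookkeeping inside the polling loop. A minimal sketch, assuming `processed` is the sum the loop already computes and `now` is a monotonic timestamp in seconds; `StallDetector` is a hypothetical name and the 600-second default mirrors the ten-minute guideline above:

```python
from typing import Optional


class StallDetector:
    """Flags a task whose processed count has not moved for too long."""

    def __init__(self, limit_seconds: float = 600.0) -> None:
        self.limit_seconds = limit_seconds
        self._last_processed: Optional[int] = None
        self._last_change: float = 0.0

    def observe(self, processed: int, now: float) -> bool:
        """Record one poll; return True once the plateau exceeds the limit."""
        if self._last_processed is None or processed != self._last_processed:
            self._last_processed = processed
            self._last_change = now
            return False
        return now - self._last_change > self.limit_seconds


if __name__ == "__main__":
    detector = StallDetector()
    for second, count in [(0, 100), (300, 100), (601, 100)]:
        print(second, detector.observe(count, float(second)))
```

A positive result should trigger inspection first. Cancellation stays a deliberate second step, as described above.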

Heap pressure and circuit breakers

High throttled_millis combined with parent circuit-breaker trips signals insufficient heap for bulk indexing. Mitigate by lowering requests_per_second to 25–50, restricting allocation to primaries for the copy window, then restoring replicas afterward:

PUT _cluster/settings
{
  "transient": { "cluster.routing.allocation.enable": "primaries" }
}

Re-enable full allocation ("all") once the copy finishes and the target returns to green.

Silent document drops

A non-zero delta between total and processed usually comes from malformed documents or strict mapping enforcement. Inspect response.failures on the completed task — _reindex does not auto-route rejected documents to another index, and ignore_unavailable only skips missing or closed source indices; it never captures a malformed-document error. Fix the destination mapping or pre-process the offending documents with an ingest pipeline, then cross-reference dropped _id values against the source query to isolate the schema drift.
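Grouping the failures by cause is usually the quickest way to see whether a drop is one mapping problem or several. The sketch below assumes each entry in `response.failures` carries an `id` and a `cause.type`, which is the shape bulk failures normally take; treat that as an assumption and check it against a real response. `failure_causes` and `failed_ids` are hypothetical helpers:

```python
from collections import Counter
from typing import List


def failure_causes(task_doc: dict) -> Counter:
    """Count response.failures entries by their cause type."""
    failures = task_doc.get("response", {}).get("failures", [])
    return Counter(f.get("cause", {}).get("type", "unknown") for f in failures)


def failed_ids(task_doc: dict, cause_type: str) -> List[str]:
    """List the _id values rejected for one cause type."""
    failures = task_doc.get("response", {}).get("failures", [])
    return [f["id"] for f in failures if f.get("cause", {}).get("type") == cause_type and "id" in f]


if __name__ == "__main__":
    doc = {
        "response": {
            "failures": [
                {"id": "a", "cause": {"type": "mapper_parsing_exception"}},
                {"id": "b", "cause": {"type": "mapper_parsing_exception"}},
                {"id": "c", "cause": {"type": "strict_dynamic_mapping_exception"}},
            ]
        }
    }
    print(failure_causes(doc).most_common())
```

The id list from `failed_ids` is what gets cross-referenced against the source query when isolating schema drift.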

FAQ

Why poll _tasks instead of just running the reindex synchronously and reading the response?

A large copy runs for minutes or hours. A blocking call pins the client to a single long-lived HTTP connection that a proxy or load balancer will eventually cut, and a dropped connection tells you nothing about whether the copy is still progressing server-side. Submitting with wait_for_completion=false returns a task id immediately; polling GET _tasks/<id> gives you live counters and makes the whole operation resumable after a client crash.

How do I change the throttle on a reindex that is already running?

Call reindex_rethrottle(task_id=..., requests_per_second=...) (REST: POST _reindex/<task_id>/_rethrottle). There is no tasks.update for this — reindex_rethrottle is the only way to retune a live task. Set requests_per_second to -1 to remove throttling entirely, or lower it the moment the write thread-pool queue starts backing up. The change applies to the running task without interrupting it.

What throttle ratio should trigger a rethrottle?

Compute throttled_millis / running_time_in_nanos (after converting nanos to millis). throttled_millis is the time the task spent sleeping to honour requests_per_second, so above roughly 15% the throttle, not the cluster, is holding the copy back: step requests_per_second up if the write queue has room. Below 5% the throttle is barely engaging and the cluster itself sets the pace, so step it down to leave capacity for ingestion. Keep a floor (for example 25) and a ceiling so an over-aggressive loop cannot push the deployment into es_rejected_execution_exception. Note that running_time_in_nanos lives on the inner task object, not inside status.
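The floor and ceiling are easiest to enforce in one place, so no code path can step outside them. A minimal sketch; `step_rps` is a hypothetical helper, and the 25 and 500 bounds are only the example values used on this page:

```python
def step_rps(current: int, percent: int, floor: int = 25, ceiling: int = 500) -> int:
    """Scale current by percent/100 and clamp the result to [floor, ceiling].

    percent=110 steps up by 10%, percent=80 steps down by 20%.
    Integer arithmetic avoids float rounding at the boundaries.
    """
    return max(floor, min(ceiling, current * percent // 100))


if __name__ == "__main__":
    print(step_rps(100, 110))  # step up
    print(step_rps(100, 80))   # step down
    print(step_rps(480, 110))  # clamped to the ceiling
```

The value returned here is what would be passed as requests_per_second to reindex_rethrottle. When it equals the current value, the call can be skipped.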

The task reports completed: true — is it safe to swap the alias?

Not on completed alone. A completed task can still carry per-document rejections under response.failures, and a top-level error means the whole operation aborted. Gate the alias swap on four checks: completed is true, there is no top-level error, response.failures is empty, and the document counts reconcile via _cat/indices. Only then promote the target and hand off to ILM.

Why does total sometimes differ from the source index document count?

total is the estimate derived from the reindex query, not the raw source size. A range filter, a routing constraint, or documents deleted after the scroll started all make total legitimately smaller than docs.count. Treat an unexplained gap between total and processed as the drop signal, and reconcile total against the query rather than against the whole source index.

Related

Designing Batch Reindex Workflows: the resumable state machine that submits the task you monitor here.
Optimizing Reindex Thresholds & Bulk Sizes: calibrating source.size, requests_per_second, and slices against cluster topology.
Resolving Document Conflicts During Reindex: routing the version collisions your failures array surfaces.
Cache Warming Strategies for New Indices: the downstream step a clean completion triggers.
Monitoring Reindex Task Status with Kibana Dev Tools: the console view of the same counters.
