Monitoring Reindex Task Status with Kibana Dev Tools

When a production _reindex stalls, throttles, or starts logging conflicts, you need to read its state directly from the Kibana Dev Tools console — this page is the exact sequence of _tasks queries that tells you what the copy is doing and how to intervene without leaving the console.

Where this fits in the reindex lifecycle

This is the hands-on console counterpart to tracking reindex progress and performance: that parent topic turns the _tasks API into an automated control loop, while this page covers the manual, interactive path an on-call engineer takes in Kibana Dev Tools when something looks wrong mid-flight. It belongs to the broader automated reindexing pipelines and workflows domain. Passive Kibana dashboards aggregate on a delay and hide the shard-level failures array that actually explains a stall, so for live triage you query the task-specific endpoint directly and act on its raw counters. The console is also where you confirm a fix took effect before handing control back to automation.

Prerequisites

Elasticsearch 8.x with Kibana, and Dev Tools console access under Management → Dev Tools.
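A service account or role scoped by RBAC with the cluster-level monitor privilege to read _tasks, and the cluster-level manage privilege to rethrottle, cancel, or reroute (these are cluster actions, so manage on the target index pattern alone does not cover them).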
The task id from the submission response — a reindex started with wait_for_completion=false returns { "task": "<node_id>:<task_number>" }. That colon-joined string is the id every query below expects.
If you lost the id, recover it with GET _tasks?actions=*reindex&detailed=true before proceeding.
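
The recovery step above can be scripted once the list response is in hand. The following is a minimal sketch, not a definitive implementation: it assumes the description field follows the "reindex from [source] to [dest]" pattern, and the function name, node id, and index names are hypothetical.

```python
from typing import Optional


def find_reindex_task_id(tasks_response: dict, dest_index: str) -> Optional[str]:
    """Return the first task id whose description mentions the destination index.

    Assumes the description contains "to [<dest_index>]"; adjust the needle
    if your cluster formats descriptions differently.
    """
    needle = f"to [{dest_index}]"
    for node in tasks_response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            if needle in task.get("description", ""):
                return task_id
    return None


# Hand-written sample shaped like a list-tasks response (not real cluster output).
sample = {
    "nodes": {
        "oTUltX4IQMOUUVeiohTt8A": {
            "tasks": {
                "oTUltX4IQMOUUVeiohTt8A:12345": {
                    "action": "indices:data/write/reindex",
                    "description": "reindex from [logs-old] to [logs-new]",
                }
            }
        }
    }
}

print(find_reindex_task_id(sample, "logs-new"))
```

The returned string is the colon-joined id, so it can be pasted straight into GET _tasks/<task_id>.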

Atomic task inspection and baseline diagnostics

Reliable monitoring starts by querying the task lifecycle directly rather than polling _cat/tasks snapshots. Paste the task-specific endpoint into the console to retrieve atomic progress metrics:

GET _tasks/<task_id>

The response payload contains the state indicators that matter. Compare status.created and status.updated against status.total to gauge progress, and difference those counters across two polls to calculate throughput. If status.total is known but status.created plateaus well short of it, suspect a source-scan bottleneck, typically heavy segment merging or an aggressive refresh_interval on the source. If status.version_conflicts increments, or a completed payload carries entries under response.failures, inspect them immediately; common culprits are version_conflict_engine_exception and mapper_parsing_exception. The detailed flag belongs to the list endpoint (GET _tasks?actions=*reindex&detailed=true), where it adds the description field; on the task-specific endpoint, append ?human for readable durations:

GET _tasks/<task_id>?human

Establish baseline throughput thresholds so automated alerts fire before degradation reaches downstream consumers — the counter-to-threshold mapping lives in the parent topic, tracking reindex progress and performance.
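
Throughput only becomes meaningful across two polls. The sketch below shows one way to derive documents per second and a rough ETA from two GET _tasks/<task_id> payloads. The function name and the sample numbers are illustrative assumptions, and the ETA ignores noops and version conflicts, so treat it as an estimate.

```python
def throughput(prev: dict, curr: dict) -> dict:
    """Compute docs/sec and a rough ETA from two task payloads taken in order."""

    def processed(payload: dict) -> int:
        status = payload["task"]["status"]
        return status.get("created", 0) + status.get("updated", 0) + status.get("deleted", 0)

    elapsed_s = (
        curr["task"]["running_time_in_nanos"] - prev["task"]["running_time_in_nanos"]
    ) / 1e9
    if elapsed_s <= 0:
        raise ValueError("polls must be taken in order, at different times")

    rate = (processed(curr) - processed(prev)) / elapsed_s
    remaining = curr["task"]["status"]["total"] - processed(curr)
    eta_s = remaining / rate if rate > 0 else None  # None means no progress
    return {"docs_per_second": rate, "remaining": remaining, "eta_seconds": eta_s}


# Hand-written payloads, trimmed to the fields the function reads.
first_poll = {
    "task": {
        "running_time_in_nanos": 60_000_000_000,
        "status": {"total": 1_000_000, "created": 100_000, "updated": 0, "deleted": 0},
    }
}
second_poll = {
    "task": {
        "running_time_in_nanos": 120_000_000_000,
        "status": {"total": 1_000_000, "created": 160_000, "updated": 0, "deleted": 0},
    }
}

print(throughput(first_poll, second_poll))
```

A rate of zero with a non-zero remaining count is the plateau signature described above.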

Thread-pool exhaustion and dynamic throttling

A task reporting running: true with zero progress for more than ~300 seconds usually signals thread-pool exhaustion or a circuit-breaker trip. Verify the search and write thread pools from the console:

GET _cat/thread_pool/search,write?v&h=node_name,name,active,queue,rejected

If queue and rejected counters spike, the reindex is self-throttling against the bulk queue. Adjust requests_per_second dynamically through the rethrottle endpoint — note the value is a query parameter, not a request body:

POST _reindex/<task_id>/_rethrottle?requests_per_second=500
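
Because rejected is a lifetime counter, a spike shows up only as a delta between two readings. As a rough sketch, assuming payloads shaped like the thread_pool section of GET _nodes/stats/thread_pool, the comparison could look like this. The function name, node id, and sample values are hypothetical.

```python
def write_pool_pressure(prev_stats: dict, curr_stats: dict, pool: str = "write") -> dict:
    """Per node: current queue depth and rejections added since the previous poll."""
    report = {}
    for node_id, node in curr_stats.get("nodes", {}).items():
        now = node["thread_pool"][pool]
        before = (
            prev_stats.get("nodes", {})
            .get(node_id, {})
            .get("thread_pool", {})
            .get(pool, {})
        )
        report[node.get("name", node_id)] = {
            "queue": now["queue"],
            # A node missing from the previous poll reports zero new rejections.
            "new_rejections": now["rejected"] - before.get("rejected", now["rejected"]),
        }
    return report


# Hand-written samples, trimmed to the fields the function reads.
poll_a = {"nodes": {"n1": {"name": "data-node-03", "thread_pool": {"write": {"queue": 12, "rejected": 40}}}}}
poll_b = {"nodes": {"n1": {"name": "data-node-03", "thread_pool": {"write": {"queue": 180, "rejected": 95}}}}}

print(write_pool_pressure(poll_a, poll_b))
```

A growing new_rejections value between polls is the cue to lower requests_per_second rather than raise it.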

Simultaneously inspect the circuit-breaker state:

GET _nodes/stats/breaker

If the fielddata or request breaker limits are breached, the node rejects bulk operations. Reduce requests_per_second to 100–200 and watch breakers.parent.estimated_size_in_bytes against breakers.parent.limit_size_in_bytes in the response. Do not raise or bypass circuit breakers: they exist to prevent OOM crashes, and the correct response to a trip is less concurrency, not more headroom. Deeper calibration of these knobs is covered in tuning bulk request size for high-throughput reindexing.
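
To make the breaker comparison concrete, here is a small sketch that turns a GET _nodes/stats/breaker payload into a used fraction per node. The function name and the sample figures are assumptions for illustration; the field names are the ones the breaker stats expose.

```python
def breaker_headroom(breaker_stats: dict, breaker: str = "parent") -> dict:
    """Per node: fraction of the breaker limit in use and its lifetime trip count."""
    result = {}
    for node_id, node in breaker_stats.get("nodes", {}).items():
        stats = node["breakers"][breaker]
        used = stats["estimated_size_in_bytes"] / stats["limit_size_in_bytes"]
        result[node.get("name", node_id)] = {
            "used_fraction": round(used, 3),
            "tripped": stats["tripped"],
        }
    return result


# Hand-written sample, trimmed to the fields the function reads.
breaker_sample = {
    "nodes": {
        "n1": {
            "name": "data-node-03",
            "breakers": {
                "parent": {
                    "limit_size_in_bytes": 1_000_000_000,
                    "estimated_size_in_bytes": 850_000_000,
                    "tripped": 2,
                }
            },
        }
    }
}

print(breaker_headroom(breaker_sample))
```

The tripped counter is cumulative, so compare it across polls the same way as the thread-pool rejections.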

Cluster state verification and safe manual reroutes

Mid-reindex shard-allocation failures require immediate cluster-state inspection. Run these diagnostics in sequence from the console:

GET _cluster/health

Expected output (degraded state):

{
  "cluster_name": "prod-logs-cluster",
  "status": "yellow",
  "timed_out": false,
  "number_of_nodes": 5,
  "active_shards": 142,
  "active_primary_shards": 71,
  "unassigned_shards": 3,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 97.9
}

If unassigned_shards > 0, isolate the cause:

GET _cluster/allocation/explain

Look for ALLOCATION_FAILED or NODE_LEFT triggers. For an ILM-managed target index, confirm the lifecycle policy is not itself stuck — a target that stalls its own rollover step masquerades as a reindex problem:

GET <target_index>/_ilm/explain

Expected output (ILM waiting on rollover):

{
  "indices": {
    "logs-prod-2024.01": {
      "index": "logs-prod-2024.01",
      "managed": true,
      "policy": "logs-policy",
      "lifecycle_date_millis": 1706745600000,
      "phase": "hot",
      "action": "rollover",
      "step": "check-rollover-ready",
      "step_info": {
        "message": "index has exceeded [max_primary_shard_size=50gb] - will rollover"
      }
    }
  }
}
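
When ILM itself has failed, the explain output reports the step as ERROR and names the step that failed. The sketch below, under that assumption, filters an _ilm/explain payload down to the indices that need attention. The function name and the sample failure are hypothetical.

```python
def ilm_blockers(explain: dict) -> list:
    """Return (index, failed_step, reason) for managed indices parked in the ERROR step."""
    blocked = []
    for name, info in explain.get("indices", {}).items():
        if info.get("managed") and info.get("step") == "ERROR":
            reason = info.get("step_info", {}).get("reason", "")
            blocked.append((name, info.get("failed_step"), reason))
    return blocked


# Hand-written sample: one healthy index, one stuck in ERROR.
explain_sample = {
    "indices": {
        "logs-prod-2024.01": {
            "managed": True,
            "phase": "hot",
            "action": "rollover",
            "step": "check-rollover-ready",
        },
        "logs-prod-2024.02": {
            "managed": True,
            "phase": "hot",
            "action": "rollover",
            "step": "ERROR",
            "failed_step": "check-rollover-ready",
            "step_info": {"type": "illegal_argument_exception", "reason": "example failure reason"},
        },
    }
}

print(ilm_blockers(explain_sample))
```

An empty list means ILM is waiting normally, as in the check-rollover-ready output shown above, and the stall lies elsewhere.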

If replicas remain unassigned because of disk pressure or node constraints, execute a safe manual reroute with allocate_replica. Do not use accept_data_loss: true unless explicitly authorized by platform engineering. That flag applies only to the allocate_stale_primary and allocate_empty_primary commands, and using it discards data on the affected shard. The safe form is:

POST _cluster/reroute
{
  "commands": [
    {
      "allocate_replica": {
        "index": "<target_index>",
        "shard": 0,
        "node": "data-node-03"
      }
    }
  ]
}
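
If several replicas need placing, building the request body programmatically keeps the command type fixed to allocate_replica. The following is a sketch with a hypothetical helper name and sample values. The reroute API also accepts a dry_run=true query parameter, which is a reasonable first pass before committing.

```python
def build_replica_reroute(assignments: list) -> dict:
    """Build a _cluster/reroute body containing only allocate_replica commands.

    Each assignment is an (index, shard, node) tuple. No primary-allocation
    command is ever emitted, so accept_data_loss cannot appear in the output.
    """
    commands = []
    for index, shard, node in assignments:
        if not isinstance(shard, int) or shard < 0:
            raise ValueError(f"invalid shard number for {index}: {shard!r}")
        commands.append(
            {"allocate_replica": {"index": index, "shard": shard, "node": node}}
        )
    return {"commands": commands}


body = build_replica_reroute([("logs-new", 0, "data-node-03"), ("logs-new", 1, "data-node-04")])
print(body)
```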

Conflict handling and emergency bypass

The _reindex API defaults to abort on the first conflict. An emergency schema migration may need "conflicts": "proceed" in the original submission body to keep the task alive through version_conflict_engine_exception, but that survivor set demands strict post-migration reconciliation. To preserve ordering across a copy, use explicit script overrides or version_type: "external" on the destination. Never ship conflicts: proceed to production without a validated reconciliation script and change-management approval — the mechanics of doing this safely are in handling version conflicts with external versioning.
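
The conflict options discussed above all live in the submission body. As an illustrative sketch, a builder that makes each choice explicit might look like the following. The function and parameter names are hypothetical; the body keys (conflicts, dest.op_type, dest.version_type) are the documented _reindex fields.

```python
def build_reindex_body(
    source: str,
    dest: str,
    *,
    skip_existing: bool = False,
    external_versioning: bool = False,
    proceed_on_conflict: bool = False,
) -> dict:
    """Assemble a _reindex request body with explicit conflict behaviour."""
    if skip_existing and external_versioning:
        raise ValueError("choose either skip_existing or external_versioning, not both")
    body = {"source": {"index": source}, "dest": {"index": dest}}
    if skip_existing:
        # Only create documents that are missing from the destination.
        body["dest"]["op_type"] = "create"
    if external_versioning:
        # Carry the source version so older copies cannot overwrite newer ones.
        body["dest"]["version_type"] = "external"
    if proceed_on_conflict:
        # Count conflicts instead of aborting; requires reconciliation afterwards.
        body["conflicts"] = "proceed"
    return body


print(build_reindex_body("logs-old", "logs-new", external_versioning=True))
print(build_reindex_body("logs-old", "logs-new", skip_existing=True, proceed_on_conflict=True))
```

Leaving proceed_on_conflict off keeps the default abort behaviour, which is the safer starting point.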

Automated recovery with elasticsearch-py v8+

Interactive console work is right for triage, but a sustained outage needs automation. The orchestrator below polls the task, validates cluster health before acting, applies dynamic throttling, and executes a safe reroute — using only the elasticsearch-py v8+ API surface (tasks.get, reindex_rethrottle, cluster.health, cluster.reroute) with exponential backoff on transient failures.

import logging
import time
from elasticsearch import Elasticsearch
from elasticsearch import ApiError, ConnectionError

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger("reindex_recovery")


class ReindexRecoveryOrchestrator:
    def __init__(self, es_host: str, api_key: str, task_id: str):
        self.es = Elasticsearch(hosts=[es_host], api_key=api_key, verify_certs=True)
        self.task_id = task_id
        self.max_retries = 5
        self.backoff_factor = 2

    def poll_task_state(self) -> dict:
        """Retrieve the task status; tasks.get takes no `detailed` flag in the v8 client."""
        return self.es.tasks.get(task_id=self.task_id).body

    def apply_dynamic_throttle(self, rps: int) -> bool:
        """Retune a running reindex — reindex_rethrottle only; there is no tasks.update."""
        try:
            self.es.reindex_rethrottle(task_id=self.task_id, requests_per_second=rps)
            logger.info("Throttle adjusted to %d rps for task %s", rps, self.task_id)
            return True
        except ApiError as exc:
            logger.error("Throttle adjustment failed: %s", exc)
            return False

    def verify_cluster_health(self) -> bool:
        """Refuse to act while the cluster is red or has unassigned shards."""
        health = self.es.cluster.health(wait_for_status="yellow", timeout="30s")
        if health["status"] == "red":
            logger.critical("Cluster status RED. Halting recovery.")
            return False
        return health["unassigned_shards"] == 0

    def execute_safe_reroute(self, index: str, shard: int, node: str) -> bool:
        """Attempt replica allocation without any data loss."""
        try:
            self.es.cluster.reroute(
                commands=[{
                    "allocate_replica": {
                        "index": index,
                        "shard": shard,
                        "node": node,
                    }
                }]
            )
            logger.info("Safe reroute executed for %s shard %d on %s", index, shard, node)
            return True
        except ApiError as exc:
            logger.error("Reroute failed: %s", exc)
            return False

    def run_recovery_cycle(self) -> bool:
        """Main loop: poll, guard on health, throttle, and exit on clean completion."""
        for attempt in range(self.max_retries):
            try:
                task = self.poll_task_state()
                status = task["task"]["status"]
                # Mid-flight conflicts surface as status.version_conflicts; document-level
                # failures appear under response.failures once the task completes.
                conflicts = status.get("version_conflicts", 0)
                failures = task.get("response", {}).get("failures", [])

                if conflicts > 0 or failures:
                    logger.warning("Conflicts: %d, document failures: %d", conflicts, len(failures))
                    if not self.verify_cluster_health():
                        time.sleep(self.backoff_factor ** attempt)
                        continue
                    if status.get("throttled_millis", 0) > 5000:
                        self.apply_dynamic_throttle(250)
                elif status.get("total", 0) > 0 and status.get("created", 0) == 0:
                    logger.critical("Source-scan bottleneck detected. Escalating to manual triage.")
                    break

                if task.get("completed"):
                    logger.info("Reindex task completed.")
                    return not failures

                time.sleep(10)

            except ConnectionError:
                logger.error("Cluster connection lost. Retrying...")
                time.sleep(self.backoff_factor ** attempt)

        logger.critical("Recovery cycle exhausted. Manual intervention required.")
        return False


if __name__ == "__main__":
    orchestrator = ReindexRecoveryOrchestrator(
        es_host="https://prod-es-cluster:9200",
        api_key="YOUR_API_KEY",
        task_id="your_task_id_here",
    )
    orchestrator.run_recovery_cycle()

Verification

Confirm every intervention took effect before you hand control back to automation. Re-poll the task and check that the throttle you set is now in force and progress is moving:

GET _tasks/<task_id>?filter_path=task.status.requests_per_second,task.status.created,task.status.batches

Expected output after a successful rethrottle:

{
  "task": {
    "status": {
      "requests_per_second": 500.0,
      "created": 842000,
      "batches": 169
    }
  }
}

A requests_per_second that matches the value you posted confirms the _rethrottle landed; a rising created between two polls confirms the copy is no longer stalled. After a reroute, re-run GET _cluster/health and verify unassigned_shards has dropped to 0.
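
The two confirmations above can be checked mechanically from a pair of filtered payloads. This is a minimal sketch with a hypothetical function name and sample values, assuming both payloads include the three fields requested through filter_path.

```python
def verify_rethrottle(before: dict, after: dict, expected_rps: float) -> dict:
    """Compare two filtered task payloads taken before and after a rethrottle."""
    s0 = before["task"]["status"]
    s1 = after["task"]["status"]
    return {
        "throttle_applied": float(s1["requests_per_second"]) == float(expected_rps),
        "created_delta": s1["created"] - s0["created"],
        "progressing": s1["created"] > s0["created"] or s1["batches"] > s0["batches"],
    }


# Hand-written samples shaped like the filter_path output.
poll_before = {"task": {"status": {"requests_per_second": 500.0, "created": 842000, "batches": 169}}}
poll_after = {"task": {"status": {"requests_per_second": 500.0, "created": 847000, "batches": 170}}}

print(verify_rethrottle(poll_before, poll_after, 500))
```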

Gotchas and edge cases

throttled_millis is cumulative, not a rate. It only ever grows, so a single reading tells you nothing — compute the delta against running_time_in_nanos between two polls. A large one-time value early in the run is normal ramp-up, not a live stall.
completed: true is not success. A completed task can still carry per-document rejections under response.failures, and it appears only on the completed payload — while the task runs, watch status.version_conflicts instead. Never gate an alias swap on completed alone.
Rethrottling to a very high value can trip the write queue. Setting requests_per_second=-1 (unlimited) or a large number on a heap-constrained cluster pushes es_rejected_execution_exception on coordinating nodes. Step the value up and re-poll rather than jumping to unlimited.
The Dev Tools timeout hides long tasks, not the task itself. If the console request times out, the reindex keeps running server-side; re-issue GET _tasks/<task_id> rather than resubmitting the copy, which would create a second competing task.
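
Two of these gotchas reduce to small checks. The sketch below is one possible shape, with hypothetical function names and sample data: the first computes the share of wall time spent throttled between two polls, and the second refuses an alias swap unless the completed payload is clean. It assumes a failed task carries a top-level error key and that response.timed_out is present on completion.

```python
def throttled_fraction(prev: dict, curr: dict) -> float:
    """Share of the interval between two polls that the task spent throttled."""
    d_throttled_ms = (
        curr["task"]["status"]["throttled_millis"]
        - prev["task"]["status"]["throttled_millis"]
    )
    d_running_ms = (
        curr["task"]["running_time_in_nanos"] - prev["task"]["running_time_in_nanos"]
    ) / 1e6
    if d_running_ms <= 0:
        raise ValueError("polls must be taken in order, at different times")
    return d_throttled_ms / d_running_ms


def safe_to_swap_alias(task: dict) -> bool:
    """True only for a completed task with no error, timeout, or document failures."""
    if not task.get("completed") or "error" in task:
        return False
    response = task.get("response", {})
    return not response.get("failures") and not response.get("timed_out", False)


# Hand-written samples, trimmed to the fields the functions read.
p1 = {"task": {"running_time_in_nanos": 60_000_000_000, "status": {"throttled_millis": 2000}}}
p2 = {"task": {"running_time_in_nanos": 120_000_000_000, "status": {"throttled_millis": 14000}}}
finished = {"completed": True, "response": {"timed_out": False, "failures": []}}

print(throttled_fraction(p1, p2))
print(safe_to_swap_alias(finished))
```

Version conflicts under conflicts: proceed are not treated as failures here; they still need the reconciliation pass described earlier.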

Frequently Asked Questions

How do I find the task id if I did not save it from the reindex response?

Run GET _tasks?actions=*reindex&detailed=true in the console. It lists every running reindex with its full node_id:task_number id, the target index, and current counters. Match on the destination index name in the description field, then use that id for the task-specific GET _tasks/<task_id> queries.

Why does GET _tasks/<task_id> return a 404 while the reindex is clearly still running?

A 404 almost always means the id is malformed — the task id is the colon-joined node_id:task_number string, and dropping the node prefix or using the numeric part alone returns not-found. Re-list with GET _tasks?actions=*reindex to copy the exact id. A genuine 404 after the task was visible means it completed or was cancelled; check .tasks if result persistence is enabled.
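
A quick shape check before querying can rule out a truncated id. This is a sketch only: the pattern assumes nothing beyond the node_id:task_number form described above, and the function name and sample ids are hypothetical.

```python
import re

# node_id (no colons or whitespace), a colon, then a numeric task number.
TASK_ID_PATTERN = re.compile(r"^[^:\s]+:\d+$")


def is_wellformed_task_id(task_id: str) -> bool:
    """Check that a string has the colon-joined node_id:task_number shape."""
    return bool(TASK_ID_PATTERN.match(task_id))


print(is_wellformed_task_id("oTUltX4IQMOUUVeiohTt8A:12345"))
print(is_wellformed_task_id("12345"))
```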

Can I cancel a stuck reindex from Dev Tools without losing already-copied documents?

Yes. POST _tasks/<task_id>/_cancel stops the copy and frees write-queue resources; documents already written to the target stay there. If the copy was submitted with op_type: create on the destination and conflicts: proceed, you can resubmit the same request afterward and it skips the ids already present instead of aborting on the first one. Cancelling is therefore safe as long as you have not yet swapped the alias to the partially filled target.

Does rethrottling from the console interrupt the running task?

No. POST _reindex/<task_id>/_rethrottle?requests_per_second=500 retunes the throttle on the live task in place; it does not restart or pause the copy. A higher ceiling takes effect immediately, while a lower one applies once the current batch finishes. Confirm either by re-reading status.requests_per_second on the next poll. Setting requests_per_second=-1 removes throttling entirely.

Related

Tracking Reindex Progress & Performance — the parent topic that turns these same counters into an automated control loop.
Tuning Bulk Request Size for High-Throughput Reindexing — the throughput math behind the requests_per_second values you set here.
Handling Version Conflicts with External Versioning — routing the conflicts the failures array surfaces.
Handling ILM Step Execution Failures Programmatically — when _ilm/explain shows the target stuck rather than the reindex.
