Tuning Bulk Request Size for High-Throughput Reindexing

High-throughput reindexing behaves predictably only when the bulk request size is treated as a dynamic control variable, adjusted against live cluster pressure, rather than a static number baked into a script.

Getting this one knob wrong is what turns a routine copy into an incident: a batch that is comfortable on an idle cluster can trip the parent circuit breaker under concurrent ingest, saturate the write thread pool, inflate scroll contexts and bulk buffers until the node hits an OutOfMemoryError, and stall the target’s rollover conditions mid-migration. This page is the per-request sizing detail under Optimizing Reindex Thresholds & Bulk Sizes. Where that broader guide calibrates requests_per_second and shard sizing, this one answers the narrow question of how many documents, and how many bytes, to put in each bulk request, and how to change that answer while the copy is running.

Prerequisites

Elasticsearch 8.x cluster with dedicated data nodes and monitoring access to GET _cat/thread_pool/write, GET _nodes/stats/breaker, and GET _tasks.
elasticsearch-py v8.0+ — this page uses the v8 client surface (helpers.scan, client.bulk, keyword-argument requests), not the legacy body= pattern.
The target index provisioned with number_of_replicas: 0 and refresh_interval: -1 for the copy, with its lifecycle policy and rollover alias attached at creation time.
A staging index whose documents match production byte size — sizing against 2 KB docs is meaningless if production averages 40 KB.

Establish a diagnostic baseline first

Never tune bulk size against a deployment you have not measured. Capture write-pool saturation and breaker headroom before the first batch so every later reading is a delta, not a guess:

# 1. Routing must be stable — rebalancing steals the write I/O you are about to measure.
GET _cluster/health?wait_for_status=green&timeout=10s

# 2. Write-pool pressure as JSON (omit ?v, which only formats the text output).
GET _cat/thread_pool/write?format=json&h=node_name,active,queue,rejected,completed

# 3. Breaker limit vs. estimated usage — your byte ceiling has to live under this.
GET _nodes/stats/breaker?filter_path=nodes.*.breakers.parent

A healthy starting point looks like this:

{
  "cluster_name": "prod-search-01",
  "status": "green",
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0
}

If status is yellow or relocating_shards is above 0, stop. Shard rebalancing consumes the same write I/O the reindex needs and will inflate rejected counts, making the bulk size look too large when the real culprit is allocation.
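The stop condition above can be turned into a small gate that runs before the first batch. This is a minimal sketch, not part of the original script: `baseline_ok` is a hypothetical helper name, and checking `initializing_shards` and `unassigned_shards` in addition to `relocating_shards` is an assumption that goes slightly beyond the rule stated in the text.

```python
def baseline_ok(health: dict) -> tuple[bool, str]:
    """Return (ok, reason) for a _cluster/health response body."""
    status = health.get("status")
    if status != "green":
        return False, f"status is {status!r}, not green"
    # Assumption: any shard movement at all is treated as a reason to wait.
    for key in ("relocating_shards", "initializing_shards", "unassigned_shards"):
        if health.get(key, 0) > 0:
            return False, f"{key}={health[key]}"
    return True, "ok"


healthy = {"status": "green", "relocating_shards": 0,
           "initializing_shards": 0, "unassigned_shards": 0}
print(baseline_ok(healthy))                                      # (True, 'ok')
print(baseline_ok({"status": "green", "relocating_shards": 2}))  # (False, 'relocating_shards=2')
```

With a live client, the input would presumably come from `client.cluster.health()`; the function itself only inspects the response body, so it can be exercised against captured output.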

Symptom-to-cause mapping

Reindex failures are not random; they fire at deterministic thresholds tied to heap, the write queue, and ILM evaluation cycles. Map the symptom to its diagnostic and its containment before you change anything:

Symptom	Root cause	Diagnostic	Immediate fix
`es_rejected_execution_exception` on the `write` pool	Batch bytes saturate the bounded write queue	`GET _cat/thread_pool/write?format=json&h=node_name,queue,rejected`	Cut batch to 500–1000 docs (~5 MB); add `requests_per_second` throttle
`circuit_breaking_exception [parent]`	In-flight bulk bytes + fielddata exceed the breaker	`GET _nodes/stats/breaker?filter_path=nodes.*.breakers.parent`	Cap `max_chunk_bytes` at `10_000_000`; set `refresh_interval: -1` on target
`OutOfMemoryError: Java heap space`	Scan `size` too large, or too many open scroll contexts	`GET _tasks?detailed=true&actions=*reindex`	Lower `helpers.scan` `size`; shorten `scroll` to `"2m"`
ILM `rollover` stuck in `WAITING`	Slow bulk writes delay `max_age`/`max_docs` evaluation	`GET <target-index>/_ilm/explain`	Throttle the copy; inspect the stuck step

The recurring lesson across every row: size the batch by serialized bytes, not by document count. A 500-document batch of 2 KB events is a harmless 1 MB request; the same 500 documents at 40 KB each is a 20 MB request that trips the breaker before serialization even finishes.
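The arithmetic behind that comparison is simple enough to keep next to the sizing decision. The sketch below assumes decimal units (1 KB = 1,000 bytes) and an average document size measured on the staging index; `batch_bytes` and `docs_for_budget` are illustrative names, not part of any library.

```python
KB, MB = 1_000, 1_000_000


def batch_bytes(doc_count: int, avg_doc_bytes: int) -> int:
    """Approximate serialized size of a bulk request body, ignoring action headers."""
    return doc_count * avg_doc_bytes


def docs_for_budget(byte_budget: int, avg_doc_bytes: int) -> int:
    """Largest document count that fits the byte budget (at least one document)."""
    return max(1, byte_budget // avg_doc_bytes)


print(batch_bytes(500, 2 * KB) / MB)      # 1.0
print(batch_bytes(500, 40 * KB) / MB)     # 20.0
print(docs_for_budget(5 * MB, 40 * KB))   # 125
```

Working backwards from a byte budget to a document count, rather than the other way round, keeps the starting batch size tied to the measured document size.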

Implementation: adaptive, byte-bounded bulk sizing

The server-side _reindex API exposes a single batch knob — source.size — which governs both the scroll read batch and the bulk write batch together; there is no separate read/write split. That is enough when you only need a straight copy:

POST /_reindex?wait_for_completion=false&requests_per_second=2000&slices=auto
{
  "source": { "index": "events-2026-000001", "size": 1000 },
  "dest":   { "index": "events-2026-000002", "op_type": "create" },
  "conflicts": "proceed"
}

When you need the read batch and the write batch sized independently — or you want the batch to react to rejections in real time — drive the copy from the client. The script below reads with helpers.scan (a single scroll cursor; the v8 helper has no slices argument) and flushes with client.bulk, flushing on whichever comes first: the current document count or a hard byte ceiling. After each flush it re-reads the deployment’s write rejections and adjusts the batch — shrinking 25% on any rejection, creeping back up after several clean flushes.

import json
import logging
import time
from elasticsearch import Elasticsearch, ApiError, helpers

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
log = logging.getLogger("adaptive_reindex")


def cluster_write_rejected(client: Elasticsearch) -> int:
    # _cat/thread_pool rendered as JSON; sum `rejected` across every write pool.
    rows = client.cat.thread_pool(thread_pool_patterns="write", format="json", h="rejected")
    return sum(int(r["rejected"]) for r in rows)


def adaptive_reindex(
    client: Elasticsearch,
    source_index: str,
    dest_index: str,
    size: int = 500,                    # starting docs per bulk request
    min_size: int = 100,
    max_size: int = 4000,
    max_chunk_bytes: int = 15_000_000,  # 15 MB hard ceiling per request
) -> int:
    """Copy source -> dest, shrinking the bulk batch when the write pool rejects
    and growing it when pressure clears. op_type 'create' makes re-runs idempotent:
    already-copied _ids come back as 409s and are skipped, not overwritten."""
    processed = 0
    prev_rejected = cluster_write_rejected(client)
    ops: list[dict] = []
    doc_count = byte_count = clean_flushes = 0

    def flush() -> None:
        nonlocal ops, doc_count, byte_count, size, prev_rejected, processed, clean_flushes
        if not ops:
            return
        try:
            # client.bulk returns a response dict in v8 — never a tuple.
            resp = client.bulk(operations=ops, refresh=False)
        except ApiError as e:
            if e.status_code == 429:  # parent breaker or write-queue saturation
                size = max(min_size, size // 2)
                log.warning("429 on bulk; halving to size=%d and backing off 10s", size)
                time.sleep(10)
                resp = client.bulk(operations=ops, refresh=False)  # retry same batch
            else:
                raise

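        hard = []
        if resp.get("errors"):
            # 409s are expected on idempotent re-runs; any other non-2xx item is a real failure.
            hard = [i for i in resp["items"]
                    if next(iter(i.values())).get("status", 200) not in (200, 201, 409)]
            if hard:
                log.warning("%d non-conflict failure(s) in batch of %d", len(hard), doc_count)
        # Count only documents that were written or already present, not hard failures.
        processed += doc_count - len(hard)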

        now_rejected = cluster_write_rejected(client)
        if now_rejected > prev_rejected:
            size = max(min_size, int(size * 0.75))      # back off 25% on any rejection
            clean_flushes = 0
            log.info("rejections +%d -> size=%d", now_rejected - prev_rejected, size)
        else:
            clean_flushes += 1
            if clean_flushes >= 5 and size < max_size:
                size = min(max_size, int(size * 1.10))  # creep up after 5 clean flushes
                clean_flushes = 0
        prev_rejected = now_rejected
        ops, doc_count, byte_count = [], 0, 0

    for hit in helpers.scan(
        client, index=source_index, query={"query": {"match_all": {}}},
        scroll="5m", size=1000, request_timeout=60,   # scan `size` is independent of bulk `size`
    ):
        header = {"create": {"_index": dest_index, "_id": hit["_id"]}}
        source = hit["_source"]
        ops.extend((header, source))
        doc_count += 1
        byte_count += len(json.dumps(header)) + len(json.dumps(source))
        if doc_count >= size or byte_count >= max_chunk_bytes:
            flush()
    flush()  # trailing partial batch

    log.info("reindex complete: %d documents copied to %s", processed, dest_index)
    return processed


if __name__ == "__main__":
    es = Elasticsearch("https://prod-cluster:9200", api_key="REDACTED", verify_certs=True)
    adaptive_reindex(es, "events-2026-000001", "events-2026-000002")

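The byte accumulator is what keeps this safe across heterogeneous document sizes: whether the source holds 2 KB events or 40 KB nested documents, a request can overshoot max_chunk_bytes by at most the one document that triggered the flush, because the ceiling is checked after each append. That keeps the coordinating node from receiving the oversized payloads that push the parent breaker over its limit.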
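The script checks the ceiling after appending each document, so a batch can exceed `max_chunk_bytes` by one document. If a strict ceiling matters, one option is to test the size before appending. The generator below is a sketch of that variant; `byte_bounded_batches` is a hypothetical helper, and measuring with `json.dumps` is an approximation of what the client actually serializes.

```python
import json
from typing import Iterable, Iterator


def byte_bounded_batches(
    docs: Iterable[dict], max_docs: int, max_bytes: int
) -> Iterator[list[dict]]:
    """Yield batches that stay within max_docs and max_bytes.

    The check runs before a document is appended, so a batch exceeds
    max_bytes only when a single document is itself larger than the ceiling.
    """
    batch: list[dict] = []
    batch_bytes = 0
    for doc in docs:
        doc_bytes = len(json.dumps(doc).encode("utf-8"))
        if batch and (len(batch) >= max_docs or batch_bytes + doc_bytes > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += doc_bytes
    if batch:
        yield batch  # trailing partial batch


docs = [{"msg": "x" * 100} for _ in range(10)]   # each serializes to 111 bytes
print([len(b) for b in byte_bounded_batches(docs, max_docs=4, max_bytes=10_000)])  # [4, 4, 2]
print([len(b) for b in byte_bounded_batches(docs, max_docs=100, max_bytes=250)])   # [2, 2, 2, 2, 2]
```

The first call is limited by document count, the second by bytes: two 111-byte documents fit under 250 bytes and a third would not.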

Verification

Confirm the copy is progressing and staying inside its limits — not merely that it started. Keep two counters open while it runs:

GET _cat/thread_pool/write?v&h=node_name,active,queue,rejected
GET _nodes/stats/breaker?filter_path=nodes.*.breakers.parent

rejected is a cumulative counter since node start, so it should stay flat against your baseline rather than climb, and the parent breaker’s tripped count must not increment. When the copy finishes, prove parity and lifecycle adoption before decommissioning the source:

GET /events-2026-000002/_count
GET /events-2026-000002/_ilm/explain

_count on target and source must match (allowing for documents you intentionally dropped). _ilm/explain must report the index as managed:

{
  "indices": {
    "events-2026-000002": {
      "managed": true,
      "policy": "events-ilm",
      "phase": "hot",
      "action": "rollover",
      "step": "check-rollover-ready"
    }
  }
}

A target reporting "managed": false copied documents but never adopted your policy — it will ignore retention and never roll over. Live progress instrumentation for the running task is covered under tracking reindex progress and performance.
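The two post-copy checks can be folded into one function that works on the raw response bodies. This is a sketch under stated assumptions: `verify_copy` and its `dropped` parameter are hypothetical, and the counts are expected to come from `_count` on each index.

```python
def verify_copy(source_count: int, target_count: int, ilm_explain: dict,
                index: str, dropped: int = 0) -> list[str]:
    """Return a list of problems; an empty list means the copy checks out."""
    problems = []
    # Parity: target plus intentionally dropped documents must equal the source.
    if target_count + dropped != source_count:
        problems.append(
            f"count mismatch: source={source_count}, target={target_count}, dropped={dropped}"
        )
    # Lifecycle adoption: the target must be reported as managed.
    info = ilm_explain.get("indices", {}).get(index, {})
    if not info.get("managed", False):
        problems.append(f"{index} is not managed by ILM")
    return problems


explain = {"indices": {"events-2026-000002": {"managed": True, "policy": "events-ilm"}}}
print(verify_copy(1_000, 1_000, explain, "events-2026-000002"))           # []
print(verify_copy(1_000, 990, explain, "events-2026-000002", dropped=10)) # []
print(verify_copy(1_000, 990, {"indices": {}}, "events-2026-000002"))
```

Returning the full list of problems instead of failing on the first one means a single run reports both a count mismatch and a missing policy.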

Gotchas and edge cases

source.size on _reindex is a single knob. It sizes the scroll read and the bulk write together. If you need them tuned separately (large reads, small writes, or vice versa), you must drive the copy client-side as shown above — there is no scroll_size/write-size split on the server-side API.
Bytes trip the breaker, documents do not. Always cap with max_chunk_bytes. A batch that is fine at 500 small documents becomes a 20 MB breaker trip the moment document size grows; let the byte ceiling, not the doc count, be the hard limit.
op_type: "create" returns 409s on re-run, not errors to fix. Idempotent retries surface already-copied documents as version conflicts. Filter status 409 out of your failure count (as the script does) or you will chase phantom failures on every resume.
Rethrottle live instead of cancel-and-resubmit. A running _reindex accepts POST /_reindex/<task_id>/_rethrottle?requests_per_second=… — pace it down under a spike without losing progress rather than killing and restarting the task.
A stuck WAITING rollover after a large copy is usually write pressure, not policy error. Confirm with _ilm/explain; if the step is healthy but blocked, throttle the reindex. Persistent stalls belong under monitoring ILM execution and error states.
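The 409 handling described above comes down to classifying each item in the bulk response. The sketch below is one way to separate expected conflicts from real failures; `split_bulk_items` is an illustrative name, and the sample items are hand-written to mimic the shape of a bulk response rather than captured from a cluster.

```python
def split_bulk_items(items: list[dict]) -> tuple[int, int, list[dict]]:
    """Return (written, conflicts, hard_failures) for a bulk response's items."""
    written = conflicts = 0
    hard: list[dict] = []
    for item in items:
        result = next(iter(item.values()))  # the single action payload, e.g. item["create"]
        status = result.get("status", 200)
        if status in (200, 201):
            written += 1
        elif status == 409:
            conflicts += 1                  # already copied on an earlier run
        else:
            hard.append(result)             # includes per-item 429 rejections
    return written, conflicts, hard


items = [
    {"create": {"_id": "1", "status": 201}},
    {"create": {"_id": "2", "status": 409}},
    {"create": {"_id": "3", "status": 429}},
]
written, conflicts, hard = split_bulk_items(items)
print(written, conflicts, [h["_id"] for h in hard])  # 1 1 ['3']
```

Keeping the hard failures as full payloads, not just a count, leaves room to retry the rejected documents instead of only logging them.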

FAQ

What bulk request size should I start with?

Start at 500 documents and let the byte ceiling do the real limiting. Target a serialized batch of 5–15 MB: with 2–5 KB documents that is roughly 1,000–4,000 docs, and with 40 KB documents it is only a few hundred. Set max_chunk_bytes to 15_000_000 so the batch is capped by bytes rather than count, then adapt the document count downward whenever the write pool starts rejecting.

Why does a batch that worked yesterday trip the circuit breaker today?

Bulk size interacts with everything else on the node. The same request that succeeded on an idle cluster adds its bytes on top of fielddata, segment merges, and concurrent ingest, and the sum crosses the parent breaker limit. Bulk size must scale inversely with concurrent write load — size it for your busiest window, not an empty one, and cap it by bytes with max_chunk_bytes.

Should I raise the scroll read size to make the copy faster?

No. helpers.scan opens a single scroll cursor regardless of its size; a larger size only raises per-fetch heap cost and pushes you toward OutOfMemoryError. Read size and write size are independent — keep the scan size modest (around 1,000) and tune the bulk write batch separately. Parallelism comes from slices on the _reindex API, not from a bigger scroll.

How do I recover a bulk reindex that failed halfway?

Use op_type: "create" (as the adaptive script does) so a re-run is idempotent: documents already written come back as 409 conflicts and are skipped rather than duplicated or overwritten. Persist a checkpoint of the last processed position to an external store if you want to resume mid-scroll instead of rescanning from the start, and filter status 409 out of your error count so expected conflicts do not read as failures.
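A checkpoint store can be very small. The sketch below uses a local JSON file as a stand-in for the external store, which is an assumption, as are the helper names. What goes into `position` depends on how the source is read (for example a sort value when paging in a stable order), and an expired scroll context cannot itself be resumed.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_checkpoint(path: Path, position: dict) -> None:
    """Write the checkpoint atomically so a crash never leaves a half-written file."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, prefix=path.name, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(position, f)
    os.replace(tmp, path)  # atomic rename within the same directory


def load_checkpoint(path: Path) -> Optional[dict]:
    """Return the last saved position, or None when no checkpoint exists yet."""
    if not path.exists():
        return None
    return json.loads(path.read_text())


with tempfile.TemporaryDirectory() as d:
    cp = Path(d) / "reindex.checkpoint.json"
    print(load_checkpoint(cp))                        # None
    save_checkpoint(cp, {"processed": 5000, "last_id": "evt-004999"})
    print(load_checkpoint(cp)["processed"])           # 5000
```

Saving after each successful flush, not after each document, keeps the checkpoint cost small compared with the bulk request itself.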

Related

Optimizing reindex thresholds & bulk sizes — the parent topic that calibrates requests_per_second, slices, and shard sizing around this per-request detail.
Resolving document conflicts during reindex — handling the 409s that op_type: "create" surfaces on idempotent re-runs.
Tracking reindex progress & performance — reading the _tasks API to watch throughput and rethrottle a running copy.
Designing batch reindex workflows — wrapping the tuned copy in a resumable, alias-safe state machine.
