Implementing retry logic for stuck ISM transitions

This guide builds bounded, backoff-driven retry logic around the OpenSearch Index State Management (ISM) _retry API so a phase transition wedged in failed_step recovers automatically without hammering OpenSearch’s cluster-manager into a retry storm.

ISM executes phase actions on an asynchronous background scheduler that offers no native retry guarantees beyond the fixed retry block inside the policy. When a disk watermark is breached, a Cross-Cluster Replication (CCR) follower holds a write lock, or shard allocation stalls, the managed index parks in a failed step and stays there, silently accruing writes and drifting from its intended lifecycle. Recovering a fleet of such indices by hand does not scale, and a naive script that fires _retry in a tight while loop overwhelms OpenSearch’s cluster-manager the moment more than a handful of indices are stuck. This procedure wraps the _retry API in deterministic state polling, exponential backoff with jitter, and a circuit breaker, extending the resilient-client patterns of the parent Error Handling & Retries model and the broader ISM Policy Implementation & Python Automation workflow.

Prerequisites

Confirm each item before wiring up automated retries. A retry against an unmet infrastructure constraint just re-queues the same failure and inflates retry_failed_count.

The ISM plugin and opensearch-job-scheduler are installed on every data and cluster-manager node, and you know plugins.index_state_management.job_interval (default 5 minutes, i.e. 300s) so backoff windows can be tuned to exceed it.
The automation service account holds fine-grained access to _plugins/_ism/explain and _plugins/_ism/retry, scoped per Security & Access Boundaries.
Disk watermarks (low, high, flood_stage) are set for your hardware, since a breached flood_stage raises cluster_block_exception and holds the index read-only before any retry can land.
You can identify whether a target index is a CCR follower — a follower cannot execute rollover, shrink, or delete locally, so retrying it produces an endless replication_conflict.
The intended phase sequence is documented, per Index Lifecycle Basics, so a forced-state retry advances to the correct phase instead of skipping one.

Step-by-step procedure

1. Isolate the stuck indices from the explain API

The explain endpoint is the only authoritative source of ISM execution status; _cat/indices never surfaces it. Query the pattern and read state, action, and step for each managed index:

HTTP

GET _plugins/_ism/explain/<index_pattern>?pretty

Expected output for a transition wedged on rollover:

JSON

{
  "logs-prod-2026.07.04-000001": {
    "index.plugins.index_state_management.policy_id": "logs-hot-warm-cold",
    "state": { "name": "hot", "start_time": 1751600000000 },
    "action": { "name": "rollover", "failed": true },
    "step": { "name": "attempt_rollover", "status": "failed" },
    "retry_failed_count": 3,
    "info": {
      "message": "Rollover failed",
      "cause": "cluster_block_exception … [FORBIDDEN/12/index read-only / allow delete (api)]"
    }
  },
  "total_managed_indices": 1
}

Gotcha: a stuck transition is signalled by action.failed: true or step.status: "failed", never by a state literally named FAILED. Index names are top-level keys sitting alongside the integer total_managed_indices — filter that key out before iterating, or your loop will crash on a non-dict value.
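The filtering rule in the gotcha above can be sketched as a small pure function. This is a minimal sketch that assumes the response has the shape of the sample output; find_stuck_indices is a hypothetical helper name (not part of opensearch-py), and the second index in the sample data is invented for illustration.

```python
from typing import Any, Dict, List


def find_stuck_indices(explain: Dict[str, Any]) -> List[str]:
    """Return names of managed indices whose action or step is marked failed."""
    stuck = []
    for name, meta in explain.items():
        # total_managed_indices is an integer sitting beside the index entries.
        if not isinstance(meta, dict):
            continue
        action_failed = bool((meta.get("action") or {}).get("failed"))
        step_failed = (meta.get("step") or {}).get("status") == "failed"
        if action_failed or step_failed:
            stuck.append(name)
    return stuck


# Sample shaped like the explain output above; the second index is hypothetical.
sample = {
    "logs-prod-2026.07.04-000001": {
        "state": {"name": "hot"},
        "action": {"name": "rollover", "failed": True},
        "step": {"name": "attempt_rollover", "status": "failed"},
    },
    "logs-prod-2026.07.05-000002": {
        "state": {"name": "hot"},
        "action": {"name": "rollover", "failed": False},
        "step": {"name": "attempt_rollover", "status": "completed"},
    },
    "total_managed_indices": 2,
}

print(find_stuck_indices(sample))
```

Keeping the filter separate from the HTTP call makes it easy to unit-test against captured explain responses.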

2. Classify the failure signature before retrying

A retry only helps once the underlying block is cleared. Map info.cause to a remediation path and split persistent causes from transient ones:

index_not_found_exception — a missing or misconfigured rollover write alias. Attach the alias, then retry; retrying without it just re-fails.
cluster_block_exception / read-only / allow delete — a flood_stage watermark breach forced the index read-only. Free disk or scale the tier, clear the block with PUT /<index>/_settings {"index.blocks.read_only_allow_delete": null}, then retry.
replication_exception / replication_conflict — the index is a CCR follower holding a write lock. Drive the transition from the leader or stop replication; never retry the follower.
timeout_exception on force_merge / shrink — the action exceeded its window. Raise the action-level timeout (for example "timeout": "1h" on force_merge) rather than retrying at the default.

Gotcha: transient causes (a brief master_not_discovered_exception) clear on their own and are exactly what backoff is for; persistent causes (no node carries the tier attribute) re-fail every retry. Classify before you automate, or you build an infinite loop — the same discipline the sibling handling async ISM policy execution failures walkthrough enforces at detection time.
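One way to encode this classification is a lookup from cause substrings to a category and a remediation hint. The sketch below is illustrative only: the signatures and remediation text come from the list above, while classify_cause, SIGNATURES, and the category labels are hypothetical names chosen for this example.

```python
from typing import List, Tuple

# (substring, category, remediation). The first match wins.
# Category labels are this sketch's own naming, not ISM terminology.
SIGNATURES: List[Tuple[str, str, str]] = [
    ("index_not_found_exception", "persistent", "attach the rollover write alias, then retry"),
    ("cluster_block_exception", "persistent", "free disk, clear read_only_allow_delete, then retry"),
    ("replication_", "persistent", "drive the transition from the leader; never retry the follower"),
    ("timeout_exception", "persistent", "raise the action-level timeout, then retry"),
    ("master_not_discovered_exception", "transient", "retry with backoff"),
]


def classify_cause(cause: str) -> Tuple[str, str]:
    """Map an info.cause string to (category, remediation hint)."""
    text = (cause or "").lower()
    for needle, category, remediation in SIGNATURES:
        if needle in text:
            return category, remediation
    # Unknown causes go to a human rather than into the retry loop.
    return "unknown", "inspect manually before retrying"


print(classify_cause("cluster_block_exception [FORBIDDEN/12/index read-only / allow delete (api)]"))
print(classify_cause("master_not_discovered_exception"))
```

Routing unknown causes to manual inspection is a deliberately conservative default: only causes explicitly marked transient should reach the automated retry path.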

3. Invoke the _retry API to clear the failure flag

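Once the block is cleared, re-queue the failed step explicitly instead of waiting for the index to recover by itself; a managed index parked in a failed step stays there until it is retried. An empty body re-runs the failed step in the current state and resets the failure flag: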

HTTP

POST _plugins/_ism/retry/<index_pattern>

To skip a permanently blocked action and jump straight to a known-good phase, pass an explicit state:

HTTP

POST _plugins/_ism/retry/<index_pattern>
{
  "state": "warm"
}

Expected response — failures: false confirms the flag was cleared and the step re-queued:

JSON

{ "updated_indices": 1, "failures": false, "failed_indices": [] }

Gotcha: the API only re-queues work; execution still happens on the next scheduler tick. A non-empty failed_indices array lists indices that rejected the retry due to active locks, missing aliases, or a policy/index mismatch — those need Step 2 remediation, not another retry. Supplying a state bypasses the blocked action entirely, so use it only when skipping that action is genuinely safe for the lifecycle in Index Lifecycle Basics.
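Because a rejected retry is reported in the response body rather than only through an error status, it helps to inspect failures and failed_indices explicitly. The following is a minimal sketch; split_retry_response is a hypothetical helper, and the inner fields of the rejected entry in bad_resp are illustrative assumptions, not a documented schema.

```python
from typing import Any, Dict, List, Tuple


def split_retry_response(resp: Dict[str, Any]) -> Tuple[bool, List[Any]]:
    """Return (fully_accepted, rejected_entries) for a _retry response body."""
    rejected = list(resp.get("failed_indices") or [])
    fully_accepted = not resp.get("failures") and not rejected
    return fully_accepted, rejected


# Response copied from the expected output above.
ok_resp = {"updated_indices": 1, "failures": False, "failed_indices": []}

# Hypothetical rejected response; the entry's inner fields are illustrative only.
bad_resp = {
    "updated_indices": 0,
    "failures": True,
    "failed_indices": [
        {"index_name": "logs-prod-2026.07.04-000001", "reason": "example reason"}
    ],
}

print(split_retry_response(ok_resp))
print(split_retry_response(bad_resp)[0])
```

Anything returned in the rejected list belongs in the Step 2 remediation queue rather than in the next retry round.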

4. Wrap retries in exponential backoff with jitter

Firing _retry in a tight loop across a stuck fleet floods OpenSearch’s cluster-manager with concurrent state writes. Bound each attempt with exponential backoff and uniform jitter so retries spread out and never synchronise into a thundering herd:

Python

import time
import random

def calculate_backoff(attempt: int, base_delay: float = 10.0, max_delay: float = 300.0) -> float:
    """Exponential backoff with uniform jitter to prevent synchronised retry bursts."""
    delay = min(base_delay * (2 ** attempt), max_delay)
    return delay * (0.5 + random.random() * 0.5)  # jitter in [0.5x, 1.0x]

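Gotcha: set max_delay at or above job_interval (300s by default). If the backoff ceiling is shorter than a scheduler tick, the loop retries before the plugin has even attempted the re-queued step, so every observation still reads failed and the counter climbs for no reason. Two details of the helper matter here: the jitter factor can cut a capped delay to half of max_delay, and with base_delay=10.0 the cap is first reached at attempt 5 (10, 20, 40, 80, 160, 300 seconds before jitter), so the early attempts always fall inside a single tick. Raise base_delay or leave headroom in max_delay if each observation must span a full tick.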
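To see how the default parameters relate to a 300s tick, the pre-jitter schedule can be tabulated. This is a small worked example; backoff_ceiling is a hypothetical helper that mirrors the first line of calculate_backoff without the random jitter, so its values are the upper bound of what calculate_backoff can return.

```python
def backoff_ceiling(attempt: int, base_delay: float = 10.0, max_delay: float = 300.0) -> float:
    """Pre-jitter delay: the upper bound of what calculate_backoff can return."""
    return min(base_delay * (2 ** attempt), max_delay)


for attempt in range(6):
    upper = backoff_ceiling(attempt)
    lower = upper * 0.5  # jitter factor is drawn from [0.5, 1.0)
    print(f"attempt {attempt}: {lower:.0f}s to {upper:.0f}s")
```

With these defaults the 300s cap is first reached at attempt index 5, so a loop limited to five attempts never waits longer than 160s between rounds.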

5. Drive the fleet with a bounded, circuit-broken worker

Manual polling does not scale past a handful of indices. This opensearch-py worker polls the explain endpoint, filters to genuinely failed indices, issues a bounded retry per index, and trips a circuit breaker when consecutive rounds make no progress — the safeguard that separates recovery from a self-inflicted retry storm:

Python

# Reuses calculate_backoff() from step 4; define or import it alongside this worker.
import time
import logging
from typing import Dict, List
from opensearchpy import OpenSearch
from opensearchpy.exceptions import TransportError

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("ism_retry_automation")

def retry_stuck_ism_transitions(
    client: OpenSearch,
    pattern: str,
    max_retries: int = 5,
    base_delay: float = 10.0,
    breaker_threshold: int = 3,
) -> Dict[str, List[str]]:
    """Poll ISM explain, retry failed indices, and halt if no progress is made."""
    results: Dict[str, List[str]] = {"succeeded": [], "failed": []}
    stalled_rounds = 0

    for attempt in range(max_retries):
        try:
            explain = client.transport.perform_request(
                "GET", f"/_plugins/_ism/explain/{pattern}"
            )
        except TransportError as e:
            logger.error("Explain request failed: %s", e)
            time.sleep(calculate_backoff(attempt, base_delay))
            continue

        # Index names are top-level keys; total_managed_indices is an int — skip it.
        stuck = [
            idx for idx, d in explain.items()
            if isinstance(d, dict)
            and (d.get("step", {}).get("status") == "failed" or d.get("action", {}).get("failed"))
        ]

        if not stuck:
            logger.info("No stuck transitions remain. Exiting retry loop.")
            break

        progressed = False
        for idx in stuck:
            cause = str(explain[idx].get("info", {}).get("cause", "")).lower()
            # Persistent blocks must be cleared out-of-band, never retried.
            if "replication" in cause or "read-only" in cause:
                logger.warning("Skipping %s — needs manual clearance", idx)
                if idx not in results["failed"]:
                    results["failed"].append(idx)
                continue
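            try:
                resp = client.transport.perform_request(
                    "POST", f"/_plugins/_ism/retry/{idx}", body={}
                )
            except TransportError as e:
                logger.warning("Retry rejected for %s: %s", idx, e)
                if idx not in results["failed"]:
                    results["failed"].append(idx)
                continue
            # A successful HTTP response can still report a rejected retry in its body.
            if resp.get("failures"):
                logger.warning("Retry rejected for %s: %s", idx, resp.get("failed_indices"))
                if idx not in results["failed"]:
                    results["failed"].append(idx)
                continue
            logger.info("Queued retry for %s", idx)
            if idx not in results["succeeded"]:
                results["succeeded"].append(idx)
            progressed = True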

        # Circuit breaker: bail out if repeated rounds clear nothing.
        stalled_rounds = 0 if progressed else stalled_rounds + 1
        if stalled_rounds >= breaker_threshold:
            logger.error("Circuit breaker tripped after %d stalled rounds — halting.", stalled_rounds)
            break

        if attempt < max_retries - 1:
            delay = calculate_backoff(attempt, base_delay)
            logger.info("Round %d done; backing off %.1fs.", attempt + 1, delay)
            time.sleep(delay)

    return results

Gotcha: the worker deliberately skips CCR and read-only causes rather than retrying them — those are the classic infinite-loop signatures. Deduplicate succeeded/failed before reporting, and schedule the worker as a Kubernetes CronJob at an interval longer than job_interval so each run observes the effect of the last retry rather than stacking duplicates. Align the retry trigger with real shard pressure using Configuring index size and age thresholds for rollover so actions stop failing in the first place.
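The stalled-round counter inside the worker can also be factored into a small reusable object. This is a sketch of the same counting logic under a hypothetical name, StallBreaker; it is not an opensearch-py class.

```python
class StallBreaker:
    """Counts consecutive rounds without progress and trips at a threshold."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.stalled_rounds = 0

    def record(self, progressed: bool) -> bool:
        """Record one round; return True when the breaker has tripped."""
        self.stalled_rounds = 0 if progressed else self.stalled_rounds + 1
        return self.stalled_rounds >= self.threshold


breaker = StallBreaker(threshold=3)
# Simulated rounds: one productive round, then three that clear nothing.
outcomes = [breaker.record(p) for p in (True, False, False, False)]
print(outcomes)  # [False, False, False, True]
```

One possible refinement is to pass progressed=True only when the set of stuck indices actually shrank since the previous round, which is a stricter definition of progress than a retry merely being accepted.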

6. Clear CCR follower locks before retrying primary transitions

When CCR is active, follower indices hold strict write locks that block primary-side transitions regardless of backoff. Check replication status before issuing any retry:

HTTP

GET _plugins/_replication/<follower_index>/_status

If the follower is SYNCING, defer the retry until replication is PAUSED or stopped:

HTTP

POST _plugins/_replication/<follower_index>/_pause
{ "reason": "draining before ISM transition retry" }

Gotcha: retrying a SYNCING follower burns retries against a lock that only the leader can release. Pause or stop replication first, drive the phase action from the leader, then resume — the split-brain-safe ordering that keeps follower and leader lifecycles from diverging.
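The status check can be turned into a guard that the retry worker consults before touching an index. This is a minimal sketch: should_defer_retry is a hypothetical helper, it assumes the status response carries a top-level status field, and treating BOOTSTRAPPING as blocking is an assumption added here, not something stated above.

```python
from typing import Any, Dict

# Statuses during which a retry should be deferred. SYNCING comes from the text
# above; BOOTSTRAPPING is an assumption added for the initial-copy phase.
BLOCKING_STATUSES = {"SYNCING", "BOOTSTRAPPING"}


def should_defer_retry(status_doc: Dict[str, Any]) -> bool:
    """True when the follower's replication status means a retry would be wasted."""
    status = str(status_doc.get("status", "")).upper()
    return status in BLOCKING_STATUSES


print(should_defer_retry({"status": "SYNCING"}))  # True
print(should_defer_retry({"status": "PAUSED"}))   # False
```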

Verification

Confirm the retry cleared the failure and the index resumed its lifecycle.

Confirm the step completed and the counter reset:

Shell

curl -s "https://<cluster-endpoint>:9200/_plugins/_ism/explain/logs-prod-*?pretty"

A healthy result shows step.status: "completed" (or an advanced state.name), action.failed: false, and retry_failed_count: 0. A counter still climbing means the Step 2 root cause is not yet cleared and the circuit breaker will soon trip.

Confirm no lingering write block held the retry off:

Shell

curl -s "https://<cluster-endpoint>:9200/logs-prod-2026.07.04-000001/_settings?filter_path=**.blocks"

An empty object ({}) is correct. A read_only_allow_delete: "true" means a watermark breach still holds the index read-only and no retry will land until it is cleared.

Confirm the retry burst did not cascade into unassigned shards:

Shell

curl -s "https://<cluster-endpoint>:9200/_cluster/health/logs-prod-2026.07.04-000001?pretty"

Expect "status": "green" and "unassigned_shards": 0. A drop to red during the retry window points at an allocation failure that needs capacity, not another retry.
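The three checks above can be combined into one verdict for scripting. This is a sketch under stated assumptions: verify_recovery is a hypothetical helper that takes the already-parsed JSON bodies of the three requests (the per-index explain entry, the filtered settings response, and the health response), and it treats any surviving entry in the filtered settings as a remaining block.

```python
from typing import Any, Dict, List


def verify_recovery(
    explain_entry: Dict[str, Any],
    settings_doc: Dict[str, Any],
    health_doc: Dict[str, Any],
) -> List[str]:
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    if (explain_entry.get("action") or {}).get("failed"):
        problems.append("action still marked failed")
    if (explain_entry.get("step") or {}).get("status") == "failed":
        problems.append("step still marked failed")
    # With filter_path=**.blocks, an index without blocks yields an empty object.
    if settings_doc:
        problems.append("index block still present")
    if health_doc.get("status") != "green":
        problems.append(f"health status is {health_doc.get('status')}")
    if health_doc.get("unassigned_shards", 0) != 0:
        problems.append("unassigned shards remain")
    return problems


healthy = verify_recovery(
    {"action": {"failed": False}, "step": {"status": "completed"}},
    {},
    {"status": "green", "unassigned_shards": 0},
)
print(healthy)
```

An empty list signals that the index can be dropped from the retry queue; any entry points back at the matching row in the common failures table.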

Common failures

| Symptom | Root cause | Fix command |
| --- | --- | --- |
| `action.failed: true`, index read-only | `flood_stage` watermark breach set the index read-only | `PUT /<index>/_settings {"index.blocks.read_only_allow_delete": null}` then `POST _plugins/_ism/retry/<index>` |
| `_retry` returns `failures: true`, `index_not_found_exception` | Missing or misrouted rollover write alias | Attach the write alias, then `POST _plugins/_ism/retry/<index>` |
| `replication_exception` on a follower, retries never clear | CCR follower cannot execute the phase action locally | `POST _plugins/_replication/<index>/_pause` then drive the transition from the leader |
| `retry_failed_count` climbs, shards `UNASSIGNED` | No node carries the target tier routing attribute | `GET _cluster/allocation/explain` then add capacity or a fallback attribute before retrying |
| Retry succeeds but re-fails every tick | Backoff ceiling shorter than `job_interval`; loop retries before the tick runs | Raise `max_delay` to ≥ `job_interval` (≥ 300s) and widen the policy `retry` backoff |

Frequently asked questions

Does an empty _retry body re-run the whole state or just the failed step?

An empty body re-executes only the failed step in the current state and clears the failure flag. Passing {"state": "<name>"} skips the blocked action and transitions the index directly to the named state, which is how you bypass a permanently failing action like shrink or force_merge without stalling the lifecycle.

Why does my retry keep failing with the same cluster_block_exception?

The _retry API re-queues the action but does not clear the underlying block. A cluster_block_exception referencing read-only / allow delete is a flood_stage watermark breach — free disk (or raise the watermark temporarily), clear index.blocks.read_only_allow_delete, and only then retry. Retrying before the block is cleared just increments retry_failed_count.

How large should max_delay be relative to the job interval?

Set the backoff ceiling at or above plugins.index_state_management.job_interval (default 300s). ISM only attempts the re-queued step on its next tick, so a shorter ceiling makes the loop poll before the plugin has done any work — every read still shows failed and the retry budget drains for nothing.

Why add a circuit breaker instead of just increasing max_retries?

More retries against a structural block (a tier with no capacity, a follower that cannot roll over) only prolong the storm. A circuit breaker halts after a set number of rounds that clear nothing and alerts a human, converting an infinite loop into a bounded, observable incident. Raise max_retries only for transient causes that backoff can actually outlast.

Related guides

Handling async ISM policy execution failures: detecting the silent scheduler-tick failures this retry loop recovers from.
Configuring index size and age thresholds for rollover: aligning triggers with capacity so transitions stop wedging.
Python automation for dynamic ISM policy updates: pushing a corrected retry block to already-managed indices.
