Implementing fallback routing for ISM phase transitions

This guide walks through wiring a deterministic overflow tier into an OpenSearch Index State Management (ISM) policy so that a saturated primary tier degrades an index gracefully instead of stalling its lifecycle mid-transition.

When an ISM policy runs an allocation action to move an index between lifecycle phases, routing allocation failures are the primary cause of stalled indices. Disk watermark breaches, missing node attributes, or conflicting index.routing.allocation rules can halt the transition indefinitely: shards sit UNASSIGNED, downstream rollovers back up, and Cross-Cluster Replication (CCR) checkpoints drift. This procedure sits under Fallback Routing Strategies and builds on the broader OpenSearch ISM Architecture & Fundamentals model. It defines a secondary allocation target for use when primary-tier nodes exceed capacity thresholds, then adds external automation to detect the stall and advance the index into that fallback state.

Prerequisites

Confirm each item before you touch a live policy. A single missing attribute or role silently disables the fallback path.

Tier node roles (data_hot, data_warm, data_cold, data_frozen) are declared per the Node Role Allocation model and verified with _cat/nodes.
Overflow capacity exists on the fallback tier, sized against the ratios in Hot-Warm-Cold Tier Design.
An Index Template v2 sets the creation-time index.routing.allocation.require.* baseline, following Data Tier Routing Patterns.
The automation service account holds FGAC permissions for _plugins/_ism/* and _settings, scoped per Security & Access Boundaries.
Cluster disk watermarks (low, high, flood_stage) are set and understood, since they define when the fallback should fire.
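
The watermark prerequisite can be checked mechanically. The following is a minimal sketch, assuming all three watermarks are expressed as percentages (absolute byte values are not handled here). The helper names parse_percent and watermarks_are_ordered are hypothetical, not part of any OpenSearch client.

```python
def parse_percent(value: str) -> float:
    """Convert a percentage string such as '85%' into a fraction (0.85)."""
    text = value.strip()
    if not text.endswith("%"):
        raise ValueError(f"expected a percentage, got {value!r}")
    return float(text[:-1]) / 100.0


def watermarks_are_ordered(low: str, high: str, flood_stage: str) -> bool:
    """Return True when low < high < flood_stage and all lie within (0, 1]."""
    lo, hi, flood = (parse_percent(v) for v in (low, high, flood_stage))
    return 0.0 < lo < hi < flood <= 1.0


print(watermarks_are_ordered("85%", "90%", "95%"))  # True
print(watermarks_are_ordered("90%", "85%", "95%"))  # False
```

Feed it the values read from GET _cluster/settings (or opensearch.yml) before relying on any threshold derived from them.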

Step-by-step procedure

1. Declare fallback node attributes

Fallback routing relies on explicit, immutable node attributes declared in opensearch.yml. Each data node publishes both a primary tier identifier and a fallback tier identifier; OpenSearch’s cluster routing allocator uses these attributes to evaluate placement constraints during a phase transition.

YAML

# opensearch.yml — warm tier node that also advertises a fallback attribute
node.attr.data_tier: warm
node.attr.fallback_tier: warm_fallback
node.attr.zone: us-east-1a

# Cluster-level routing thresholds
cluster.routing.allocation.disk.watermark.low: 85%
cluster.routing.allocation.disk.watermark.high: 90%
cluster.routing.allocation.disk.watermark.flood_stage: 95%

Restart nodes sequentially to preserve quorum. Gotcha: node.attr.* values are case-sensitive and must be byte-identical across every node in a tier. A single node advertising Warm_Fallback instead of warm_fallback is excluded from the fallback pool without raising an error.

2. Verify attribute propagation

Confirm OpenSearch sees both attributes on every data node before you rely on them:

Shell

curl -s -X GET "localhost:9200/_cat/nodeattrs?v&h=node,ip,attr,value"

Expected output (rows for other attributes omitted): every warm node lists both attributes:

Text

node        ip        attr          value
warm-node-1 10.0.2.11 data_tier     warm
warm-node-1 10.0.2.11 fallback_tier warm_fallback
warm-node-2 10.0.2.12 data_tier     warm
warm-node-2 10.0.2.12 fallback_tier warm_fallback

Gotcha: custom node attributes are reported by _cat/nodeattrs, not by _cat/nodes. If the fallback_tier row is missing for any node, nothing rejects a routing override that targets it; the allocator simply finds no eligible node there and the shards stay where they are. Fix the opensearch.yml value and restart that node before proceeding.
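
The same check can be scripted so that a missing or mis-cased value fails loudly. This is a sketch only: it assumes the rows come from _cat/nodeattrs with format=json and the columns node, attr, and value, and find_attribute_gaps is a hypothetical helper name.

```python
from collections import defaultdict


def find_attribute_gaps(rows, expected):
    """Return {node: [problem, ...]} for nodes whose attributes deviate.

    rows     -- list of dicts with 'node', 'attr' and 'value' keys
    expected -- dict mapping attribute name to its exact expected value
    """
    seen = defaultdict(dict)
    for row in rows:
        seen[row["node"]][row["attr"]] = row["value"]

    gaps = {}
    for node, attrs in sorted(seen.items()):
        problems = []
        for name, want in expected.items():
            have = attrs.get(name)
            if have is None:
                problems.append(f"{name} missing")
            elif have != want:  # exact, case-sensitive comparison
                problems.append(f"{name}={have!r}, expected {want!r}")
        if problems:
            gaps[node] = problems
    return gaps


# Hypothetical sample rows: node 2 is mis-cased, node 3 lacks the attribute.
rows = [
    {"node": "warm-node-1", "attr": "data_tier", "value": "warm"},
    {"node": "warm-node-1", "attr": "fallback_tier", "value": "warm_fallback"},
    {"node": "warm-node-2", "attr": "data_tier", "value": "warm"},
    {"node": "warm-node-2", "attr": "fallback_tier", "value": "Warm_Fallback"},
    {"node": "warm-node-3", "attr": "data_tier", "value": "warm"},
]
gaps = find_attribute_gaps(rows, {"data_tier": "warm", "fallback_tier": "warm_fallback"})
for node, problems in gaps.items():
    print(node, problems)
```

A node that publishes no attributes at all never appears in the rows, so compare the set of reported nodes against your inventory as well.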

3. Define the ISM policy with a fallback state

The core implementation is an ISM policy with a dedicated fallback state. ISM does not support conditional transitions based on allocation outcomes natively, so the fallback_warm state is entered by external remediation (Step 4) rather than an automatic condition. The policy simply defines where a stalled index can go.

HTTP

PUT _plugins/_ism/policies/log_fallback_routing
{
  "policy": {
    "description": "Hot-warm transition with deterministic fallback routing",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [
          { "rollover": { "min_primary_shard_size": "50gb", "min_index_age": "7d" } }
        ],
        "transitions": [
          { "state_name": "warm", "conditions": { "min_index_age": "7d" } }
        ]
      },
      {
        "name": "warm",
        "actions": [
          { "allocation": { "require": { "data_tier": "warm" } } },
          { "shrink": { "num_new_shards": 1 } }
        ],
        "transitions": [
          { "state_name": "cold", "conditions": { "min_index_age": "30d" } }
        ]
      },
      {
        "name": "fallback_warm",
        "actions": [
          { "allocation": { "require": { "fallback_tier": "warm_fallback" } } }
        ],
        "transitions": [
          { "state_name": "cold", "conditions": { "min_index_age": "30d" } }
        ]
      },
      {
        "name": "cold",
        "actions": [
          { "allocation": { "require": { "data_tier": "cold" } } }
        ]
      }
    ]
  }
}

Gotcha: the fallback_warm state keeps the same downstream transition to cold as the normal warm state, so an index that overflows still ages out on schedule. The fallback is a lateral placement change, never a change to the retention timeline defined in Index Lifecycle Basics.
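
Both properties (every transition points at a defined state, and fallback_warm ages out exactly like warm) can be linted before the policy is uploaded. The sketch below works on the policy body as a plain dict; the function names are hypothetical and the sample is a trimmed copy of the policy above with actions omitted.

```python
def undefined_transition_targets(policy: dict) -> set:
    """Return transition targets that no state in the policy defines."""
    states = policy["policy"]["states"]
    defined = {state["name"] for state in states}
    targets = {
        transition["state_name"]
        for state in states
        for transition in state.get("transitions", [])
    }
    return targets - defined


def same_transitions(policy: dict, first: str, second: str) -> bool:
    """Return True when two states share identical transition lists."""
    by_name = {state["name"]: state for state in policy["policy"]["states"]}
    return by_name[first].get("transitions", []) == by_name[second].get("transitions", [])


# Trimmed sample: actions omitted, transitions kept.
SAMPLE_POLICY = {
    "policy": {
        "default_state": "hot",
        "states": [
            {"name": "hot", "transitions": [
                {"state_name": "warm", "conditions": {"min_index_age": "7d"}}]},
            {"name": "warm", "transitions": [
                {"state_name": "cold", "conditions": {"min_index_age": "30d"}}]},
            {"name": "fallback_warm", "transitions": [
                {"state_name": "cold", "conditions": {"min_index_age": "30d"}}]},
            {"name": "cold"},
        ],
    }
}

print(undefined_transition_targets(SAMPLE_POLICY))               # set()
print(same_transitions(SAMPLE_POLICY, "warm", "fallback_warm"))  # True
```

Running this in CI against the policy file catches a renamed state or a drifted retention condition before it reaches the cluster.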

4. Deploy automated detection and remediation

Policy-level states cannot self-trigger on allocation failure, so an external worker closes the loop: it polls the ISM explain API, identifies stuck transitions, nulls the primary require attribute, sets the fallback attribute, then forces the index into fallback_warm. This is the same class of recovery covered in Implementing retry logic for stuck ISM transitions, specialised for routing.

Python

import requests
import time
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")

CLUSTER_ENDPOINT = "https://opensearch-master.internal:9200"
AUTH = ("automation-svc", "changeme")   # FGAC-scoped service account
VERIFY = True                            # True uses the default CA store; set a CA bundle path for a private CA

def get_stuck_indices() -> list:
    """Query the ISM explain API for indices in failed or stuck states."""
    url = f"{CLUSTER_ENDPOINT}/_plugins/_ism/explain"
    response = requests.get(url, auth=AUTH, verify=VERIFY, timeout=10)
    response.raise_for_status()
    data = response.json()

    stuck = []
    # explain returns index names as top-level keys plus a total_managed_indices int.
    for index, entry in data.items():
        if not isinstance(entry, dict):
            continue
        # Unmanaged indices can carry null fields, so fall back to empty values.
        info = entry.get("info") or {}
        action = entry.get("action") or {}
        step = entry.get("step") or {}
        # ISM reports step progress under "step_status".
        if (action.get("failed")
                or step.get("step_status") == "failed"
                or "allocation_failed" in (info.get("message") or "")):
            stuck.append(index)
    return stuck

def apply_fallback_routing(index: str) -> None:
    """Swap the routing attribute and advance ISM to the fallback_warm state."""
    settings_payload = {
        "index.routing.allocation.require.data_tier": None,          # release primary tier
        "index.routing.allocation.require.fallback_tier": "warm_fallback"
    }
    settings_url = f"{CLUSTER_ENDPOINT}/{index}/_settings"
    resp = requests.put(settings_url, json=settings_payload, auth=AUTH, verify=VERIFY, timeout=10)
    resp.raise_for_status()

    change_url = f"{CLUSTER_ENDPOINT}/_plugins/_ism/change_policy/{index}"
    change_payload = {"policy_id": "log_fallback_routing", "state": "fallback_warm"}
    resp = requests.post(change_url, json=change_payload, auth=AUTH, verify=VERIFY, timeout=10)
    resp.raise_for_status()
    logging.info(f"Applied fallback routing and advanced ISM state for {index}")

if __name__ == "__main__":
    while True:
        try:
            stuck_indices = get_stuck_indices()
            if stuck_indices:
                logging.info(f"Detected {len(stuck_indices)} stuck indices. Applying fallback routing...")
                for idx in stuck_indices:
                    apply_fallback_routing(idx)
            time.sleep(360)   # stay above plugins.index_state_management.job_interval (default 5m)
        except requests.exceptions.RequestException as e:
            logging.error(f"Cluster API request failed: {e}")
            time.sleep(60)

Gotcha: keep the poll interval above plugins.index_state_management.job_interval (default 5 minutes). Polling faster than the ISM job scheduler re-detects the same “stuck” index before the plugin has registered the forced state change, producing duplicate change_policy calls. For production, run this as a Kubernetes CronJob or CI/CD-managed worker rather than a bare loop. The endpoint versioning and authentication requirements are documented in the official OpenSearch ISM API reference.
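
A per-index cooldown is one way to guard against duplicate change_policy calls regardless of the poll interval. This is a minimal sketch under assumptions: RemediationCooldown is a hypothetical helper, and its state lives in memory, so a CronJob-style worker would need to persist it between runs.

```python
import time


class RemediationCooldown:
    """Remember when each index was last remediated and suppress repeats."""

    def __init__(self, cooldown_seconds: float, clock=time.monotonic):
        self.cooldown_seconds = cooldown_seconds
        self._clock = clock          # injectable so the logic can be tested
        self._last_seen = {}

    def should_remediate(self, index: str) -> bool:
        """Return True at most once per cooldown window for a given index."""
        now = self._clock()
        last = self._last_seen.get(index)
        if last is not None and now - last < self.cooldown_seconds:
            return False
        self._last_seen[index] = now
        return True


# Usage with a fake clock: two ISM job intervals (600 s) as the window.
fake_now = [0.0]
cooldown = RemediationCooldown(600, clock=lambda: fake_now[0])
print(cooldown.should_remediate("logs-000042"))  # True
fake_now[0] = 300.0
print(cooldown.should_remediate("logs-000042"))  # False
fake_now[0] = 600.0
print(cooldown.should_remediate("logs-000042"))  # True
```

In the worker loop, the call to apply_fallback_routing would be wrapped in a should_remediate check for each index.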

5. Calibrate the fallback trigger threshold

The fallback should fire while the primary tier still has headroom, so shards relocate before the flood stage forces the tier read-only. Model per-node headroom as

$$H = 1 - \frac{U_{disk}}{C_{disk}}$$

where $U_{disk}$ is used bytes and $C_{disk}$ is node capacity. Trigger the overflow while $H > (1 - \text{watermark}_{high})$ — with watermark.high at 90%, that means relocating before free space drops below 10%. Setting the trigger any lower lets the high watermark block primary placement first, which is exactly the stall the fallback exists to prevent.
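
As a worked example of the headroom model, the sketch below computes $H$ for one node and compares it with the floor $1 - \text{watermark}_{high}$. The safety margin of 0.03 and the function names are assumptions for illustration; pick a margin that matches how quickly your tier fills.

```python
def headroom(used_bytes: int, capacity_bytes: int) -> float:
    """H = 1 - U_disk / C_disk, the free fraction of a node's disk."""
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    return 1.0 - used_bytes / capacity_bytes


def should_trigger_fallback(used_bytes: int, capacity_bytes: int,
                            high_watermark: float = 0.90,
                            margin: float = 0.03) -> bool:
    """True once headroom is within `margin` of the high-watermark floor."""
    floor = 1.0 - high_watermark  # 0.10 when watermark.high is 90%
    return headroom(used_bytes, capacity_bytes) <= floor + margin


GB = 10**9
print(round(headroom(880 * GB, 1000 * GB), 2))       # 0.12
print(should_trigger_fallback(880 * GB, 1000 * GB))  # True
print(should_trigger_fallback(800 * GB, 1000 * GB))  # False
```

With 880 GB used of 1000 GB, $H = 0.12$: still above the 0.10 floor, but inside the assumed margin, so the overflow fires while placement on the primary tier is still possible.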

Verification

After the remediation worker acts, confirm the index actually landed on fallback nodes and progressed.

Confirm the allocator chose fallback nodes:

Shell

curl -s -X POST "localhost:9200/_cluster/allocation/explain" \
  -H "Content-Type: application/json" \
  -d '{"index": "<index_name>", "shard": 0, "primary": true}'

Inspect node_allocation_decisions: the selected node must carry attr.fallback_tier: warm_fallback with "node_decision": "yes".
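
That inspection can be automated on the parsed JSON. The sketch below assumes the response shape of the allocation explain API as commonly returned: current_node.attributes for an assigned shard, and node_attributes plus node_decision on each entry of node_allocation_decisions. Treat the field names as assumptions to confirm against your cluster version; the function names are hypothetical.

```python
def landed_on_fallback(explain: dict, attr: str = "fallback_tier",
                       value: str = "warm_fallback") -> bool:
    """True when the shard's current node advertises the fallback attribute."""
    node = explain.get("current_node") or {}
    return (node.get("attributes") or {}).get(attr) == value


def eligible_fallback_nodes(explain: dict, attr: str = "fallback_tier",
                            value: str = "warm_fallback") -> list:
    """Names of candidate nodes the allocator accepted that carry the attribute."""
    return [
        decision.get("node_name")
        for decision in explain.get("node_allocation_decisions", [])
        if decision.get("node_decision") == "yes"
        and (decision.get("node_attributes") or {}).get(attr) == value
    ]


# Hypothetical, heavily trimmed response for illustration only.
sample = {
    "current_node": {"name": "warm-node-1",
                     "attributes": {"fallback_tier": "warm_fallback"}},
    "node_allocation_decisions": [
        {"node_name": "warm-node-2", "node_decision": "yes",
         "node_attributes": {"fallback_tier": "warm_fallback"}},
        {"node_name": "hot-node-1", "node_decision": "no",
         "node_attributes": {"data_tier": "hot"}},
    ],
}
print(landed_on_fallback(sample))       # True
print(eligible_fallback_nodes(sample))  # ['warm-node-2']
```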

Check disk pressure on both tiers:

Shell

curl -s -X GET "localhost:9200/_cat/allocation?v&h=shards,disk.indices,disk.used,disk.avail,disk.percent,node"

If primary warm nodes still hover above 90%, expand storage or lower cluster.routing.allocation.disk.watermark.high to 88% before the fallback tier absorbs more load.

Confirm the ISM state advanced:

Shell

curl -s -X GET "localhost:9200/_plugins/_ism/explain/<index_name>"

A healthy result shows state.name: fallback_warm and action.name: allocation with step.step_status: completed.
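
The health check can be expressed as a predicate over one index entry from the explain response. This is a sketch under assumptions: the entry is taken to expose state.name, action.name, action.failed, and a step status field, read here from step_status with a fallback to status in case your plugin version names it differently. The function name is hypothetical.

```python
def is_settled_in_fallback(entry: dict, state_name: str = "fallback_warm") -> bool:
    """True when the index sits in the fallback state with a completed allocation."""
    state = entry.get("state") or {}
    action = entry.get("action") or {}
    step = entry.get("step") or {}
    status = step.get("step_status", step.get("status"))
    return (
        state.get("name") == state_name
        and action.get("name") == "allocation"
        and not action.get("failed", False)
        and status == "completed"
    )


# Hypothetical, trimmed explain entries for illustration only.
healthy = {
    "state": {"name": "fallback_warm"},
    "action": {"name": "allocation", "failed": False},
    "step": {"name": "attempt_allocation", "step_status": "completed"},
}
stalled = {
    "state": {"name": "warm"},
    "action": {"name": "allocation", "failed": True},
    "step": {"name": "attempt_allocation", "step_status": "failed"},
}
print(is_settled_in_fallback(healthy))  # True
print(is_settled_in_fallback(stalled))  # False
```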

Common failures

Symptom	Root cause	Fix command
Index stuck in `warm`, shards `UNASSIGNED`	Warm tier over `high` watermark; strict `require` blocks placement	`PUT /<index>/_settings {"index.routing.allocation.require.data_tier": null, "index.routing.allocation.require.fallback_tier": "warm_fallback"}`
`change_policy` returns success but shards do not move	`fallback_tier` attribute missing/misspelled on target nodes	`GET _cat/nodeattrs?h=node,attr,value` then correct `opensearch.yml` and restart
Worker loops on the same index every cycle	Poll interval not longer than `job_interval`; ISM has not re-evaluated yet	Raise loop `sleep` above `300`s (for example `360`); check `GET _cluster/settings` for `plugins.index_state_management.job_interval`
`allocation_failed` persists after fallback	Fallback tier also at `flood_stage`, read-only	`PUT _cluster/settings {"transient":{"cluster.routing.allocation.disk.watermark.flood_stage":"97%"}}` then scale capacity
Follower shards stall after leader falls back	CCR follower lacks matching fallback attributes	Mirror `node.attr.fallback_tier` on the follower cluster, then restart replication

Cross-Cluster Replication routing constraints

When ISM manages leader indices in a CCR topology, fallback routing introduces replication-latency risk. Follower indices inherit routing constraints from the leader cluster, so if the leader transitions to a fallback tier the follower cluster must do one of two things: mirror the fallback node attributes in its own opensearch.yml, or explicitly override routing on the follower. The override goes through PUT _plugins/_replication/<follower_index>/_update (with the changes passed under a settings key) where the plugin accepts the change on a running follower; otherwise stop and restart replication with updated settings.

Failure to synchronise routing attributes across clusters stalls replication checkpoints, because the follower allocator cannot place shards matching the leader’s fallback topology. Always validate follower allocation health with GET _plugins/_replication/<follower_index>/_status after a phase transition.
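
A small helper can turn the status response into a single lag number for alerting. The sketch assumes the response carries a status field and, while syncing, a syncing_details object with leader_checkpoint and follower_checkpoint; confirm those field names against your replication plugin version. replication_lag is a hypothetical name.

```python
def replication_lag(status: dict):
    """Return leader_checkpoint - follower_checkpoint, or None when not syncing."""
    if status.get("status") != "SYNCING":
        return None
    details = status.get("syncing_details") or {}
    try:
        return int(details["leader_checkpoint"]) - int(details["follower_checkpoint"])
    except (KeyError, TypeError, ValueError):
        return None  # fields absent or not numeric


# Hypothetical, trimmed responses for illustration only.
syncing = {"status": "SYNCING",
           "syncing_details": {"leader_checkpoint": 19, "follower_checkpoint": 17}}
paused = {"status": "PAUSED", "reason": "example"}
print(replication_lag(syncing))  # 2
print(replication_lag(paused))   # None
```

A lag that keeps growing after the leader falls back is the signal that the follower allocator cannot place shards under the inherited routing rules.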

Related guides

Fallback Routing Strategies: the overflow-tier design, decision path, and threshold model this procedure implements.
Implementing retry logic for stuck ISM transitions: exponential-backoff _retry automation for the non-routing failure classes.
Node Role Allocation: the tier-to-role mapping the fallback attributes depend on.
