Best practices for OpenSearch index lifecycle management

This guide turns an ad-hoc, cron-driven index rotation into a deterministic Index State Management (ISM) pipeline that rolls over, retiers, and deletes indices on measurable thresholds without stalled transitions or unassigned shards.

Production log and telemetry clusters fail in predictable ways: implicit transitions fire early, oversized shards jam force_merge, allocation actions relocate nothing because node attributes never loaded, and follower indices under replication fight their leader over rollover. Every one of those failures is preventable with explicit policy structure and pre-flight verification. This procedure builds on the state-machine model in OpenSearch ISM Architecture & Fundamentals and the policy anatomy in Index Lifecycle Basics; read those first if you have not attached a policy before. Below are the exact payloads, an idempotent Python deployer, verification commands, and a failure-mode table for data platform and DevOps engineers.

Prerequisites

Confirm each item before applying any payload to a production cluster. The steps that follow assume these are already true.

Tier node attributes are declared and loaded — each data node carries an explicit node.attr.data value, configured per Node Role Allocation.
The tier hardware and capacity ratios are sized per Hot-Warm-Cold Tier Design so a full hot retention window fits before the first transition fires.
Disk watermarks are tuned for multi-tier hardware — defaults throttle high-capacity HDD cold nodes prematurely.
Your automation account has write access to _plugins/_ism/* and _index_template/*, scoped per Security & Access Boundaries.
A write alias exists on the ingest index, and the index names it in its index.plugins.index_state_management.rollover_alias setting (rollover targets aliases, not raw indices).
requests (or opensearch-py) is installed in the automation environment.

Size the rollover shard threshold against the cold-tier disk headroom so a full generation always has somewhere to land:

$$
\text{Max primary shard size (GB)} = \frac{\text{Cold node capacity (GB)} \times (\text{watermark.low} - \text{buffer})}{\text{Shards per node}}
$$
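As a worked instance of the formula above, the sketch below plugs in illustrative numbers. The function name and the sample values (an 8000 GB cold node, a 0.85 low watermark, a 0.05 buffer, 200 shards per node) are assumptions chosen for the arithmetic, not sizing recommendations. The watermark and buffer are expressed as fractions, not percentages.

```python
def max_primary_shard_size_gb(cold_node_capacity_gb: float,
                              watermark_low: float,
                              buffer: float,
                              shards_per_node: int) -> float:
    """Apply the sizing formula; watermark_low and buffer are fractions (0.85, not 85)."""
    if shards_per_node <= 0:
        raise ValueError("shards_per_node must be positive")
    if not 0 <= buffer < watermark_low <= 1:
        raise ValueError("expected 0 <= buffer < watermark_low <= 1")
    usable_gb = cold_node_capacity_gb * (watermark_low - buffer)
    return usable_gb / shards_per_node


# Hypothetical cold node: 8000 GB disk, 85% low watermark, 5% buffer, 200 shards.
print(round(max_primary_shard_size_gb(8000, 0.85, 0.05, 200), 2))
```

With these inputs the ceiling comes out at roughly 32 GB per primary shard, so a 30gb rollover threshold would leave a small margin on that hypothetical node.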

Best-practice configuration, step by step

1. Define explicit states with combined rollover conditions

Never rely on implicit transitions. Every state must declare its own actions and transitions, and the hot-state rollover must combine age, primary-shard size, and document count so a single runaway dimension cannot produce oversized shards or a swarm of tiny indices. The policy below is the production baseline — hot ingest, warm optimization, cold archive, hard delete.

HTTP

PUT _plugins/_ism/policies/log-lifecycle
{
  "policy": {
    "description": "Production log tiering with deterministic rollover",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [
          {
            "rollover": {
              "min_index_age": "1d",
              "min_primary_shard_size": "30gb",
              "min_doc_count": 50000000
            }
          }
        ],
        "transitions": [
          { "state_name": "warm", "conditions": { "min_index_age": "3d" } }
        ]
      },
      {
        "name": "warm",
        "actions": [
          { "allocation": { "require": { "data": "warm" }, "wait_for": true } },
          { "replica_count": { "number_of_replicas": 0 } },
          { "force_merge": { "max_num_segments": 1 } }
        ],
        "transitions": [
          { "state_name": "cold", "conditions": { "min_index_age": "14d" } }
        ]
      },
      {
        "name": "cold",
        "actions": [
          { "allocation": { "require": { "data": "cold" }, "wait_for": true } },
          { "read_only": {} }
        ],
        "transitions": [
          { "state_name": "delete", "conditions": { "min_index_age": "60d" } }
        ]
      },
      {
        "name": "delete",
        "actions": [ { "delete": {} } ]
      }
    ],
    "ism_template": [{ "index_patterns": ["logs-*"], "priority": 100 }]
  }
}

Expected response:

JSON

{ "_id": "log-lifecycle", "_version": 1, "_primary_term": 1, "_seq_no": 0 }

Gotcha: reduce replicas to 0 before force_merge, not after — merging while replicas exist duplicates the I/O on every copy. Order actions within a state deliberately; ISM runs them top to bottom.
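The structural rules in this step (every transition targets a defined state, no state is a dead end, replica_count precedes force_merge) can be checked before a policy ever reaches the cluster. The following is a minimal sketch of such a check, not part of ISM itself; lint_policy and action_names are hypothetical helper names, and the function expects the object found under the top-level "policy" key.

```python
from typing import Any

ACTION_OPTIONS = {"retry", "timeout"}  # per-action options, not action types


def action_names(state: dict[str, Any]) -> list[str]:
    """List the action types of a state in the order ISM would run them."""
    names: list[str] = []
    for action in state.get("actions", []):
        names.extend(key for key in action if key not in ACTION_OPTIONS)
    return names


def lint_policy(policy: dict[str, Any]) -> list[str]:
    """Return human-readable problems found in a policy body."""
    problems: list[str] = []
    states = policy.get("states", [])
    defined = {state.get("name") for state in states}

    if policy.get("default_state") not in defined:
        problems.append("default_state does not name a defined state")

    for state in states:
        name = state.get("name", "<unnamed>")
        actions = action_names(state)
        transitions = state.get("transitions", [])

        for transition in transitions:
            target = transition.get("state_name")
            if target not in defined:
                problems.append(f"{name}: transition to undefined state {target!r}")

        if not transitions and "delete" not in actions:
            problems.append(f"{name}: no transitions and no delete action (dead end)")

        if "force_merge" in actions and "replica_count" in actions:
            if actions.index("replica_count") > actions.index("force_merge"):
                problems.append(f"{name}: replica_count runs after force_merge")

    return problems


# Deliberately broken example: warm merges before dropping replicas
# and points at a cold state that is never defined.
example = {
    "default_state": "hot",
    "states": [
        {"name": "hot", "actions": [{"rollover": {"min_index_age": "1d"}}],
         "transitions": [{"state_name": "warm"}]},
        {"name": "warm",
         "actions": [{"force_merge": {"max_num_segments": 1}},
                     {"replica_count": {"number_of_replicas": 0}}],
         "transitions": [{"state_name": "cold"}]},
    ],
}
for problem in lint_policy(example):
    print(problem)
```

Running a check like this in CI, ahead of the deploy in step 4, catches ordering and dead-end mistakes without touching a cluster.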

2. Bind the policy through a versioned index template

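Attach policies at creation time with the index.plugins.index_state_management.policy_id setting inside a composable _index_template (v2). Version the template and give it a priority above any legacy template so it always wins the merge; this prevents policy detachment during rolling deployments. The OpenSearch documentation marks the policy_id setting as deprecated in favor of the policy's own ism_template block (already present in step 1), so confirm with the explain API that newly created indices on your version actually come up managed.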

HTTP

PUT _index_template/logs_lifecycle_v2
{
  "index_patterns": ["logs-*"],
  "priority": 500,
  "version": 2,
  "template": {
    "settings": {
      "index.plugins.index_state_management.policy_id": "log-lifecycle",
      "index.routing.allocation.require.data": "hot",
      "number_of_shards": 3,
      "number_of_replicas": 1
    }
  }
}

Gotcha: index.lifecycle.name is the Elasticsearch ILM key. OpenSearch ISM never reads it, so an index that carries only that setting is completely unmanaged. Retrieve a stored policy with GET _plugins/_ism/policies/log-lifecycle, and inspect an index’s live execution with GET _plugins/_ism/explain/<index>; the two APIs operate on different objects.

3. Align tier allocation with disk watermarks

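ISM allocation actions relocate shards only if nodes in the target tier sit below cluster.routing.allocation.disk.watermark.low. If every node in the tier is above the watermark, or cluster.routing.allocation.enable is restricted, the action raises no error: the shards stay where they are and, with wait_for set, the index stalls in that state. Watermark settings are cluster-wide, and the percentages below equal the OpenSearch defaults; setting them persistently makes the baseline explicit. On large cold HDD nodes, absolute byte values (used consistently for all three settings) can replace percentages so the reserved free space does not grow with disk size. For the deeper decider mechanics, see Data Tier Routing Patterns.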

HTTP

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.threshold_enabled": true,
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "95%"
  }
}

Gotcha: when a require filter matches no node with capacity, the transition halts rather than degrading gracefully. Give it a secondary target with Fallback Routing Strategies so a full tier never freezes the whole lifecycle.
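A pre-flight check along these lines can flag a full tier before a transition is due. The sketch below is illustrative only: NodeDisk and nodes_with_headroom are hypothetical names, and it assumes you have already gathered each node's tier attribute and disk usage (for example from _cat/nodeattrs and _cat/allocation) into plain Python objects.

```python
from dataclasses import dataclass


@dataclass
class NodeDisk:
    name: str
    tier: str             # value of node.attr.data on this node
    disk_used_pct: float  # 0-100


def nodes_with_headroom(nodes: list[NodeDisk], tier: str,
                        watermark_low_pct: float = 85.0) -> list[str]:
    """Names of nodes in the tier that are still below the low watermark."""
    return [
        node.name
        for node in nodes
        if node.tier == tier and node.disk_used_pct < watermark_low_pct
    ]


# Hypothetical fleet: one cold node is already past the low watermark.
fleet = [
    NodeDisk("cold-1", "cold", 91.0),
    NodeDisk("cold-2", "cold", 72.5),
    NodeDisk("warm-1", "warm", 40.0),
]
eligible = nodes_with_headroom(fleet, "cold")
if not eligible:
    print("cold tier has no node below watermark.low; transition would stall")
else:
    print("eligible cold nodes:", ", ".join(eligible))
```

An empty result is the condition the gotcha above describes: the require filter has nowhere to place shards, so the transition waits.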

4. Deploy the policy idempotently from Python

Policy deployment belongs in CI/CD, not a console. This script creates or updates the policy idempotently, resolves the 409 version conflict with if_seq_no/if_primary_term, and exits non-zero if the deployment fails. It then checks attachment on a sample index and warns if the policy is not bound. The rollover and threshold values it ships should come from configuring index size and age thresholds for rollover.

Python

import sys

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

OPENSEARCH_URL = "https://os-cluster.internal:9200"
# Placeholder credentials: inject real ones from your CI secret store.
AUTH = ("admin", "secure_password")
POLICY_ID = "log-lifecycle"
# Index used only to spot-check attachment after the deploy.
TARGET_INDEX = "logs-2024.01.01"
# Abbreviated for readability. Ship the full policy body from step 1,
# otherwise this PUT replaces it with a single-state stub.
POLICY_PAYLOAD = {
    "policy": {
        "description": "Automated deployment policy",
        "default_state": "hot",
        "states": [
            {"name": "hot", "actions": [{"rollover": {"min_index_age": "1d"}}], "transitions": []}
        ],
        "ism_template": [{"index_patterns": ["logs-*"], "priority": 100}],
    }
}

# Retry transient server-side failures with exponential backoff.
session = requests.Session()
retries = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retries))
session.verify = "/etc/ssl/certs/ca-bundle.crt"


def deploy_policy() -> None:
    url = f"{OPENSEARCH_URL}/_plugins/_ism/policies/{POLICY_ID}"
    try:
        resp = session.put(url, auth=AUTH, json=POLICY_PAYLOAD, timeout=10)
        if resp.status_code == 409:
            # Policy exists: fetch seq_no/primary_term and update in place.
            current = session.get(url, auth=AUTH, timeout=10)
            current.raise_for_status()
            existing = current.json()
            resp = session.put(
                url,
                params={
                    "if_seq_no": existing["_seq_no"],
                    "if_primary_term": existing["_primary_term"],
                },
                auth=AUTH, json=POLICY_PAYLOAD, timeout=10,
            )
        resp.raise_for_status()
        print(f"[SUCCESS] Policy '{POLICY_ID}' deployed.")
    except requests.exceptions.RequestException as exc:
        # Covers HTTP errors, exhausted retries, timeouts, and connection failures.
        print(f"[ERROR] Deployment failed: {exc}")
        sys.exit(1)

    # Verification is advisory: a missing binding warns but does not fail the job.
    try:
        explain = session.get(
            f"{OPENSEARCH_URL}/_plugins/_ism/explain/{TARGET_INDEX}", auth=AUTH, timeout=10
        )
        attached = (
            explain.status_code == 200
            and explain.json().get(TARGET_INDEX, {}).get("policy_id") == POLICY_ID
        )
    except requests.exceptions.RequestException:
        attached = False
    if attached:
        print("[SUCCESS] Policy attachment verified on target index.")
    else:
        print("[WARN] Target index not yet attached; check template priority.")


if __name__ == "__main__":
    deploy_policy()

Gotcha: an already-attached index is a success, not an error — treat a matching policy_id in the explain output as idempotent. For transitions that keep failing rather than attaching, escalate to implementing retry logic for stuck ISM transitions.

5. Isolate follower lifecycles under Cross-Cluster Replication

ISM and Cross-Cluster Replication (CCR) operate independently. Rollover must run on the leader; follower indices are write-blocked and need their own read-optimized policy with matching index_patterns — never rollover actions. Attach follower policies by template inheritance rather than replicating ISM state metadata.

HTTP

POST _plugins/_replication/_autofollow
{
  "leader_alias": "prod-cluster",
  "name": "logs-autofollow",
  "pattern": "logs-*",
  "use_roles": {
    "leader_cluster_role": "all_access",
    "follower_cluster_role": "all_access"
  }
}

Gotcha: defer cold-tier transitions on the follower while replication lag is high — retiering an index mid-catch-up risks data divergence. Watch lag with GET _plugins/_replication/follower_stats and hold transitions if it exceeds ~30 seconds.
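One way to enforce the "no rollover on followers" rule is to lint follower policies before they are deployed. The sketch below assumes an allow-list of allocation and delete for follower policies; that set, and the names follower_violations and ALLOWED_FOLLOWER_ACTIONS, are illustrative choices you would adapt to your own follower policy.

```python
from typing import Any

# Assumed allow-list for follower policies; adjust to your own needs.
ALLOWED_FOLLOWER_ACTIONS = {"allocation", "delete"}
ACTION_OPTIONS = {"retry", "timeout"}  # per-action options, not action types


def follower_violations(policy: dict[str, Any]) -> list[tuple[str, str]]:
    """Return (state, action) pairs that a follower policy should not contain."""
    violations: list[tuple[str, str]] = []
    for state in policy.get("states", []):
        state_name = state.get("name", "<unnamed>")
        for action in state.get("actions", []):
            for key in action:
                if key in ACTION_OPTIONS or key in ALLOWED_FOLLOWER_ACTIONS:
                    continue
                violations.append((state_name, key))
    return violations


# Hypothetical follower policy that wrongly copies the leader's rollover.
follower_policy = {
    "default_state": "hot",
    "states": [
        {"name": "hot", "actions": [{"rollover": {"min_index_age": "1d"}}],
         "transitions": [{"state_name": "cold"}]},
        {"name": "cold",
         "actions": [{"allocation": {"require": {"data": "cold"}}}],
         "transitions": [{"state_name": "delete"}]},
        {"name": "delete", "actions": [{"delete": {}}]},
    ],
}
print(follower_violations(follower_policy))
```

A non-empty result means the policy would attempt a write action on a write-blocked index and should be rejected before deployment.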

Verification

Run these checks after the first ISM poll cycle (allow ~5 minutes) to confirm the pipeline is live.

Confirm the managed index is executing its policy.

Shell

curl -s "localhost:9200/_plugins/_ism/explain/logs-2024.01.01"

Expected: the index reports its policy, current state, and a healthy last step. Watch step.step_status, info.message, and action.consumed_retries:

JSON

{
  "logs-2024.01.01": {
    "index.plugins.index_state_management.policy_id": "log-lifecycle",
    "state": { "name": "hot" },
    "action": { "name": "rollover", "consumed_retries": 0 },
    "step": { "name": "attempt_rollover", "step_status": "condition_not_met" }
  }
}
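For automated checks, the explain output above can be reduced to a single verdict. This is a minimal sketch, not an official client: summarize_explain is a hypothetical helper, the verdict labels are invented for illustration, and it reads the step status from either step_status or status so it tolerates differences in how the field is reported.

```python
from typing import Any

POLICY_SETTING = "index.plugins.index_state_management.policy_id"


def summarize_explain(explain: dict[str, Any], index: str) -> str:
    """Classify one index from an explain response as a short verdict string."""
    entry = explain.get(index)
    if not isinstance(entry, dict):
        return "missing"
    if not (entry.get("policy_id") or entry.get(POLICY_SETTING)):
        return "unmanaged"

    step = entry.get("step") or {}
    status = step.get("step_status") or step.get("status")
    retries = (entry.get("action") or {}).get("consumed_retries", 0)

    if status == "failed":
        return "failed"
    if retries > 0:
        return "retrying"
    return "healthy"


sample = {
    "logs-2024.01.01": {
        POLICY_SETTING: "log-lifecycle",
        "state": {"name": "hot"},
        "action": {"name": "rollover", "consumed_retries": 0},
        "step": {"name": "attempt_rollover", "step_status": "condition_not_met"},
    }
}
print(summarize_explain(sample, "logs-2024.01.01"))
```

A status of condition_not_met counts as healthy here, because it only means the rollover thresholds have not been reached yet.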

Confirm shards landed on the correct tier.

Shell

curl -s "localhost:9200/_cat/shards/logs-2024.01.01?v&h=index,shard,prirep,state,node&s=index"

Every shard should read STARTED on a node in the tier the current state requires. A shard on the wrong tier means the allocation action never completed — check watermarks and node attributes.

Common failure modes

| Symptom | Root cause | Fix |
| --- | --- | --- |
| `step.step_status` stays `failed` for >2 cycles | Transition blocked or policy setting missing | `curl -s -X POST "localhost:9200/_plugins/_ism/retry/logs-2024.01.01"` |
| Shards `UNASSIGNED` after a warm/cold transition | Target tier missing `node.attr.data` or above `watermark.high` | `curl -s -X POST "localhost:9200/_cluster/allocation/explain" -H 'Content-Type: application/json' -d '{"index":"logs-2024.01.01","shard":0,"primary":true}'` |
| High `action.consumed_retries` on `shrink` | Insufficient node capacity or shards not collocated | `curl -s "localhost:9200/_plugins/_ism/explain/logs-2024.01.01"` then free capacity or relax `shrink` |
| Policy silently detached after a template edit | New template lacks `policy_id`; re-created index unmanaged | Restore the binding in the template and re-run the idempotent deploy from step 4, then attach the unmanaged index: `curl -s -X POST "localhost:9200/_plugins/_ism/add/logs-2024.01.01" -H 'Content-Type: application/json' -d '{"policy_id":"log-lifecycle"}'` |
| Index frozen on a full tier | `require` filter matches no node with headroom | Apply a transient watermark override, then revert immediately |

To unblock a watermark-frozen index without leaving capacity degraded long-term, raise the thresholds transiently and reset them the moment shards recover:

HTTP

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "90%",
    "cluster.routing.allocation.disk.watermark.high": "95%"
  }
}
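The revert is as important as the override. The sketch below builds both request bodies so they can be issued as a pair; override_payload and reset_payload are hypothetical helper names, and sending the bodies to PUT _cluster/settings is left to whatever HTTP client you already use. Setting a key to null removes the transient value, so the persistent setting from step 3 applies again.

```python
import json

PREFIX = "cluster.routing.allocation.disk.watermark."


def override_payload(low_pct: int, high_pct: int) -> dict:
    """Transient override; arguments are whole percentages such as 90 and 95."""
    if not 0 < low_pct < high_pct < 100:
        raise ValueError("expected 0 < low_pct < high_pct < 100")
    return {
        "transient": {
            PREFIX + "low": f"{low_pct}%",
            PREFIX + "high": f"{high_pct}%",
        }
    }


def reset_payload() -> dict:
    """Null values clear the transient overrides once shards have recovered."""
    return {"transient": {PREFIX + "low": None, PREFIX + "high": None}}


print(json.dumps(override_payload(90, 95), indent=2))
print(json.dumps(reset_payload(), indent=2))
```

Generating both bodies from one place makes it harder to raise the thresholds and forget the matching reset.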

For an index stuck in the wrong state, re-drive evaluation with POST _plugins/_ism/change_policy/logs-2024.01.01 and an explicit state. Restrict who can call these endpoints: map cluster:admin/opendistro/ism/policy/* and cluster:admin/opendistro/ism/managedindex/* to automation service accounts only, and keep cluster:admin/settings/update with infrastructure operators, per Security & Access Boundaries.

Frequently asked questions

Should rollover use age, size, or document count?

Use all three together. min_index_age caps how stale an index gets, min_primary_shard_size caps shard size for query performance, and min_doc_count guards against unbounded document density. ISM rolls over when any condition is met, so combining them prevents both oversized shards and a swarm of tiny indices from a single runaway dimension.
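The "any condition" behaviour is easy to misread as "all conditions", so here is a minimal simulation of it. The function name met_rollover_conditions is hypothetical, ages are simplified to days and sizes to GB, and the default thresholds mirror the policy in step 1.

```python
def met_rollover_conditions(age_days: float,
                            largest_primary_gb: float,
                            doc_count: int,
                            *,
                            min_age_days: float = 1.0,
                            min_primary_gb: float = 30.0,
                            min_docs: int = 50_000_000) -> list[str]:
    """Return the names of the rollover conditions that are currently met."""
    met: list[str] = []
    if age_days >= min_age_days:
        met.append("min_index_age")
    if largest_primary_gb >= min_primary_gb:
        met.append("min_primary_shard_size")
    if doc_count >= min_docs:
        met.append("min_doc_count")
    return met


def should_roll_over(*args, **kwargs) -> bool:
    """Rollover fires as soon as at least one condition is met (logical OR)."""
    return bool(met_rollover_conditions(*args, **kwargs))


# A six-hour-old index whose largest primary already reached 31 GB.
print(met_rollover_conditions(0.25, 31.0, 12_000_000))
print(should_roll_over(0.25, 31.0, 12_000_000))
```

Because the conditions are ORed, a quiet index still rolls over once it reaches min_index_age, so the age threshold also sets the floor on how small a generation can be.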

Why did my allocation action complete but move no shards?

The require attribute matched no eligible node. Either the target tier’s node.attr.data value was never loaded, or every candidate node sits above watermark.low. Run _cluster/allocation/explain and inspect the deciders array — a filter decider means an attribute mismatch, a disk_threshold decider means a watermark breach.
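Reading the deciders by hand gets tedious across many nodes. The sketch below groups the blocking deciders from an allocation-explain response and maps them to the two causes named above. The helper names and the diagnosis strings are illustrative, and it assumes the response carries a node_allocation_decisions list whose entries hold a deciders array with decider and decision fields.

```python
from typing import Any


def blocking_deciders(allocation_explain: dict[str, Any]) -> dict[str, list[str]]:
    """Map each node name to the deciders that answered NO for it."""
    blocked: dict[str, list[str]] = {}
    for node in allocation_explain.get("node_allocation_decisions", []):
        names = [
            d.get("decider", "<unknown>")
            for d in node.get("deciders", [])
            if d.get("decision") == "NO"
        ]
        if names:
            blocked[node.get("node_name", "<unknown>")] = names
    return blocked


def diagnose(allocation_explain: dict[str, Any]) -> list[str]:
    """Translate blocking deciders into the likely root causes."""
    seen = {name for names in blocking_deciders(allocation_explain).values() for name in names}
    causes: list[str] = []
    if "filter" in seen:
        causes.append("attribute mismatch: check node.attr.data on the target tier")
    if "disk_threshold" in seen:
        causes.append("watermark breach: free disk or add capacity on the target tier")
    return causes


# Hypothetical, abbreviated response for illustration.
response = {
    "node_allocation_decisions": [
        {"node_name": "warm-1",
         "deciders": [{"decider": "filter", "decision": "NO"}]},
        {"node_name": "cold-1",
         "deciders": [{"decider": "disk_threshold", "decision": "NO"}]},
    ]
}
for cause in diagnose(response):
    print(cause)
```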

How do I update a live policy without a version conflict?

Read the current _seq_no and _primary_term from GET _plugins/_ism/policies/<id>, then PUT with ?if_seq_no=&if_primary_term=. A bare PUT on an existing policy returns 409; the idempotent deployer in step 4 handles this automatically. Do not assume indices that are already managed pick up the new definition on their own: apply it to them with a change_policy call and confirm the result with the explain API.

Can the same policy manage leader and follower indices under CCR?

No. The leader owns rollover; the follower is write-blocked and would error on any write action. Give followers a separate read-optimized policy — allocation and delete only — attached by template inheritance, and never replicate ISM state metadata directly between clusters.

Related

Configuring index size and age thresholds for rollover: how to pick the rollover numbers this policy ships.
Implementing retry logic for stuck ISM transitions: automated recovery when a transition keeps failing.
How to configure OpenSearch ISM hot-warm-cold architecture: the node, template, and policy setup this guide assumes.
