ISM Policy Implementation & Python Automation

OpenSearch Index State Management (ISM) turns index lifecycle — rollover, tiered allocation, force-merge, snapshot, and deletion — into a declarative state machine that an OpenSearch-native scheduler reconciles on a fixed interval. This guide is written for search and log-platform engineers, data-platform operators, and Python automation builders who need version-controlled, auditable, repeatable ISM deployments across single clusters and Cross-Cluster Replication (CCR) topologies, rather than hand-clicked policies that drift out of sync.

Policy JSON is versioned in git and validated in CI before an idempotent apply step writes it to the OpenSearch cluster — and separately to each CCR follower, since attachments do not replicate. The verify loop polls _ism/explain and re-applies through change_policy until every index reaches its expected state.

The OpenSearch Dashboards UI is fine for authoring a first policy, but production lifecycle management demands the same rigor as any other infrastructure: policies live in git, validate in CI, deploy through idempotent scripts, and are observed by state-verification loops. Manual configuration introduces drift, inconsistent rollover behavior, and uncontrolled replication states across distributed topologies. The sections below map the full automation surface — the state machine you are deploying, the storage topology it routes across, the policy schema you author, the access boundaries the automation account must respect, the Python client workflow that drives it, the CCR constraints that break naive scripts, and the APIs that verify every deploy landed. It builds directly on the execution model established in the OpenSearch ISM Architecture & Fundamentals guide.

Policy Architecture & State Machine Definition

An ISM policy is a declarative JSON document that defines a default_state, a list of states, the actions inside each state, and the conditional transitions that advance an index from one state to the next. The ISM plugin evaluates each managed index on a background job scheduler controlled by plugins.index_state_management.job_interval (default 5 minutes): it runs the current state's actions in order, then tests that state's transition conditions against live index metadata and moves on when one is satisfied. Each managed index walks from default_state through the defined phases until it reaches a terminal state. Understanding the Phase Transition Logic is critical for preventing indices from stalling in intermediate states or firing premature rollovers during ingestion spikes; production policies should declare explicit retry handling, use idempotent actions, and keep metadata updates deterministic.

Each managed index walks this loop: the scheduler tests one transition condition per interval, runs at most one action when it is met, records the outcome to the history index, and re-evaluates — while a failed action degrades into a bounded per-action retry rather than a stalled index.

The scheduler never runs two actions on the same index concurrently, and a failed action leaves the index in its current state, where it is retried according to the action's retry block. Once those retries are exhausted, the index stays failed until the cause is fixed and the action is retried through _plugins/_ism/retry. That guarantee is what makes ISM safe to automate: your deploy script asserts the desired policy, and the engine converges every managed index toward it without you orchestrating individual transitions. The canonical hot → warm → cold → delete lifecycle for time-series data looks like this:

JSON

{
  "policy": {
    "description": "Production telemetry lifecycle with tiered storage",
    "default_state": "hot",
    "ism_template": [
      { "index_patterns": ["telemetry-logs-*"], "priority": 100 }
    ],
    "states": [
      {
        "name": "hot",
        "actions": [
          {
            "rollover": {
              "min_index_age": "7d",
              "min_primary_shard_size": "40gb",
              "min_doc_count": 75000000
            }
          }
        ],
        "transitions": [
          { "state_name": "warm", "conditions": { "min_index_age": "14d" } }
        ]
      },
      {
        "name": "warm",
        "actions": [
          { "replica_count": { "number_of_replicas": 1 } },
          { "allocation": { "require": { "data": "warm" }, "wait_for": true } },
          { "force_merge": { "max_num_segments": 1 } }
        ],
        "transitions": [
          { "state_name": "cold", "conditions": { "min_index_age": "30d" } }
        ]
      },
      {
        "name": "cold",
        "actions": [
          { "allocation": { "require": { "data": "cold" }, "wait_for": true } },
          { "snapshot": { "repository": "s3-archive", "snapshot": "cold-{{ctx.index}}" } }
        ],
        "transitions": [
          { "state_name": "delete", "conditions": { "min_index_age": "90d" } }
        ]
      },
      {
        "name": "delete",
        "actions": [
          { "retry": { "count": 3, "backoff": "exponential", "delay": "1h" },
            "delete": {} }
        ],
        "transitions": []
      }
    ]
  }
}

Two rules keep an authored policy stable once automation applies it. First, set wait_for: true on every allocation action so ISM blocks until shards physically land on the target tier before a downstream force_merge runs on the wrong hardware. Second, prefer the ism_template block over the imperative _plugins/_ism/add call so newly created indices are claimed by pattern automatically and the attachment survives an index-template rebuild. Your automation still needs add for backfilling already-existing indices, but template-based attachment is what makes a fresh index inherit the policy without a follow-up script run.
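The two rules above, plus basic state-graph hygiene, can be checked mechanically before a policy is applied. The following is a minimal sketch, not an official validator: lint_policy is a hypothetical helper name, and it only checks that default_state is defined, that state names are unique, that every transition targets a defined state, and that every allocation action sets wait_for: true.

```python
from typing import Any, Dict, List


def lint_policy(doc: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means these checks passed."""
    policy = doc.get("policy", {})
    states = policy.get("states", [])
    names = [state.get("name") for state in states]
    problems: List[str] = []

    # The entry point of the state machine must be one of the declared states.
    if policy.get("default_state") not in names:
        problems.append(f"default_state {policy.get('default_state')!r} is not a defined state")
    if len(names) != len(set(names)):
        problems.append("state names are not unique")

    for state in states:
        # Every transition must point at a state that exists in this policy.
        for transition in state.get("transitions", []):
            target = transition.get("state_name")
            if target not in names:
                problems.append(
                    f"state {state.get('name')!r} transitions to undefined state {target!r}"
                )
        # Allocation actions should block until shards land on the target tier.
        for action in state.get("actions", []):
            allocation = action.get("allocation")
            if allocation is not None and allocation.get("wait_for") is not True:
                problems.append(
                    f"allocation in state {state.get('name')!r} does not set wait_for: true"
                )
    return problems
```

A CI step could load each policy file with json.load, call lint_policy on it, and fail the build when the returned list is non-empty. It complements, rather than replaces, the validation the plugin performs when the policy is written.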

Storage Topology, Node Roles & Disk Watermarks

An ISM policy can only be as reliable as the OpenSearch cluster topology beneath it, because its allocation action does nothing more than write routing filters that OpenSearch's cluster manager must satisfy. OpenSearch does not ship Elasticsearch-style data-tier node roles such as data_hot, data_warm, data_cold, or data_frozen; tiers are expressed as custom node attributes set in opensearch.yml (for example node.attr.data: hot), which the allocation action's require, include, and exclude filters then match. The Node Role Allocation model determines how primaries and replicas are placed across those tiers, and your automation must target the same attribute keys the topology actually uses, or every transition will stall waiting on an allocation that can never complete.

Tier	Node attribute	Storage profile	Routing filter	Primary workload
Hot	`node.attr.data: hot`	NVMe SSD, high IOPS	`require.data: hot`	Active ingest, rollover, real-time search
Warm	`node.attr.data: warm`	SATA SSD, moderate IOPS	`require.data: warm`	Recent history, read-mostly search
Cold	`node.attr.data: cold`	High-capacity HDD	`require.data: cold`	Compliance retention, infrequent queries
Frozen	`node.attr.data: frozen`	Object storage (searchable snapshots)	`require.data: frozen`	Archival, rare investigative access

Where each boundary sits — how long data stays hot, how many replicas each tier carries — is the concern of Hot-Warm-Cold Tier Design, while the physical shard relocation your allocation action triggers is governed by the decider chain described in Data Tier Routing Patterns. Automation that changes a routing attribute without accounting for that decider chain is the most common source of transitions that appear to succeed in the policy but never move a shard.

Disk watermarks are the single largest cause of stalled transitions, so any deploy pipeline should assert them before rolling out a policy that adds a tier. OpenSearch stops allocating new shards to a node once used disk crosses the low watermark, relocates shards away at the high watermark, and applies a read-only-allow-delete block at the flood stage. A relocation into a target node is admitted only while:

$$\text{disk}_{\text{used}} + \text{shard}_{\text{size}} \le \text{watermark}_{\text{high}} \times \text{capacity}_{\text{node}}$$

The single-tier defaults (85% / 90% / 95%) are usually too aggressive during a migration wave, so reserve headroom cluster-wide before large ISM rollouts:

HTTP

PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "82%",
    "cluster.routing.allocation.disk.watermark.high": "88%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "93%",
    "cluster.routing.allocation.disk.threshold_enabled": true
  }
}
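The admission inequality above is easy to assert in a pre-flight step before a migration wave. This is a sketch under stated assumptions: the function names are hypothetical, all sizes are in the same unit (bytes here), and the watermark is passed as a fraction such as 0.88. It models only the high-watermark check, not the rest of the allocation decider chain.

```python
def headroom_bytes(disk_used: float, node_capacity: float, high_watermark: float = 0.90) -> float:
    """Bytes a node can still accept before crossing the high watermark."""
    return high_watermark * node_capacity - disk_used


def can_admit_shard(
    disk_used: float,
    shard_size: float,
    node_capacity: float,
    high_watermark: float = 0.90,
) -> bool:
    """True when disk_used + shard_size <= watermark_high * capacity_node."""
    return disk_used + shard_size <= high_watermark * node_capacity
```

A pipeline could feed these from per-node disk figures (for example from _cat/allocation) and refuse to roll out a tier-adding policy when the largest shard due to move would not fit on any target node.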

When a target tier genuinely runs out of capacity, the automation should degrade rather than hang. The Fallback Routing Strategies model defines how transitions and CCR sync cycles spill to an adjacent tier or hold ingest instead of triggering an uncontrolled shard storm.

Rollover Mechanics & Threshold Calibration

Rollover mechanics dictate when a write alias switches to a new backing index. Misaligned thresholds cause hot-warm imbalance, excessive shard fragmentation, or unbounded storage consumption on primary nodes. Proper Rollover Trigger Configuration balances min_index_age, min_primary_shard_size, and min_doc_count against ingestion velocity and storage IOPS, and a rollover fires as soon as any one of those bounds is reached, so shard sizing stays bounded regardless of ingest spikes.

Engineers must apply Threshold Tuning Strategies to align policy triggers with cluster capacity, ensuring that force_merge and read-only transitions execute without saturating the force_merge thread pool or the background merge scheduler. When ingestion exhibits diurnal variance or seasonal spikes, static thresholds should be replaced with sizing derived from historical cluster metrics. A useful target for the rollover age bound, given a desired primary shard size $S_\text{target}$ and an observed ingest rate $R$ (bytes/day) spread across $N_p$ primaries, is the age at which each primary shard would reach that size:

$$t_\text{roll} = \frac{S_\text{target} \times N_p}{R}$$

Setting min_index_age at or slightly below $t_\text{roll}$ and min_primary_shard_size at $S_\text{target}$ gives the age bound priority under steady load while the size bound still caps a burst. This is the calculation your automation should recompute from _cat/indices metrics rather than hard-coding, so thresholds track real ingest instead of a guess made at design time.
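As a worked sketch of that calculation: the helper names and the safety factor are assumptions for illustration, and the ingest rate is expected to come from your own metrics rather than from anything this code measures.

```python
import math


def rollover_age_days(target_shard_bytes: float, primaries: int, ingest_bytes_per_day: float) -> float:
    """t_roll = S_target * N_p / R, in days."""
    if ingest_bytes_per_day <= 0:
        raise ValueError("ingest rate must be positive")
    return target_shard_bytes * primaries / ingest_bytes_per_day


def min_index_age_setting(t_roll_days: float, safety: float = 0.9) -> str:
    """Render an ISM age string at or slightly below t_roll (safety <= 1.0)."""
    scaled = t_roll_days * safety
    if scaled >= 1:
        return f"{math.floor(scaled)}d"
    # Below one day, fall back to whole hours, never less than one hour.
    return f"{max(1, math.floor(scaled * 24))}h"
```

With a 40 GiB target shard, 3 primaries, and 20 GiB/day of ingest, rollover_age_days returns 6.0 days, and the default safety factor of 0.9 renders the age bound as "5d", which keeps the age bound slightly ahead of the size bound under steady load.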

Python Automation & CI/CD Integration

At scale, ISM is managed as code. The opensearch-py client can reach every _plugins/_ism endpoint, and the functions below call them through the client's generic transport.perform_request so each REST path stays explicit, with structured error handling around every call. They are the backbone of a CI/CD job: upsert the policy, attach it by pattern, and read explain so the job can assert that every index reached the expected managed state.

The deploy job is designed to be re-runnable end to end: an update to an existing policy must carry that policy's current _seq_no and _primary_term, attaching an already-managed index is reported back as a per-index failure instead of re-attaching it, and the job only finishes once explain confirms each index reached its expected state.

Python

import json
import logging
import os
from typing import Any, Dict

from opensearchpy import ConnectionError, NotFoundError, OpenSearch, TransportError

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger(__name__)


def upsert_ism_policy(client: OpenSearch, policy_id: str, policy: Dict[str, Any]) -> Dict[str, Any]:
    """Create the policy, or update it in place if it already exists.

    A bare PUT against an existing policy_id is rejected with a version conflict,
    so an update sends the stored _seq_no and _primary_term back to the plugin."""
    url = f"/_plugins/_ism/policies/{policy_id}"
    params: Dict[str, Any] = {}
    try:
        current = client.transport.perform_request(method="GET", url=url)
        params = {"if_seq_no": current["_seq_no"], "if_primary_term": current["_primary_term"]}
    except NotFoundError:
        logger.info("Policy '%s' does not exist yet; creating it", policy_id)
    try:
        return client.transport.perform_request(method="PUT", url=url, params=params, body=policy)
    except TransportError as e:
        logger.error("Policy upsert failed for '%s': %s", policy_id, e.info)
        raise


def attach_ism_policy(client: OpenSearch, index_pattern: str, policy_id: str) -> Dict[str, Any]:
    """Attach a policy to matching indices. Indices that already carry a policy are
    not re-attached; the plugin lists them under failed_indices in the response."""
    try:
        response = client.transport.perform_request(
            method="POST",
            url=f"/_plugins/_ism/add/{index_pattern}",
            body={"policy_id": policy_id},
        )
        if response.get("failures"):
            logger.warning("Attach skipped some indices: %s", response.get("failed_indices"))
        else:
            logger.info("Policy '%s' attached to pattern '%s'", policy_id, index_pattern)
        return response
    except ConnectionError as e:
        # ConnectionError subclasses TransportError, so it must be caught first.
        logger.error("Cluster connection failed during attach: %s", e)
        raise
    except TransportError as e:
        logger.error("ISM transport error attaching '%s': %s", index_pattern, e.info)
        raise


def verify_ism_state(client: OpenSearch, index_name: str) -> Dict[str, Any]:
    """Read current ISM metadata for one index via the explain API for post-deploy assertion."""
    try:
        return client.transport.perform_request(
            method="GET", url=f"/_plugins/_ism/explain/{index_name}"
        )
    except NotFoundError:
        logger.warning("Index '%s' not found in cluster.", index_name)
        return {}


if __name__ == "__main__":
    client = OpenSearch(
        hosts=[{"host": "opensearch-master.internal", "port": 9200}],
        # The password is injected at runtime; the variable name is illustrative.
        http_auth=("automation_svc", os.environ["OPENSEARCH_PASSWORD"]),
        use_ssl=True,
        verify_certs=True,
        maxsize=10,
        retry_on_timeout=True,
        max_retries=3,
    )
    # Illustrative wiring: the file name and policy id below are placeholders.
    with open("telemetry-lifecycle.json", encoding="utf-8") as fh:
        policy_doc = json.load(fh)
    upsert_ism_policy(client, "telemetry-lifecycle", policy_doc)
    attach_ism_policy(client, "telemetry-logs-*", "telemetry-lifecycle")

Version-control the policy JSON alongside your Terraform or Kubernetes manifests and validate it against the OpenSearch schema before deployment, so a malformed state graph or an invalid cron expression never reaches production. The pipeline can run on every merge only if each step is safe to repeat: an update to an existing policy must send that policy's current _seq_no and _primary_term (a bare PUT against an existing policy_id is rejected with a version conflict), in which case re-applying unchanged JSON simply rewrites the same definition, and attaching a policy to an already-managed index is reported under failed_indices rather than re-attached.
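The verification half of the deploy job can be sketched as a bounded polling loop. This is an illustrative sketch, not part of opensearch-py: wait_for_state is a hypothetical helper, the explain argument is any callable that returns the explain response for one index (for example a lambda around verify_ism_state), and the response is assumed to be keyed by index name with state.name and action.failed fields.

```python
import time
from typing import Any, Callable, Dict, Sequence


def wait_for_state(
    explain: Callable[[str], Dict[str, Any]],
    indices: Sequence[str],
    expected_state: str,
    timeout_s: float = 900.0,
    interval_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Dict[str, str]:
    """Poll until every index reports expected_state.

    Returns the indices still lagging when the timeout expires, mapped to the
    state they reported (an empty dict means the deploy converged)."""
    deadline = clock() + timeout_s
    while True:
        pending: Dict[str, str] = {}
        for index in indices:
            meta = explain(index).get(index) or {}
            if (meta.get("action") or {}).get("failed"):
                # A failed action will not fix itself by waiting; surface it now.
                raise RuntimeError(f"ISM action failed on {index}: {meta.get('info')}")
            state = (meta.get("state") or {}).get("name")
            if state != expected_state:
                pending[index] = state or "unmanaged"
        if not pending or clock() >= deadline:
            return pending
        sleep(interval_s)
```

Injecting sleep and clock keeps the loop testable without a cluster. In a pipeline, a non-empty return value would fail the job and print the lagging indices.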

For clusters managing thousands of indices, synchronous HTTP calls become the bottleneck — a serial attach-then-poll across ten thousand indices can take longer than the job_interval itself. Implementing Async Execution Patterns with AsyncOpenSearch and asyncio.gather lets policy validation, attachment, and verification run concurrently without blocking the main thread. When you outgrow standalone scripts, adopting established Python Orchestration Frameworks — Airflow, Prefect, or a Kubernetes CronJob — adds scheduled execution, dependency tracking, and native CI/CD integration. The official OpenSearch ISM documentation specifies each endpoint that these modules should abstract into reusable, testable functions.
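A minimal sketch of the concurrent pattern, using only the standard library: explain_many is a hypothetical helper, and explain_one is an assumed coroutine that returns the explain response for a single index (with AsyncOpenSearch it would typically await a GET on /_plugins/_ism/explain/<index>). A semaphore caps in-flight requests so the fan-out does not overwhelm the cluster.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Iterable


async def explain_many(
    explain_one: Callable[[str], Awaitable[Dict[str, Any]]],
    indices: Iterable[str],
    max_in_flight: int = 20,
) -> Dict[str, Any]:
    """Run explain_one for every index with bounded concurrency.

    Returns a mapping of index name to its result; a call that raised is
    represented by the exception object instead of aborting the whole batch."""
    names = list(indices)
    semaphore = asyncio.Semaphore(max_in_flight)

    async def guarded(index: str) -> Dict[str, Any]:
        # The semaphore limits how many requests are outstanding at once.
        async with semaphore:
            return await explain_one(index)

    results = await asyncio.gather(*(guarded(name) for name in names), return_exceptions=True)
    return dict(zip(names, results))
```

Using return_exceptions=True is a deliberate choice here: one unreachable index should not hide the status of the other several thousand, and the caller can separate exceptions from responses afterwards.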

Security & Access Control for the Automation Account

Because _plugins/_ism/* actions mutate cluster state and index metadata, the automation account that drives your pipeline must be scoped with fine-grained access control (FGAC) and nothing more. The Security & Access Boundaries model governs which service accounts may write policies, trigger transitions, or read execution history. The governing principle: only platform automation roles get write access to lifecycle policies, while application teams receive read-only visibility through explain.

JSON

{
  "cluster_permissions": [
    "cluster:admin/opendistro/ism/policy/write",
    "cluster:admin/opendistro/ism/policy/get",
    "cluster:admin/opendistro/ism/managedindex/add",
    "cluster:admin/opendistro/ism/managedindex/explain"
  ],
  "index_permissions": [
    {
      "index_patterns": ["telemetry-logs-*", "audit-events-*", ".opendistro-ism-*"],
      "allowed_actions": ["indices:admin/settings/update", "indices:monitor/settings/get"]
    }
  ]
}

Grant application-facing roles only policy/get and managedindex/explain, never policy/write or managedindex/retry, so a consumer can observe why an index is stuck without being able to force a transition. Every action ISM performs is recorded in the .opendistro-ism-managed-index-history-* indices, which double as your audit trail — restrict read access to them, since they expose snapshot repository names, tier topology, and retention windows that are useful reconnaissance for an attacker. Keep the automation account’s credentials in a secrets manager, inject them at runtime, and never embed them in policy JSON or commit them to the repository that holds your policies.

Resilience & Error Management

Network partitions, cluster-manager elections, and transient API rate limits can interrupt ISM operations mid-execution. A resilient automation layer implements exponential backoff, bounded retries, and idempotent state reconciliation so a partial failure never leaves indices half-managed. Comprehensive Error Handling & Retries ensures that failed attachments or stuck transitions are retried without duplicate operations or metadata corruption, distinguishing a transient error worth retrying from a permanent one that should fail the deploy.

Before any remediation, query _plugins/_ism/explain/<index> to read the current state, action, and failure reason — never blindly re-apply. Pair the explain output with structured logging and cluster-health metrics so every automated retry leaves an audit trail that accelerates root-cause analysis. The retry contract inside the policy itself (the per-action retry block with count, backoff, and delay) handles in-cluster transient failures, while your Python layer handles the failures that happen before an action is ever scheduled — connection errors, auth failures, and schema-validation rejections.
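The Python-side half of that contract can be sketched as a small retry wrapper. The names below are hypothetical, and the classification is an assumption: it treats HTTP 429, 502, 503, and 504 as transient, along with the "N/A" status that opensearch-py's connection-level errors are assumed to carry, and treats everything else (auth failures, validation rejections) as permanent.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")
TRANSIENT_STATUS = {429, 502, 503, 504}


def status_is_transient(exc: BaseException) -> bool:
    """Classify by the status_code attribute, if the exception has one."""
    status = getattr(exc, "status_code", None)
    return status in TRANSIENT_STATUS or status == "N/A"


def call_with_backoff(
    operation: Callable[[], T],
    is_transient: Callable[[BaseException], bool] = status_is_transient,
    max_attempts: int = 5,
    base_delay_s: float = 1.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry transient failures with capped exponential backoff plus jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 0
    while True:
        attempt += 1
        try:
            return operation()
        except Exception as exc:
            # Permanent errors and the final attempt propagate to fail the deploy.
            if not is_transient(exc) or attempt >= max_attempts:
                raise
            delay = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))
```

Because the wrapped operations (policy upsert, attach, explain) are safe to repeat, retrying them does not risk duplicate side effects. A call might look like call_with_backoff(lambda: attach_ism_policy(client, pattern, policy_id)).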

Cross-Cluster Replication Integration

ISM and Cross-Cluster Replication (CCR) intersect in ways that break naive automation. Policy attachments do not propagate over replication: a policy attached on the leader has no effect on the follower, because replication carries index data and settings, not ISM metadata. Your deploy script must therefore target both clusters explicitly, applying a follower-appropriate policy independently on the follower.

Follower indices are read-only while replication is active, so any action that mutates the write index — chiefly rollover — conflicts with the replication engine and can produce split-brain state where leader and follower disagree on which index is current. Design follower policies to prioritize delete and snapshot for local retention and archival, and omit rollover entirely while replication runs. Automation that promotes a follower during failover must detach the replication relationship first, then let the follower’s own ISM policy resume rollover — doing it in the wrong order is the classic cause of a follower that silently stops advancing. Follower policies should also use allocation filters compatible with the follower cluster’s own topology, which may differ from the leader’s; the Node Role Allocation guide covers overriding inherited allocation tags when the two clusters use different tier layouts.
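One way to keep leader and follower policies from drifting is to derive the follower policy from the leader's JSON instead of maintaining it by hand. This is a sketch under that assumption: follower_policy is a hypothetical helper that only strips the named actions (rollover by default) and leaves every other state, action, and transition untouched, so any tier-attribute differences between the clusters still need their own handling.

```python
import copy
from typing import Any, Dict, Sequence


def follower_policy(
    leader_doc: Dict[str, Any],
    drop_actions: Sequence[str] = ("rollover",),
) -> Dict[str, Any]:
    """Return a copy of the leader policy without actions that conflict with replication."""
    doc = copy.deepcopy(leader_doc)  # never mutate the leader's definition
    for state in doc.get("policy", {}).get("states", []):
        # An action object may also carry a sibling "retry" block, so test for
        # the action's key instead of comparing whole objects.
        state["actions"] = [
            action
            for action in state.get("actions", [])
            if not any(name in action for name in drop_actions)
        ]
    return doc
```

The deploy job would then apply the leader document to the leader cluster and the derived document to the follower cluster as two separate upserts.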

Operational Safety & Rollback

A policy misconfiguration can cascade across thousands of indices, causing premature deletions, unbounded storage growth, or replication desynchronization — so treat every rollout as reversible. Version-controlled policy repositories combined with dry-run validation and automated state snapshots let engineers safely detach a faulty policy, revert indices to a known state, and reapply a corrected configuration without manual per-index intervention. Before deploying an update, the pipeline should validate JSON syntax, simulate transitions against a staging cluster, and verify CCR follower compatibility.

Rollback in ISM is not a single API call — it is a discipline. To retire a bad policy safely: attach the previous known-good policy_id with change_policy so in-flight indices finish their current state before switching; snapshot any index whose next action is delete before it fires; and confirm through explain that no index is mid-transition when you detach. Because a delete action is irreversible, gate any policy that contains one behind an explicit approval step in CI, and keep the retention window generous enough that a snapshot restore is always possible.
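The rollback steps above can be sketched as a planning function that works on an explain response before anything is changed. The helper name is hypothetical, and two assumptions are baked in: the explain response is keyed by index name with state.name and action.name fields, and the policy's terminal state is literally named "delete" as in the example policy in this guide.

```python
from typing import Any, Dict, List


def rollback_plan(
    explain: Dict[str, Any],
    good_policy_id: str,
    delete_state: str = "delete",
) -> Dict[str, Any]:
    """Work out which indices to snapshot first and the change_policy body to send."""
    indices: List[str] = []
    snapshot_first: List[str] = []
    for index, meta in explain.items():
        if not isinstance(meta, dict):
            continue  # skip summary fields such as total_managed_indices
        indices.append(index)
        state = (meta.get("state") or {}).get("name")
        action = (meta.get("action") or {}).get("name")
        # Anything already in, or executing, the delete step is at risk.
        if state == delete_state or action == "delete":
            snapshot_first.append(index)
    return {
        "indices": sorted(indices),
        "snapshot_first": sorted(snapshot_first),
        # Body for POST _plugins/_ism/change_policy/<index-pattern>
        "change_policy_body": {"policy_id": good_policy_id},
    }
```

Keeping the plan separate from its execution lets the approval step in CI show a reviewer exactly which indices would be snapshotted and re-pointed before any request is sent.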

Operational Monitoring & Alerting

Observing an ISM deployment comes down to three API families. Use _cat/indices for a fast inventory of index sizes, doc counts, and health; use _plugins/_ism/explain to see the exact state, action, and any failure reason for a managed index; and use _cluster/allocation/explain to diagnose why a shard will not move during a tier transition.

Shell

# Fast inventory: size, docs, health per index
curl -s "https://<cluster>:9200/_cat/indices/telemetry-logs-*?v&h=index,health,docs.count,store.size&s=index"

# Why is this index stuck? — state, action, and failure reason
curl -s "https://<cluster>:9200/_plugins/_ism/explain/telemetry-logs-000012?pretty"

# Why can't this shard allocate to the target tier?
curl -s -X POST "https://<cluster>:9200/_cluster/allocation/explain" \
  -H "Content-Type: application/json" \
  -d '{"index": "telemetry-logs-000012", "shard": 0, "primary": true}'

Wire the explain output into alerting. The signal that matters is a managed index whose action reports failed: true, or whose info.message has not changed across two evaluation cycles; either one points to a transition stuck on watermarks, a missing snapshot repository, or a routing mismatch. Alert on the count of indices in a failed action state rather than on individual transitions, so a single transient retry does not page anyone, and cross-reference stuck transitions against disk-watermark and unassigned-shard metrics to separate a capacity problem from a policy problem.
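A sketch of how two consecutive explain snapshots might be reduced to those alert signals. The function name is hypothetical, and the response shape (entries keyed by index name, with action.failed and info.message) is an assumption about the explain output.

```python
from typing import Any, Dict, List


def summarize_explain(now: Dict[str, Any], previous: Dict[str, Any]) -> Dict[str, List[str]]:
    """Split managed indices into failed ones and ones whose message has not moved."""
    failed: List[str] = []
    stuck: List[str] = []
    for index, meta in now.items():
        if not isinstance(meta, dict):
            continue  # skip summary fields such as total_managed_indices
        if (meta.get("action") or {}).get("failed") is True:
            failed.append(index)
            continue
        message = (meta.get("info") or {}).get("message")
        prev_meta = previous.get(index)
        prev_message = (prev_meta.get("info") or {}).get("message") if isinstance(prev_meta, dict) else None
        # Same non-empty message in two consecutive cycles: worth a closer look.
        if message and message == prev_message:
            stuck.append(index)
    return {"failed": sorted(failed), "stuck": sorted(stuck)}
```

An alert rule would page on len(result["failed"]) crossing a threshold. The stuck list is better treated as a prompt to inspect than as a page, since an index that is legitimately waiting on an age condition can also repeat the same message from one cycle to the next.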

Frequently Asked Questions

Should my pipeline attach policies with ism_template or the _ism/add API?

Prefer the ism_template block inside the policy so new indices are claimed automatically by pattern and the attachment survives index recreation. Use _plugins/_ism/add only to backfill already-existing indices; those manual attachments must be reapplied after an index-template rebuild, which is why template-based attachment is safer for continuous deployment.

Is it safe to run my deploy script on every merge?

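Yes, if every step is safe to repeat. Updating an existing policy with PUT _plugins/_ism/policies/{id} requires the policy's current if_seq_no and if_primary_term; with them, re-applying unchanged JSON simply rewrites the same definition, while a bare PUT against an existing id is rejected with a version conflict. Attaching a policy to an already-managed index is reported under failed_indices rather than re-attached. Validate the JSON against the schema in CI first so a malformed state graph never reaches the OpenSearch cluster.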

Why do my policies not appear on the CCR follower cluster?

Replication carries index data and settings but not ISM policy attachments. Attach a follower-specific policy directly on the follower, and omit rollover on follower indices while replication is active to avoid split-brain state between leader and follower.

How do I roll back a policy that is deleting indices too early?

Use change_policy to attach the previous known-good policy_id so in-flight indices finish their current state before switching, snapshot any index whose next action is delete, and confirm through _plugins/_ism/explain that nothing is mid-transition before detaching. Because delete is irreversible, gate policies containing it behind an approval step in CI.

Related Guides

Phase Transition Logic: how the state machine evaluates conditions and commits transitions without stalling.
Rollover Trigger Configuration: tuning min_index_age, min_primary_shard_size, and min_doc_count for the write alias.
Threshold Tuning Strategies: deriving age and size thresholds from real ingest metrics instead of guesses.
Async Execution Patterns: concurrent policy attach and verification across thousands of indices with AsyncOpenSearch.
Python Orchestration Frameworks: scheduled, dependency-aware ISM automation in Airflow, Prefect, and Kubernetes.
Error Handling & Retries: backoff, circuit breakers, and idempotent reconciliation for interrupted operations.

