How to configure OpenSearch ISM hot warm cold architecture

Deploying a tiered storage model requires deterministic alignment between hardware capabilities, shard routing rules, and Index State Management (ISM) policy execution. Misconfigured node attributes, malformed phase thresholds, or silent policy detachment directly cause unassigned shards, uncontrolled rollovers, and degraded query latency. This guide delivers exact configuration payloads, routing enforcement patterns, and Python automation workflows to implement a production-grade lifecycle. Understanding the underlying OpenSearch ISM Architecture & Fundamentals is mandatory before applying these configurations to production clusters.

1. Node Role Allocation & Tier Routing Configuration

OpenSearch routes indices to specific hardware tiers using node attributes. ISM relies on index.routing.allocation.require.data to enforce placement during phase transitions. Configure opensearch.yml on each node group with the data role and a custom tier attribute. OpenSearch has no built-in data_hot, data_warm, or data_cold roles (those are Elasticsearch data tiers), so the custom attribute is what guarantees deterministic routing across cluster upgrades.

Hot Tier (High IOPS, NVMe, Ingest/Query)

YAML
node.name: "hot-01"
node.roles: ["data", "ingest"]
node.attr.data: "hot"
path.data: /mnt/nvme/opensearch

Warm Tier (Balanced I/O, SSD, Search/Retention)

YAML
node.name: "warm-01"
node.roles: ["data"]
node.attr.data: "warm"
path.data: /mnt/ssd/opensearch

Cold Tier (High Capacity, HDD, Archive)

YAML
node.name: "cold-01"
node.roles: ["data"]
node.attr.data: "cold"
path.data: /mnt/hdd/opensearch

Restart nodes sequentially after applying changes. Verify attribute propagation before creating indices:

Shell
curl -s -X GET "localhost:9200/_cat/nodeattrs?v&h=node,attr,value"
curl -s -X GET "localhost:9200/_cat/nodes?v&h=name,node.role"

If a node has no row with attr set to data in the first output, routing enforcement will fail during ISM transitions. Cross-reference your hardware layout with established Hot-Warm-Cold Tier Design patterns to ensure capacity ratios match expected ingestion velocity and retention windows.
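The attribute check can also be scripted. The following is a minimal sketch, assuming the rows come from GET _cat/nodeattrs?format=json, where each row carries node, attr, and value keys. The function name nodes_missing_tier and the sample rows are hypothetical and only illustrate the comparison.

```python
from typing import Dict, Iterable, List, Set

# Tier values used in this guide's opensearch.yml examples.
VALID_TIERS = {"hot", "warm", "cold"}


def nodes_missing_tier(rows: Iterable[Dict[str, str]],
                       all_nodes: Iterable[str],
                       attr_name: str = "data") -> List[str]:
    """Return the names of nodes that lack a valid tier attribute."""
    tagged: Set[str] = set()
    for row in rows:
        if row.get("attr") == attr_name and row.get("value") in VALID_TIERS:
            tagged.add(row["node"])
    return sorted(set(all_nodes) - tagged)


# Illustrative rows shaped like the JSON form of _cat/nodeattrs.
sample_rows = [
    {"node": "hot-01", "attr": "data", "value": "hot"},
    {"node": "warm-01", "attr": "data", "value": "warm"},
    {"node": "cold-01", "attr": "zone", "value": "a"},  # tier attribute missing
]
missing = nodes_missing_tier(sample_rows, ["hot-01", "warm-01", "cold-01"])
print(missing)
```

Any node name the function returns should be fixed in opensearch.yml and restarted before indices are created.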

2. Index Template Versioning & Routing Enforcement

Templates must declare the initial routing allocation and the rollover alias at index creation. Use the _index_template API (composable templates) to enforce versioning, prevent legacy template collisions, and guarantee deterministic shard placement. Policy attachment does not belong in the template: the policy_id index setting has been deprecated since Open Distro 1.13 and no longer attaches a policy, so attach it with an ism_template block in the policy or with the _plugins/_ism/add API.

JSON
PUT _index_template/logs_tiered_v2
{
  "index_patterns": ["logs-app-*"],
  "priority": 500,
  "template": {
    "settings": {
      "index.plugins.index_state_management.rollover_alias": "logs-app",
      "index.routing.allocation.require.data": "hot",
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "refresh_interval": "5s",
      "translog.flush_threshold_size": "512mb"
    },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "message": { "type": "text" },
        "trace_id": { "type": "keyword" }
      }
    }
  }
}

Operational constraints:

  • index.plugins.index_state_management.rollover_alias names the write alias that the ISM rollover action targets. Without it, the rollover step fails.
  • index.plugins.index_state_management.policy_id is intentionally absent: setting it in a template does not attach a policy. (index.lifecycle.name is the Elasticsearch ILM setting and is ignored by OpenSearch.)
  • index.routing.allocation.require.data forces initial shard placement exclusively on hot nodes.
  • priority resolves conflicts between overlapping composable templates (the highest value wins). Composable templates already take precedence over legacy _template entries, which use order instead.
  • Version suffixes (_v2) enable safe rollbacks without disrupting active indices or triggering unintended rollovers.
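ISM's rollover action works against a write alias, so the first index in a series is usually bootstrapped by hand with that alias marked as the write index. The sketch below builds such a request. The alias name logs-app and the helper name bootstrap_request are assumptions for illustration. The zero-padded numeric suffix matters because the default rollover naming scheme expects an index name ending in a dash followed by a number.

```python
import json
from typing import Dict, Tuple


def bootstrap_request(alias: str, sequence: int = 1) -> Tuple[str, Dict]:
    """Build the index name and body for the first index behind a write alias."""
    index_name = f"{alias}-{sequence:06d}"
    body = {"aliases": {alias: {"is_write_index": True}}}
    return index_name, body


# "logs-app" is a hypothetical alias; the resulting name matches logs-app-*.
name, body = bootstrap_request("logs-app")
print(f"PUT {name}")
print(json.dumps(body, indent=2))
```

Sending the printed request once per series creates an index that the template matches and that later rollovers can increment.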

3. ISM Policy Definition & Phase Thresholds

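The policy defines deterministic state transitions, rollover triggers, and routing reallocations. ISM evaluates conditions every 5 minutes by default (plugins.index_state_management.job_interval), so a transition can fire up to one interval after its threshold is crossed. A policy may also carry an ism_template block, which attaches it automatically to newly created indices whose names match its index_patterns; existing indices are not affected. Configure thresholds to align with storage capacity and query SLAs.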

JSON
PUT _plugins/_ism/policies/hot_warm_cold_policy
{
  "policy": {
    "description": "Tiered lifecycle for application logs",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [
          {
            "rollover": {
              "min_index_age": "1d",
              "min_primary_shard_size": "50gb"
            }
          }
        ],
        "transitions": [
          {
            "state_name": "warm",
            "conditions": { "min_index_age": "30d" }
          }
        ]
      },
      {
        "name": "warm",
        "actions": [
          {
            "allocation": {
              "require": { "data": "warm" }
            }
          },
          {
            "replica_count": { "number_of_replicas": 0 }
          },
          {
            "force_merge": { "max_num_segments": 1 }
          }
        ],
        "transitions": [
          {
            "state_name": "cold",
            "conditions": { "min_index_age": "60d" }
          }
        ]
      },
      {
        "name": "cold",
        "actions": [
          {
            "allocation": {
              "require": { "data": "cold" }
            }
          },
          {
            "read_only": {}
          }
        ],
        "transitions": [
          {
            "state_name": "delete",
            "conditions": { "min_index_age": "90d" }
          }
        ]
      },
      {
        "name": "delete",
        "actions": [
          { "delete": {} }
        ]
      }
    ],
    "ism_template": [
      { "index_patterns": ["logs-app-*"], "priority": 100 }
    ]
  }
}

Threshold tuning guidelines:

  • Hot → Warm: Trigger at 30d to move aging indices off the hot tier before NVMe capacity saturates.
  • Warm → Cold: Trigger at 60d after force_merge completes. Segment compaction reduces query overhead and disk footprint.
  • Cold → Delete: Trigger at 90d for compliance-driven retention. The read_only action prevents accidental writes during archival.
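  • Adjust min_primary_shard_size based on actual shard density. Rollover fires as soon as any one of its conditions is met, so the size limit caps shard growth between daily rollovers. Oversized shards slow relocation during warm/cold transitions; undersized shards increase cluster overhead.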

4. Python Automation & Bulk Policy Attachment

Production environments require idempotent policy attachment across existing indices, because template and policy changes apply only to indices created afterwards. The following script uses opensearch-py to attach the policy to every matching index, skip indices that are already managed, and log failures without interrupting ingestion pipelines.

Python
import logging
import os

from opensearchpy import OpenSearch, exceptions

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger(__name__)


def attach_ism_policy(client, index_pattern, policy_id):
    """Attach an ISM policy to every index matching index_pattern.

    Safe to re-run: indices that already carry a policy are logged and
    skipped. Returns the number of newly attached indices.
    """
    attached = 0
    try:
        indices = client.cat.indices(index=index_pattern, h="index", format="json")
    except exceptions.ConnectionError as e:
        logger.critical("Cluster connection failed: %s", e)
        return attached
    except exceptions.NotFoundError:
        logger.warning("No indices match %s", index_pattern)
        return attached

    for idx in indices:
        idx_name = idx["index"]
        try:
            # OpenSearch ISM attaches via the add API, not an index setting.
            resp = client.transport.perform_request(
                "POST",
                f"/_plugins/_ism/add/{idx_name}",
                body={"policy_id": policy_id},
            )
        except exceptions.TransportError as e:
            logger.error("Failed to attach policy to %s: %s", idx_name, e)
            continue
        # The add API reports per-index problems in the response body
        # (failed_indices) instead of raising an HTTP error.
        failed = resp.get("failed_indices", [])
        if not failed:
            attached += 1
            continue
        for failure in failed:
            reason = failure.get("reason", "")
            if "already" in reason.lower():
                logger.info("Policy already attached to %s", idx_name)
            else:
                logger.error("Failed to attach policy to %s: %s", idx_name, reason)
    logger.info("Attached policy to %d indices.", attached)
    return attached


if __name__ == "__main__":
    client = OpenSearch(
        hosts=["https://opensearch-cluster.internal:9200"],
        # Credentials come from the environment; the variable names are placeholders.
        http_auth=(os.environ["OPENSEARCH_USER"], os.environ["OPENSEARCH_PASSWORD"]),
        verify_certs=True,
        timeout=30,
    )
    attach_ism_policy(client, "logs-app-*", "hot_warm_cold_policy")

Reference the official OpenSearch Python Client documentation for authentication methods and connection pooling configurations. Schedule this script via cron or Kubernetes CronJob to enforce policy compliance after template updates or cluster migrations.
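A scheduled run can also report drift, meaning indices that have lost their policy or carry a different one. The sketch below assumes the input is the parsed response of GET _plugins/_ism/explain/logs-app-*, keyed by index name, with the attached policy under index.plugins.index_state_management.policy_id. The function name find_policy_drift and the sample response are hypothetical.

```python
from typing import Any, Dict, List

POLICY_KEY = "index.plugins.index_state_management.policy_id"


def find_policy_drift(explain: Dict[str, Any], expected_policy: str) -> Dict[str, List[str]]:
    """Split indices into unmanaged and mismatched relative to expected_policy."""
    unmanaged: List[str] = []
    mismatched: List[str] = []
    for index_name, details in explain.items():
        if not isinstance(details, dict):
            continue  # skip summary fields such as total_managed_indices
        policy = details.get(POLICY_KEY) or details.get("policy_id")
        if policy is None:
            unmanaged.append(index_name)
        elif policy != expected_policy:
            mismatched.append(index_name)
    return {"unmanaged": sorted(unmanaged), "mismatched": sorted(mismatched)}


# Illustrative response; real output carries more fields per index.
sample_explain = {
    "logs-app-000001": {POLICY_KEY: "hot_warm_cold_policy"},
    "logs-app-000002": {POLICY_KEY: None},
    "logs-app-000003": {POLICY_KEY: "legacy_policy"},
    "total_managed_indices": 2,
}
drift = find_policy_drift(sample_explain, "hot_warm_cold_policy")
print(drift)
```

Indices listed under unmanaged can be passed back to the attachment script; mismatched ones need a deliberate policy change rather than a blind re-attach.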

5. Debugging & Fallback Routing Strategies

Silent policy drops and unassigned shards typically stem from routing attribute mismatches or stuck transitions. Use the following diagnostic workflow to isolate failures and enforce recovery.

Verify Policy Execution State

Shell
curl -s -X GET "localhost:9200/_plugins/_ism/explain/logs-app-2024.01.01"

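Inspect state, action, and retry_info. A policy_id of null means no policy is attached to the index. If no state has been assigned after more than one job interval (5 minutes by default), confirm that ISM is enabled (plugins.index_state_management.enabled) and that the referenced policy exists.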
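For clusters with many indices, the same inspection can be done in code. This is a minimal sketch, assuming a parsed explain response in which each managed index exposes state.name, action.name, action.failed, retry_info, and info.message. The function name summarize_failures and the sample values, including the error message, are illustrative.

```python
from typing import Any, Dict, List


def summarize_failures(explain: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Collect indices whose current action or retry status is marked failed."""
    failures: List[Dict[str, Any]] = []
    for index_name, details in explain.items():
        if not isinstance(details, dict):
            continue  # skip summary fields such as total_managed_indices
        action = details.get("action") or {}
        retry = details.get("retry_info") or {}
        if action.get("failed") or retry.get("failed"):
            failures.append({
                "index": index_name,
                "state": (details.get("state") or {}).get("name"),
                "action": action.get("name"),
                "consumed_retries": retry.get("consumed_retries", 0),
                "message": (details.get("info") or {}).get("message"),
            })
    return failures


# Illustrative explain payload with one failed and one healthy index.
sample = {
    "logs-app-000001": {
        "state": {"name": "hot"},
        "action": {"name": "rollover", "failed": True},
        "retry_info": {"failed": True, "consumed_retries": 3},
        "info": {"message": "example failure message"},
    },
    "logs-app-000002": {
        "state": {"name": "warm"},
        "action": {"name": "force_merge", "failed": False},
        "retry_info": {"failed": False, "consumed_retries": 0},
    },
    "total_managed_indices": 2,
}
failures = summarize_failures(sample)
for item in failures:
    print(item["index"], item["state"], item["action"], item["message"])
```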

Diagnose Unassigned Shards

Shell
curl -s -X GET "localhost:9200/_cluster/allocation/explain?pretty"

Look for can_allocate: "no" and explanation fields. Common causes:

  • Missing node.attr.data on target nodes.
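  • Disk usage on the target nodes above cluster.routing.allocation.disk.watermark.low (no new shards are allocated to the node) or cluster.routing.allocation.disk.watermark.high (shards are moved off the node).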
  • Conflicting index.routing.allocation rules across phases.
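The explain output lists one decision per candidate node, which gets long on large clusters. The sketch below tallies which allocation deciders answered NO, assuming the standard response shape with node_allocation_decisions and nested deciders. The function name blocking_deciders and the sample payload are hypothetical.

```python
from collections import Counter
from typing import Any, Dict


def blocking_deciders(explain: Dict[str, Any]) -> Counter:
    """Count NO decisions per decider across all candidate nodes."""
    counts: Counter = Counter()
    for node in explain.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            if decider.get("decision") == "NO":
                counts[decider.get("decider", "unknown")] += 1
    return counts


# Illustrative payload: two nodes rejected by the attribute filter,
# one of them also rejected by the disk threshold decider.
sample_explain = {
    "index": "logs-app-2024.01.01",
    "can_allocate": "no",
    "node_allocation_decisions": [
        {"node_name": "hot-01", "deciders": [
            {"decider": "filter", "decision": "NO", "explanation": "node does not match require filters"},
        ]},
        {"node_name": "cold-01", "deciders": [
            {"decider": "filter", "decision": "NO", "explanation": "node does not match require filters"},
            {"decider": "disk_threshold", "decision": "NO", "explanation": "disk usage exceeds watermark"},
        ]},
    ],
}
counts = blocking_deciders(sample_explain)
print(counts.most_common())
```

A tally dominated by filter points at attribute or routing rules, while disk_threshold points at capacity on the target tier.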

Force Transition & Re-attach Policy

If an index is stuck in a failed state, retry the failed action. To restart from a specific state instead, pass {"state": "<state_name>"} as the request body:

Shell
curl -s -X POST "localhost:9200/_plugins/_ism/retry/logs-app-2024.01.01"

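For routing mismatches during node decommissioning, apply a fallback allocation override. This request changes routing only. A settings PUT cannot attach an ISM policy, so re-attach the policy separately with POST _plugins/_ism/add/<index>. Any later allocation action in the policy overwrites the manual value: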

JSON
PUT logs-app-2024.01.01/_settings
{
  "index.routing.allocation.require.data": "warm"
}

Monitor cluster health with GET _cluster/health?wait_for_status=yellow&timeout=60s. If shards remain unassigned after routing correction, verify that the target tier has sufficient free disk space and that cluster.routing.allocation.enable is not set to none.

Implement automated alerting on _plugins/_ism/explain failures and _cluster/allocation/explain outputs. Proactive routing validation prevents cascading shard relocation storms during peak ingestion windows. For deeper architectural validation, consult the official Index State Management API Reference.