How to configure OpenSearch ISM hot warm cold architecture
Deploying a tiered storage model requires deterministic alignment between hardware capabilities, shard routing rules, and Index State Management (ISM) policy execution. Misconfigured node attributes, malformed phase thresholds, or silent policy detachment directly cause unassigned shards, uncontrolled rollovers, and degraded query latency. This guide delivers exact configuration payloads, routing enforcement patterns, and Python automation workflows to implement a production-grade lifecycle. Understanding the underlying OpenSearch ISM Architecture & Fundamentals is mandatory before applying these configurations to production clusters.
1. Node Role Allocation & Tier Routing Configuration
OpenSearch routes indices to specific hardware tiers using custom node attributes. ISM relies on index.routing.allocation.require.data to enforce placement during phase transitions. Configure opensearch.yml on each node group with the data role and an explicit custom attribute. OpenSearch does not define the data_hot, data_warm, or data_cold roles that Elasticsearch uses for its data tiers, so the tier is expressed only through node.attr.data; the explicit attribute keeps routing deterministic across cluster upgrades.
Hot Tier (High IOPS, NVMe, Ingest/Query)
node.name: "hot-01"
node.roles: ["data", "ingest"]
node.attr.data: "hot"
path.data: /mnt/nvme/opensearch
Warm Tier (Balanced I/O, SSD, Search/Retention)
node.name: "warm-01"
node.roles: ["data"]
node.attr.data: "warm"
path.data: /mnt/ssd/opensearch
Cold Tier (High Capacity, HDD, Archive)
node.name: "cold-01"
node.roles: ["data"]
node.attr.data: "cold"
path.data: /mnt/hdd/opensearch
Restart nodes sequentially after applying changes. Verify attribute propagation before creating indices:
curl -s -X GET "localhost:9200/_cat/nodeattrs?v&h=node,attr,value"
Each node should list a data attribute with its tier as the value. (_cat/nodes reports roles but not custom attributes, so _cat/nodeattrs is the endpoint to use for this check.) If the attribute is missing on any node, routing enforcement will fail during ISM transitions. Cross-reference your hardware layout with established Hot-Warm-Cold Tier Design patterns to ensure capacity ratios match expected ingestion velocity and retention windows.
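One way to automate this verification is to read the node attributes as JSON and confirm that every tier has at least one node. The sketch below is a minimal example, not a definitive implementation: the helper names and the sample rows are hypothetical, and the rows assume the shape returned by `client.cat.nodeattrs(format="json", h="node,attr,value")` in opensearch-py.

```python
from collections import defaultdict


def nodes_by_tier(rows, attr_name="data"):
    """Group node names by the value of one custom node attribute."""
    tiers = defaultdict(list)
    for row in rows:
        if row.get("attr") == attr_name:
            tiers[row["value"]].append(row["node"])
    return dict(tiers)


def missing_tiers(rows, required=("hot", "warm", "cold"), attr_name="data"):
    """Return the required tiers that no node advertises."""
    present = nodes_by_tier(rows, attr_name)
    return [tier for tier in required if tier not in present]


# Hypothetical rows; cold-01 is missing its data attribute.
sample_rows = [
    {"node": "hot-01", "attr": "data", "value": "hot"},
    {"node": "warm-01", "attr": "data", "value": "warm"},
    {"node": "cold-01", "attr": "zone", "value": "a"},
]

print(missing_tiers(sample_rows))  # -> ['cold']
```

Running a check like this before creating indices surfaces a missing attribute while it is still cheap to fix.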
2. Index Template Versioning & Routing Enforcement
Templates must declare the ISM policy attachment and initial routing allocation at index creation. Use the _index_template API (v2) to enforce versioning, prevent legacy template collisions, and guarantee deterministic shard distribution.
PUT _index_template/logs_tiered_v2
{
  "index_patterns": ["logs-app-*"],
  "priority": 500,
  "template": {
    "settings": {
      "index.plugins.index_state_management.policy_id": "hot_warm_cold_policy",
      "plugins.index_state_management.rollover_alias": "logs-app",
      "index.routing.allocation.require.data": "hot",
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "refresh_interval": "5s",
      "translog.flush_threshold_size": "512mb"
    },
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "message": { "type": "text" },
        "trace_id": { "type": "keyword" }
      }
    }
  }
}
Operational constraints:
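- index.plugins.index_state_management.policy_id requests attachment of the ISM policy at index creation. (This is the OpenSearch setting; index.lifecycle.name is the Elasticsearch ILM equivalent and is ignored by OpenSearch.) The policy_id index setting has been deprecated since Open Distro 1.13 in favor of an ism_template block inside the policy, so confirm with the explain API that new indices are actually managed on your version.
- index.routing.allocation.require.data forces initial shard placement exclusively on hot nodes.
- priority must be higher than that of every other composable template whose patterns match logs-app-*. Only the highest-priority match is applied, and legacy _template entries are not consulted once a composable template matches.
- The rollover action requires the plugins.index_state_management.rollover_alias setting and a write alias (is_write_index: true) on the first index in the series.
- Version suffixes (_v2) enable safe rollbacks without disrupting active indices or triggering unintended rollovers.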
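The priority constraint can be checked offline before a template is deployed. The following sketch assumes the response shape of `GET _index_template` (an `index_templates` list of `name` / `index_template` pairs) and uses `fnmatch` as an approximation of index-pattern matching; `matching_templates` and the sample data are hypothetical names introduced for illustration.

```python
from fnmatch import fnmatch


def matching_templates(response, sample_index):
    """Return (name, priority) pairs for composable templates whose
    patterns match sample_index, highest priority first."""
    matches = []
    for entry in response.get("index_templates", []):
        body = entry.get("index_template", {})
        patterns = body.get("index_patterns", [])
        if any(fnmatch(sample_index, pattern) for pattern in patterns):
            matches.append((entry["name"], body.get("priority", 0)))
    return sorted(matches, key=lambda match: match[1], reverse=True)


# Hypothetical response containing an overlapping lower-priority template.
sample_response = {
    "index_templates": [
        {"name": "logs_generic",
         "index_template": {"index_patterns": ["logs-*"], "priority": 100}},
        {"name": "logs_tiered_v2",
         "index_template": {"index_patterns": ["logs-app-*"], "priority": 500}},
        {"name": "metrics",
         "index_template": {"index_patterns": ["metrics-*"], "priority": 900}},
    ]
}

ranked = matching_templates(sample_response, "logs-app-2024.01.01")
print(ranked[0])  # -> ('logs_tiered_v2', 500)
```

If the first entry is not the tiered template, a new index would be created without the hot-tier routing setting.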
3. ISM Policy Definition & Phase Thresholds
The policy defines deterministic state transitions, rollover triggers, and routing reallocations. ISM evaluates conditions every 5 minutes by default (plugins.index_state_management.job_interval). Configure thresholds to align with storage capacity and query SLAs.
PUT _plugins/_ism/policies/hot_warm_cold_policy
{
  "policy": {
    "description": "Tiered lifecycle for application logs",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [
          {
            "rollover": {
              "min_index_age": "1d",
              "min_primary_shard_size": "50gb"
            }
          }
        ],
        "transitions": [
          {
            "state_name": "warm",
            "conditions": { "min_index_age": "30d" }
          }
        ]
      },
      {
        "name": "warm",
        "actions": [
          {
            "allocation": {
              "require": { "data": "warm" }
            }
          },
          {
            "replica_count": { "number_of_replicas": 0 }
          },
          {
            "force_merge": { "max_num_segments": 1 }
          }
        ],
        "transitions": [
          {
            "state_name": "cold",
            "conditions": { "min_index_age": "60d" }
          }
        ]
      },
      {
        "name": "cold",
        "actions": [
          {
            "allocation": {
              "require": { "data": "cold" }
            }
          },
          {
            "read_only": {}
          }
        ],
        "transitions": [
          {
            "state_name": "delete",
            "conditions": { "min_index_age": "90d" }
          }
        ]
      },
      {
        "name": "delete",
        "actions": [
          { "delete": {} }
        ]
      }
    ]
  }
}
Threshold tuning guidelines:
- Hot → Warm: Trigger at 30d to offload recent indices before NVMe saturation.
- Warm → Cold: Trigger at 60d after force_merge completes. Segment compaction reduces query overhead and disk footprint.
- Cold → Delete: Trigger at 90d for compliance-driven retention. The read_only action prevents accidental writes during archival.
- Adjust min_primary_shard_size based on actual shard density. Oversized shards delay warm/cold transitions; undersized shards increase cluster overhead.
4. Python Automation & Bulk Policy Attachment
Production environments require idempotent policy attachment across existing indices and automated drift detection. The following script uses opensearch-py to attach the policy to every matching index, skip indices that are already managed, and log failures without blocking ingestion pipelines.
import logging

from opensearchpy import OpenSearch, exceptions

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger(__name__)


def attach_ism_policy(client, index_pattern, policy_id):
    """Attach an ISM policy to every index matching index_pattern.

    Returns the number of indices the policy was newly attached to.
    """
    attached = 0
    try:
        indices = client.cat.indices(index=index_pattern, h="index", format="json")
        for idx in indices:
            idx_name = idx["index"]
            try:
                # OpenSearch ISM attaches via the add API, not an index setting.
                response = client.transport.perform_request(
                    "POST",
                    f"/_plugins/_ism/add/{idx_name}",
                    body={"policy_id": policy_id},
                )
            except exceptions.RequestError as e:
                if "already" in str(e).lower():
                    logger.info("Policy already attached to %s", idx_name)
                else:
                    logger.error("Failed to attach policy to %s: %s", idx_name, e)
                continue
            # The add API can answer 200 and still report per-index failures,
            # for example when the index already has a policy.
            failed = response.get("failed_indices", [])
            if not failed:
                attached += 1
                continue
            for failure in failed:
                reason = failure.get("reason", "")
                if "already" in reason.lower():
                    logger.info("Policy already attached to %s", idx_name)
                else:
                    logger.error("Failed to attach policy to %s: %s", idx_name, reason)
        logger.info("Attached policy to %d indices.", attached)
    except exceptions.ConnectionError as e:
        logger.critical("Cluster connection failed: %s", e)
    except Exception as e:
        logger.error("Unexpected error during policy attachment: %s", e)
    return attached


if __name__ == "__main__":
    host = "https://opensearch-cluster.internal:9200"
    client = OpenSearch(
        hosts=[host],
        # Placeholder credentials; load real ones from a secret store.
        http_auth=("admin", "secure_password"),
        verify_certs=True,
        timeout=30,
    )
    attach_ism_policy(client, "logs-app-*", "hot_warm_cold_policy")
Reference the official OpenSearch Python Client documentation for authentication methods and connection pooling configurations. Schedule this script via cron or Kubernetes CronJob to enforce policy compliance after template updates or cluster migrations.
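Drift detection can reuse the explain API: any index matching the pattern that is unmanaged, or managed by a different policy, has drifted. The sketch below works on an already-fetched explain response so it can run without a cluster. `find_policy_drift` is a hypothetical helper, and the keys it reads (`policy_id` and the setting-style `index.plugins.index_state_management.policy_id`) are an assumption about the response shape that should be verified against your OpenSearch version.

```python
POLICY_SETTING_KEY = "index.plugins.index_state_management.policy_id"


def find_policy_drift(explain_response, expected_policy):
    """Map each drifted index name to the policy it currently reports
    (None when the index is unmanaged)."""
    drift = {}
    for index_name, details in explain_response.items():
        if not isinstance(details, dict):
            continue  # skip summary fields such as total_managed_indices
        attached = details.get("policy_id") or details.get(POLICY_SETTING_KEY)
        if attached != expected_policy:
            drift[index_name] = attached
    return drift


# Hypothetical explain output for GET _plugins/_ism/explain/logs-app-*
sample_explain = {
    "logs-app-000001": {
        POLICY_SETTING_KEY: "hot_warm_cold_policy",
        "policy_id": "hot_warm_cold_policy",
    },
    "logs-app-000002": {POLICY_SETTING_KEY: None},
    "logs-app-000003": {POLICY_SETTING_KEY: "legacy_policy", "policy_id": "legacy_policy"},
    "total_managed_indices": 2,
}

print(find_policy_drift(sample_explain, "hot_warm_cold_policy"))
```

A scheduled job could pass the drifted index names to the attachment script above, or raise an alert when an unexpected policy is found.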
5. Debugging & Fallback Routing Strategies
Silent policy drops and unassigned shards typically stem from routing attribute mismatches or stuck transitions. Use the following diagnostic workflow to isolate failures and enforce recovery.
Verify Policy Execution State
curl -s -X GET "localhost:9200/_plugins/_ism/explain/logs-app-2024.01.01"
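Inspect state, action, and retry_info. If the index still reports no state more than 5 minutes (one default job interval) after creation, either the ISM job scheduler is not running the managed index job or no policy is attached to the index, in which case the explain output shows a null policy_id.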
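The same fields can be inspected programmatically. The sketch below scans an explain response for indices whose current action or retry has failed and collects the reported message. `failed_indices` is a hypothetical helper, and the field names (`action.failed`, `retry_info.failed`, `info.message`, `state.name`) are assumed from typical explain output.

```python
def failed_indices(explain_response):
    """Return {index_name: {state, action, message}} for indices whose
    current action or retry is marked as failed."""
    failures = {}
    for index_name, details in explain_response.items():
        if not isinstance(details, dict):
            continue  # skip summary fields such as total_managed_indices
        action = details.get("action") or {}
        retry = details.get("retry_info") or {}
        if action.get("failed") or retry.get("failed"):
            info = details.get("info") or {}
            state = details.get("state") or {}
            failures[index_name] = {
                "state": state.get("name"),
                "action": action.get("name"),
                "message": info.get("message", "no message reported"),
            }
    return failures


# Hypothetical explain output with one failed allocation action.
sample_explain = {
    "logs-app-2024.01.01": {
        "policy_id": "hot_warm_cold_policy",
        "state": {"name": "warm"},
        "action": {"name": "allocation", "failed": True},
        "retry_info": {"failed": True, "consumed_retries": 3},
        "info": {"message": "Failed to update allocation settings"},
    },
    "logs-app-2024.01.02": {
        "policy_id": "hot_warm_cold_policy",
        "state": {"name": "hot"},
        "action": {"name": "rollover", "failed": False},
        "retry_info": {"failed": False, "consumed_retries": 0},
    },
    "total_managed_indices": 2,
}

for name, failure in failed_indices(sample_explain).items():
    print(name, failure["state"], failure["action"], failure["message"])
```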
Diagnose Unassigned Shards
curl -s -X GET "localhost:9200/_cluster/allocation/explain?pretty"
Look for can_allocate: "no" and explanation fields. Common causes:
- Missing node.attr.data on target nodes.
- Disk usage on the target tier above the high watermark (cluster.routing.allocation.disk.watermark.high).
- Conflicting index.routing.allocation rules across phases.
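To narrow down which of these causes applies, the allocation explain output can be reduced to the deciders that voted NO on each node. The following is a minimal sketch that assumes the usual `node_allocation_decisions` / `deciders` structure of the response; `blocking_deciders` and the sample payload are hypothetical.

```python
def blocking_deciders(explain):
    """Return {node_name: ["decider: explanation", ...]} for every node
    on which at least one allocation decider answered NO."""
    blockers = {}
    for node in explain.get("node_allocation_decisions", []):
        reasons = [
            f"{decider['decider']}: {decider['explanation']}"
            for decider in node.get("deciders", [])
            if decider.get("decision") == "NO"
        ]
        if reasons:
            blockers[node.get("node_name", "unknown")] = reasons
    return blockers


# Hypothetical response: the warm node lacks the attribute the index requires.
sample_allocation = {
    "index": "logs-app-2024.01.01",
    "shard": 0,
    "primary": True,
    "can_allocate": "no",
    "node_allocation_decisions": [
        {
            "node_name": "warm-01",
            "node_decision": "no",
            "deciders": [
                {"decider": "filter", "decision": "NO",
                 "explanation": "node does not match index setting [index.routing.allocation.require] filters [data:\"warm\"]"},
                {"decider": "same_shard", "decision": "YES", "explanation": "ok"},
            ],
        },
        {"node_name": "hot-01", "node_decision": "yes", "deciders": []},
    ],
}

for node, reasons in blocking_deciders(sample_allocation).items():
    print(node, reasons)
```

A filter decider points at attribute or routing mismatches, while a disk_threshold decider points at watermark pressure.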
Force Transition & Re-attach Policy
If an index is stuck in a failed state, fix the underlying cause and then retry the failed action:
curl -s -X POST "localhost:9200/_plugins/_ism/retry/logs-app-2024.01.01"
A settings PUT cannot attach an ISM policy, so re-attach the policy separately via POST _plugins/_ism/add/<index>. For routing mismatches during node decommissioning, apply a fallback allocation override:
PUT logs-app-2024.01.01/_settings
{
  "index.routing.allocation.require.data": "warm"
}
Monitor cluster health with GET _cluster/health?wait_for_status=yellow&timeout=60s. If shards remain unassigned after routing correction, verify that the target tier has sufficient free disk space and that cluster.routing.allocation.enable is not set to none.
Implement automated alerting on _plugins/_ism/explain failures and _cluster/allocation/explain outputs. Proactive routing validation prevents cascading shard relocation storms during peak ingestion windows. For deeper architectural validation, consult the official Index State Management API Reference.