Best practices for OpenSearch index lifecycle management
Implementing deterministic index rotation requires replacing ad-hoc cron jobs with explicit state machines governed by OpenSearch Index State Management (ISM). Production environments demand precise rollover thresholds, tier-aware allocation routing, and automated failure recovery. Understanding the underlying OpenSearch ISM Architecture & Fundamentals is critical before deploying policies at scale. This guide delivers exact configurations, debugging workflows, and Python automation patterns for data platform engineers and DevOps teams.
Deterministic Policy State Machine Design
Avoid implicit transitions. Every ISM policy must define explicit conditions and actions per state. Rollover triggers should combine age, size, and document count to prevent oversized shards or excessive small indices. Use the following production-grade policy structure:
PUT _plugins/_ism/policies/log-lifecycle
{
  "policy": {
    "description": "Production log tiering with deterministic rollover",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [
          {
            "rollover": {
              "min_index_age": "1d",
              "min_primary_shard_size": "30gb",
              "min_doc_count": 50000000
            }
          }
        ],
        "transitions": [
          { "state_name": "warm", "conditions": { "min_index_age": "3d" } }
        ]
      },
      {
        "name": "warm",
        "actions": [
          { "allocation": { "require": { "box_type": "warm" } } },
          { "shrink": { "num_new_shards": 1 } },
          { "force_merge": { "max_num_segments": 1 } }
        ],
        "transitions": [
          { "state_name": "cold", "conditions": { "min_index_age": "14d" } }
        ]
      },
      {
        "name": "cold",
        "actions": [
          { "allocation": { "require": { "box_type": "cold" }, "wait_for": false } }
        ],
        "transitions": [
          { "state_name": "delete", "conditions": { "min_index_age": "60d" } }
        ]
      },
      {
        "name": "delete",
        "actions": [{ "delete": {} }]
      }
    ],
    "ism_template": [{
      "index_patterns": ["logs-*"],
      "priority": 100
    }]
  }
}
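Attach policies to new indices through the ism_template block in the policy itself; the older index.plugins.index_state_management.policy_id template setting is deprecated. The rollover action additionally needs index.plugins.index_state_management.rollover_alias set on each managed index, typically through the index template. Enforce Index Template Versioning by incrementing priority and setting the version field to prevent template collisions during rolling deployments. To retrieve and validate a stored policy, use GET _plugins/_ism/policies/<policy_id>; the explain API (GET _plugins/_ism/explain/<index>) operates on indices, not policies.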
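The "no implicit transitions" rule can be checked before a policy is ever sent to the cluster. The following is a minimal sketch, not part of ISM itself: validate_policy is a hypothetical helper that inspects a policy body shaped like the one above and reports undefined transition targets, a missing default state, and transitions without conditions. ISM accepts unconditional transitions, so the last check encodes this guide's convention rather than an API requirement.

```python
def validate_policy(policy_doc):
    """Return a list of structural problems found in an ISM policy body."""
    policy = policy_doc.get("policy", {})
    states = policy.get("states", [])
    names = [state.get("name") for state in states]
    problems = []

    default_state = policy.get("default_state")
    if default_state not in names:
        problems.append(f"default_state {default_state!r} is not a defined state")
    if len(names) != len(set(names)):
        problems.append("duplicate state names")

    for state in states:
        for transition in state.get("transitions", []):
            target = transition.get("state_name")
            if target not in names:
                problems.append(f"{state.get('name')}: transition to undefined state {target!r}")
            # House rule from this guide: every transition carries explicit conditions
            if not transition.get("conditions"):
                problems.append(f"{state.get('name')}: transition to {target!r} has no conditions")
    return problems
```

Running this in CI against the policy JSON files gives an early failure for typos in state names, which would otherwise only surface when a managed index reaches the broken transition.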
Tier-Aware Allocation & Routing Thresholds
ISM allocation relies on explicit node attributes. Set node.attr.box_type to hot, warm, or cold in opensearch.yml during cluster provisioning. Allocation actions can stall without a visible error if cluster.routing.allocation.enable is restricted or disk watermarks block relocation. Align watermarks with physical storage capacity so shards are not throttled prematurely:
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "95%",
    "cluster.routing.allocation.disk.threshold_enabled": true
  }
}
Set wait_for: false in cold-tier allocations to prevent ISM worker threads from blocking on slow storage initialization. Verify routing compliance using GET _cat/shards?v&h=index,shard,prirep,state,node&s=index.
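Routing compliance can also be checked programmatically. The sketch below assumes the _cat/shards output was fetched with format=json (so each row is a dict of strings) and that the caller supplies two hypothetical mappings: node name to box_type, and index name to the tier it is expected to be on. Neither mapping comes from the API; they would be built from your own inventory.

```python
def find_tier_violations(shard_rows, node_tiers, expected_tier):
    """List started shards that sit on a node of the wrong tier.

    shard_rows:    rows from GET _cat/shards?format=json&h=index,shard,prirep,state,node
    node_tiers:    node name -> box_type (assumed to be maintained by the caller)
    expected_tier: index name -> tier the index should currently occupy
    """
    violations = []
    for row in shard_rows:
        if row.get("state") != "STARTED":
            continue  # unassigned or relocating shards have no settled node yet
        want = expected_tier.get(row["index"])
        have = node_tiers.get(row["node"])
        if want is not None and have != want:
            violations.append((row["index"], row["shard"], row["node"], have, want))
    return violations
```

An empty result means every started shard of the listed indices is on a node carrying the expected box_type attribute.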
Cross-Cluster Replication (CCR) Synchronization
ISM and CCR operate independently at the cluster level. Leader indices execute rollover; follower indices must inherit lifecycle behavior via auto-follow patterns. Configure replication with strict index pattern isolation:
POST _plugins/_replication/_autofollow
{
  "leader_alias": "prod-cluster",
  "name": "logs-autofollow",
  "pattern": "logs-*",
  "use_roles": {
    "leader_cluster_role": "all_access",
    "follower_cluster_role": "all_access"
  }
}
Apply identical ISM policies to follower clusters using matching index_patterns. Do not attempt to replicate ISM state metadata directly; instead, attach policies to follower indices via template inheritance. Monitor replication lag using GET _plugins/_replication/follower_stats and pause ISM transitions if the follower’s syncing_indices lag exceeds 30s to prevent data divergence during tier shifts.
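One way to confirm that leader and follower clusters really carry identical policies is to compare the bodies returned by GET _plugins/_ism/policies/<policy_id> on each side. This is a rough sketch: policies_match is a hypothetical helper, and the set of volatile metadata keys it strips (policy_id, last_updated_time, schema_version) is an assumption about the response shape that should be checked against your version.

```python
# Assumed server-managed fields that differ between clusters without changing behaviour
VOLATILE_KEYS = {"policy_id", "last_updated_time", "schema_version"}


def normalize(value):
    """Recursively drop volatile metadata keys from a policy body."""
    if isinstance(value, dict):
        return {k: normalize(v) for k, v in value.items() if k not in VOLATILE_KEYS}
    if isinstance(value, list):
        return [normalize(v) for v in value]
    return value


def policies_match(leader_doc, follower_doc):
    """Compare the 'policy' bodies of two GET-policy responses."""
    leader = leader_doc.get("policy")
    follower = follower_doc.get("policy")
    if leader is None or follower is None:
        return False  # a missing policy never counts as a match
    return normalize(leader) == normalize(follower)
```

Top-level fields such as _seq_no and _primary_term are ignored because only the "policy" object is compared.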
Python Automation & Idempotent Policy Deployment
Automate policy distribution and validation using production-safe HTTP clients. The following script handles idempotent creation, conflict resolution, and attachment verification:
import sys

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

OPENSEARCH_URL = "https://os-cluster.internal:9200"
AUTH = ("admin", "secure_password")  # placeholder; load real credentials from a secret store
POLICY_ID = "log-lifecycle"
TARGET_INDEX = "logs-2024.01.01"
POLICY_PAYLOAD = {
    "policy": {
        "description": "Automated deployment policy",
        "default_state": "hot",
        "states": [{"name": "hot", "actions": [{"rollover": {"min_index_age": "1d"}}], "transitions": []}],
        "ism_template": [{"index_patterns": ["logs-*"], "priority": 100}]
    }
}

# Retry transient HTTP failures with exponential backoff
session = requests.Session()
retries = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retries))
session.verify = "/etc/ssl/certs/ca-bundle.crt"


def deploy_policy():
    url = f"{OPENSEARCH_URL}/_plugins/_ism/policies/{POLICY_ID}"
    try:
        # A plain PUT creates the policy; a 409 means it already exists
        resp = session.put(url, auth=AUTH, json=POLICY_PAYLOAD, timeout=10)
        if resp.status_code == 409:
            print(f"[INFO] Policy '{POLICY_ID}' exists. Updating...")
            # The 409 error body is not expected to carry version info,
            # so read _seq_no and _primary_term from the stored policy
            current = session.get(url, auth=AUTH, timeout=10)
            current.raise_for_status()
            meta = current.json()
            params = {"if_seq_no": meta["_seq_no"], "if_primary_term": meta["_primary_term"]}
            resp = session.put(url, auth=AUTH, params=params, json=POLICY_PAYLOAD, timeout=10)
        resp.raise_for_status()
        print(f"[SUCCESS] Policy '{POLICY_ID}' deployed.")
    except requests.exceptions.RequestException as e:
        print(f"[ERROR] Deployment failed: {e}")
        sys.exit(1)

    # Verify attachment: a 200 from explain is not enough, so check the reported policy_id
    explain = session.get(f"{OPENSEARCH_URL}/_plugins/_ism/explain/{TARGET_INDEX}", auth=AUTH, timeout=10)
    attached = explain.ok and explain.json().get(TARGET_INDEX, {}).get("policy_id") == POLICY_ID
    if attached:
        print("[SUCCESS] Policy attachment verified on target index.")
    else:
        print("[WARN] Target index not yet attached. Template priority may require adjustment.")


if __name__ == "__main__":
    deploy_policy()
Parse JSON responses with resp.json() or the standard Python json module for downstream validation. Wrap API calls in exponential backoff to handle ISM worker thread contention during peak ingestion windows.
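For calls that do not go through an HTTP adapter with built-in retries, exponential backoff can be applied at the application level. The helper below is a generic sketch, not an OpenSearch API: call_with_backoff, its parameter names, and its default values are all illustrative. The sleep function is injectable so the behaviour can be tested without waiting.

```python
import random
import time


def call_with_backoff(fn, attempts=5, base_delay=0.5, max_delay=30.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fn(), retrying on the given exceptions with exponentially growing delays."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Up to 10% jitter so parallel clients do not retry in lockstep
            sleep(delay + random.uniform(0, delay / 10))
```

A typical call would wrap a single request, for example call_with_backoff(lambda: session.get(url, timeout=10), retry_on=(requests.exceptions.ConnectionError,)), where session and url are whatever the surrounding script already defines.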
Debugging Workflows & Failure Recovery
When ISM transitions stall, isolate the failure point using the explain API:
GET _plugins/_ism/explain/<index_name>
Inspect the action and retry_info objects for the failed flag, consumed_retries, and last_retry_time, and read info.message for the failure cause. If allocation fails due to watermark thresholds, apply transient overrides to unblock stuck shards:
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "90%",
    "cluster.routing.allocation.disk.watermark.high": "95%"
  }
}
Reset transient settings immediately after recovery to prevent long-term capacity degradation. For persistent policy errors, clear the stuck state and re-trigger evaluation:
POST _plugins/_ism/change_policy/<index_name>
{
  "policy_id": "log-lifecycle",
  "state": "hot"
}
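Returning to the transient watermark overrides: a cluster setting is reset to its persistent or default value by sending null for it. The snippet below is a small sketch that builds such a reset body for PUT _cluster/settings; transient_reset_body is a hypothetical helper name.

```python
import json


def transient_reset_body(setting_names):
    """Build a PUT _cluster/settings body that clears the given transient settings."""
    # None serialises to JSON null, which removes the transient override
    return {"transient": {name: None for name in setting_names}}


body = transient_reset_body([
    "cluster.routing.allocation.disk.watermark.low",
    "cluster.routing.allocation.disk.watermark.high",
])
print(json.dumps(body, indent=2))
```

Sending this body once recovery completes restores the persistent watermark values configured earlier.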
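Monitor managed indices by running the explain API across the whole pattern, for example GET _plugins/_ism/explain/logs-*, and compare retry_info.consumed_retries per index; _cat/indices does not expose ISM state, action, or retry columns. High retry counts indicate misconfigured conditions or insufficient node capacity for shrink operations.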
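Explain output for many indices is easier to act on once it is reduced to the failing entries. The sketch below assumes the response maps each index name to an object containing retry_info, action, and info fields; those field names are taken from the explain response format as commonly documented and should be verified against your cluster. failed_managed_indices is a hypothetical helper.

```python
def failed_managed_indices(explain_response, min_retries=0):
    """Return failing or retrying indices from an ISM explain response, worst first."""
    failures = []
    for index, details in explain_response.items():
        if not isinstance(details, dict):
            continue  # skip top-level counters such as total_managed_indices
        retry = details.get("retry_info") or {}
        action = details.get("action") or {}
        consumed = retry.get("consumed_retries", 0)
        if retry.get("failed") or action.get("failed") or consumed > min_retries:
            failures.append({
                "index": index,
                "action": action.get("name"),
                "consumed_retries": consumed,
                "message": (details.get("info") or {}).get("message"),
            })
    return sorted(failures, key=lambda f: f["consumed_retries"], reverse=True)
```

Feeding the JSON body of an explain call on logs-* into this function yields a short list suitable for alerting or for deciding which indices need a manual retry.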
Security Boundaries & Access Control
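Restrict ISM API access using role-based access control (RBAC). Map the ISM cluster-level and index-level permissions to dedicated automation service accounts. In the OpenSearch security plugin these action names keep the legacy prefix (for example cluster:admin/opendistro/ism/* and indices:admin/opensearch/ism/*), so confirm the exact names against the permission list for your version. Avoid granting cluster settings update permissions (cluster:admin/settings/update) to policy management roles; isolate watermark and routing configuration to infrastructure operators. Validate policy execution boundaries by auditing GET _plugins/_security/api/roles/ism_automation_role and ensuring index_patterns in ism_template do not overlap with security or system indices.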
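The overlap check on ism_template patterns can be automated. The following sketch uses Python's fnmatch as an approximation of OpenSearch wildcard matching (fnmatch also honours ? and [...], which index patterns do not), and the list of protected index names is a sample assumption that should be replaced with the system indices actually present in your cluster.

```python
from fnmatch import fnmatchcase

# Sample protected index names; extend with the system indices in your cluster
PROTECTED_INDICES = (".opendistro_security", ".opendistro-ism-config", ".kibana_1")


def find_protected_overlaps(index_patterns, protected=PROTECTED_INDICES):
    """Map each ism_template pattern to the protected indices it would match."""
    overlaps = {}
    for pattern in index_patterns:
        hits = [name for name in protected if fnmatchcase(name, pattern)]
        if hits:
            overlaps[pattern] = hits
    return overlaps
```

A non-empty result for any policy's index_patterns is a signal to narrow the pattern before the policy is deployed.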
Mastering deterministic state transitions, tier routing, and automated recovery ensures predictable storage utilization and consistent query performance. For foundational configuration patterns, review Index Lifecycle Basics before scaling policies across multi-cluster environments.