Configuring index size and age thresholds for rollover
Configuring index size and age thresholds for rollover requires deterministic alignment between OpenSearch Index State Management (ISM) policy conditions, shard allocation constraints, and replication topology. Misconfigured thresholds trigger premature rollovers, uncontrolled shard proliferation, or Cross-Cluster Replication (CCR) follower desynchronization. This guide delivers exact DSL payloads, evaluation mechanics, and Python orchestration patterns to enforce strict rollover boundaries and minimize mean time to resolution (MTTR) for production data platforms.
DSL Configuration & Condition Evaluation Mechanics
The rollover action evaluates threshold conditions directly on the action object. OpenSearch applies logical OR evaluation across rollover conditions: the index rolls over as soon as any configured condition (min_size, min_index_age, min_doc_count, min_primary_shard_size) is satisfied. Use whole-number unit values for predictability, with standard suffixes (gb, tb, h, d, m). The min_size condition measures total primary shard storage, excluding replicas; min_primary_shard_size measures the largest single primary shard.
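The OR semantics can be sketched locally, which is useful for unit-testing threshold choices before a policy is deployed. The helper names (parse_size, parse_age, should_roll_over) and the shape of the stats dictionary are illustrative assumptions, not part of ISM; the real evaluation happens inside the plugin.

```python
import re

# Multipliers for the byte and time suffixes used in this guide (assumed subset).
_SIZE_UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4}
_AGE_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def _parse(value, units):
    """Turn a whole-number value such as '45gb' into number * unit multiplier."""
    match = re.fullmatch(r"(\d+)([a-z]+)", value.strip().lower())
    if not match or match.group(2) not in units:
        raise ValueError(f"unsupported threshold value: {value!r}")
    return int(match.group(1)) * units[match.group(2)]


def parse_size(value):
    return _parse(value, _SIZE_UNITS)


def parse_age(value):
    return _parse(value, _AGE_UNITS)


def should_roll_over(conditions, stats):
    """Return True if ANY configured condition is met (logical OR)."""
    checks = []
    if "min_size" in conditions:
        checks.append(stats["primary_size_bytes"] >= parse_size(conditions["min_size"]))
    if "min_index_age" in conditions:
        checks.append(stats["age_seconds"] >= parse_age(conditions["min_index_age"]))
    if "min_doc_count" in conditions:
        checks.append(stats["doc_count"] >= conditions["min_doc_count"])
    if "min_primary_shard_size" in conditions:
        checks.append(
            stats["largest_primary_shard_bytes"]
            >= parse_size(conditions["min_primary_shard_size"])
        )
    return any(checks)
```

With the 45gb / 24h pair used in this guide, a 10 GB index that is 25 hours old rolls over on age alone. An empty condition set returns False here, which is a simplification of this sketch rather than a statement about ISM behavior.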
PUT _plugins/_ism/policies/log-rollover-policy
{
  "policy": {
    "description": "Deterministic size and age rollover for log indices",
    "default_state": "hot",
    "ism_template": [
      {
        "index_patterns": ["logs-*"],
        "priority": 100
      }
    ],
    "states": [
      {
        "name": "hot",
        "actions": [
          {
            "rollover": {
              "min_size": "45gb",
              "min_index_age": "24h"
            }
          }
        ],
        "transitions": [
          {
            "state_name": "warm",
            "conditions": {
              "min_rollover_age": "0h"
            }
          }
        ]
      },
      {
        "name": "warm",
        "actions": [],
        "transitions": []
      }
    ]
  }
}
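Attach the policy to existing indices using POST _plugins/_ism/add/<index>. Rollover execution requires a write alias pointing to the active index, and ISM locates that alias through the index setting plugins.index_state_management.rollover_alias. Validate alias resolution before deployment: GET _alias/<write_alias>. If the alias resolves to multiple indices and none is flagged is_write_index: true, the rollover is rejected (the _rollover API returns 400 Bad Request) because the write target is ambiguous.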
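The alias check can be automated against the JSON returned by the alias lookup. The sketch below assumes the usual response shape (index names as top-level keys, each with an aliases object); find_write_index is a hypothetical helper name, and the function works on an already-fetched dictionary so it can be tested without a cluster.

```python
def find_write_index(alias_response, alias_name):
    """Return the single write index behind alias_name, or raise ValueError."""
    # Collect the alias properties for every index that carries the alias.
    backing = {
        index: body["aliases"][alias_name]
        for index, body in alias_response.items()
        if alias_name in body.get("aliases", {})
    }
    if not backing:
        raise ValueError(f"alias {alias_name!r} does not resolve to any index")

    # Prefer an explicit is_write_index: true flag.
    flagged = [i for i, props in backing.items() if props.get("is_write_index") is True]
    if len(flagged) == 1:
        return flagged[0]

    # Without a flag, the target is only unambiguous if there is exactly one index.
    if not flagged and len(backing) == 1:
        index, props = next(iter(backing.items()))
        if props.get("is_write_index") is not False:
            return index

    raise ValueError(f"alias {alias_name!r} has no unambiguous write index")
```

Running this in a pre-deployment check turns an ambiguous alias into an early failure instead of a rollover error discovered later.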
Asynchronous Scheduler Latency & Threshold Calibration
ISM evaluates thresholds asynchronously via the background scheduler. The plugins.index_state_management.job_interval setting defaults to 5 minutes, so an index may exceed its configured limits for roughly one job interval (plus any scheduler jitter) before rollover triggers. When applying Threshold Tuning Strategies, set thresholds 10–15% below hard storage limits to absorb ingestion spikes during the evaluation window.
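The calibration arithmetic can be made explicit. The following is a minimal sketch, assuming a known hard limit per index and a measured peak ingest rate; calibrate_min_size and its parameters are hypothetical names, and the model ignores scheduler jitter and the time the rollover itself takes.

```python
import math


def calibrate_min_size(hard_limit_gb, peak_ingest_gb_per_min,
                       job_interval_min=5, margin=0.15):
    """Derive a min_size value and the worst-case size reached before rollover."""
    # Apply the safety margin and round down to a whole-number unit value.
    threshold_gb = math.floor(hard_limit_gb * (1 - margin))
    # Data that can arrive between two scheduler evaluations at peak rate.
    overshoot_gb = peak_ingest_gb_per_min * job_interval_min
    worst_case_gb = threshold_gb + overshoot_gb
    return {
        "min_size": f"{threshold_gb}gb",
        "worst_case_gb": worst_case_gb,
        "within_limit": worst_case_gb <= hard_limit_gb,
    }
```

For a 50 GB hard limit and a 0.5 GB/min peak, this yields a 42gb threshold with a worst case of 44.5 GB. When within_limit comes back False, the margin alone is not enough and either the threshold or the job interval needs to shrink.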
If ingestion velocity exceeds shard capacity before the next job cycle, bypass the scheduler and force immediate rollover. The _rollover API targets the write alias (optionally naming the new backing index):
POST /<write_alias>/_rollover
Verify the managed-index state immediately after execution:
GET _plugins/_ism/explain/<current_index>
If the index remains in the hot state, inspect the explain response’s action and step objects (step.step_status: "failed" and the info message) for diagnostics. Common blockers include alias misconfiguration, insufficient disk watermark clearance, or concurrent policy updates.
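Scanning the explain output for failures is easy to script. This sketch assumes each managed index entry carries action, step, and info objects, with the step's status in a step_status field and an action-level failed flag; find_failed_indices is a hypothetical helper, and it operates on a response that has already been fetched.

```python
def find_failed_indices(explain_response):
    """Map each failed managed index to its action, step, and info message."""
    failures = {}
    for index_name, meta in explain_response.items():
        if not isinstance(meta, dict):
            continue  # skip summary fields such as total_managed_indices
        step = meta.get("step") or {}
        action = meta.get("action") or {}
        if step.get("step_status") == "failed" or action.get("failed") is True:
            failures[index_name] = {
                "action": action.get("name"),
                "step": step.get("name"),
                "message": (meta.get("info") or {}).get("message"),
            }
    return failures
```

An empty result means no index in the response reported a failed step, which is not the same as confirming that a rollover has happened.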
Python Orchestration & Automated Validation
Production deployments require automated policy deployment, threshold validation, and state polling. The following Python orchestration pattern uses opensearch-py to enforce idempotent configuration, implement exponential backoff retries, and parse ISM state responses. Refer to ISM Policy Implementation & Python Automation for extended framework integration.
import time
import logging
from opensearchpy import OpenSearch, exceptions

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def deploy_and_validate_rollover(hosts, policy_id, index_pattern, max_retries=3):
    client = OpenSearch(hosts=hosts, use_ssl=True, verify_certs=True)
    policy_body = {
        "policy": {
            "description": "Automated size/age rollover",
            "default_state": "hot",
            "states": [
                {
                    "name": "hot",
                    "actions": [{"rollover": {"min_size": "45gb", "min_index_age": "24h"}}],
                    "transitions": [{"state_name": "warm", "conditions": {"min_rollover_age": "0h"}}],
                },
                # The transition target must exist in the policy; this is a placeholder.
                {"name": "warm", "actions": [], "transitions": []},
            ],
        }
    }
    # opensearch-py has no `.ism` namespace; ISM is driven via transport.perform_request.
    policy_url = f"/_plugins/_ism/policies/{policy_id}"

    # Idempotent policy creation/update (updates require seq_no + primary_term).
    try:
        client.transport.perform_request("PUT", policy_url, body=policy_body)
        logger.info(f"Policy {policy_id} deployed successfully.")
    except exceptions.ConflictError:
        # HTTP 409: the policy already exists, so update it in place.
        existing = client.transport.perform_request("GET", policy_url)
        client.transport.perform_request(
            "PUT",
            policy_url,
            params={"if_seq_no": existing["_seq_no"], "if_primary_term": existing["_primary_term"]},
            body=policy_body,
        )
        logger.info(f"Policy {policy_id} updated in place.")

    # Attach the policy to matching indices (index goes in the path).
    client.transport.perform_request(
        "POST", f"/_plugins/_ism/add/{index_pattern}", body={"policy_id": policy_id}
    )

    # Poll for state transition with retry logic. The explain response returns
    # index names as top-level keys plus a `total_managed_indices` summary field.
    for attempt in range(max_retries):
        try:
            explain = client.transport.perform_request(
                "GET", f"/_plugins/_ism/explain/{index_pattern}"
            )
            for idx_name, state_data in explain.items():
                if not isinstance(state_data, dict):
                    continue  # skip total_managed_indices
                current_state = state_data.get("state", {}).get("name", "unknown")
                if current_state == "hot":
                    logger.info(f"{idx_name} active in hot state. Thresholds enforced.")
                    return True
            logger.warning("No indices matched pattern or state pending.")
        except exceptions.ConnectionError:
            logger.warning(f"Connection failed. Retry {attempt + 1}/{max_retries}")
        # Back off before the next poll, whether the state was pending or the call failed.
        time.sleep(2 ** attempt + 1)
    return False
Execute the orchestration script during CI/CD pipeline validation or as a scheduled cron job. Monitor opensearch.log for IndexStateManagement thread warnings to detect scheduler drift or thread pool exhaustion.
CCR Topology & State Transition Debugging
Cross-Cluster Replication followers do not execute ISM policies. Rollover execution occurs exclusively on the leader cluster. When the leader index rolls over, the auto-follow pattern on the follower cluster detects the new index and initiates replication. Misaligned thresholds between leader and follower shard counts trigger cluster_block_exception or replication_failed errors.
Debug CCR desynchronization using:
GET _plugins/_replication/<follower_index>/_status
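The status response can be reduced to a state and a checkpoint lag for alerting. This is a sketch under the assumption that the response exposes a status string and a syncing_details object with leader_checkpoint and follower_checkpoint; replication_lag is a hypothetical helper name, and the field names should be confirmed against your cluster's output.

```python
def replication_lag(status_response):
    """Return (status, lag) where lag is leader minus follower checkpoint, or None."""
    state = status_response.get("status")
    details = status_response.get("syncing_details") or {}
    leader = details.get("leader_checkpoint")
    follower = details.get("follower_checkpoint")
    if leader is None or follower is None:
        # Paused or failed followers may not report checkpoints at all.
        return state, None
    return state, leader - follower
```

A lag that keeps growing across polls while the status still reads as syncing suggests the follower is falling behind rather than stalled outright.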
If the follower index stalls during rollover, verify that the leader’s write alias points to exactly one index and that the follower’s index.plugins.replication.replication_type remains DOCUMENT. Force follower resync by pausing and resuming replication:
POST _plugins/_replication/<follower_index>/_pause
{}
POST _plugins/_replication/<follower_index>/_resume
{}
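For persistent 400 Bad Request errors during manual rollover, check the preconditions the _rollover API enforces: the alias must resolve to exactly one write index, and the current index name must end in a hyphen and a number (for example, logs-000001) unless the request names the new index explicitly. A differing index.number_of_shards across the indices behind the alias does not by itself block rollover, but it is worth confirming that the index template produces the shard count you expect. See the OpenSearch Index State Management documentation for details.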
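Automate threshold validation against cluster disk watermarks using the Python requests library or opensearch-py to query _cluster/settings (add include_defaults=true to see values that have not been overridden). Because cluster.routing.allocation.disk.watermark.low is usually a percentage of disk while min_size is an absolute size, compare them as free space: each data node should keep at least 15% more than your configured min_size available below the low watermark, so that the shards of the new index can still be allocated during rollover execution.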
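The watermark comparison reduces to a small headroom calculation per data node. The sketch below assumes a percentage watermark and per-node disk figures that have already been collected; rollover_headroom_ok and its parameters are hypothetical names, and absolute or ratio-style watermark values are deliberately out of scope.

```python
def rollover_headroom_ok(low_watermark, disk_total_gb, disk_used_gb,
                         min_size_gb, margin=0.15):
    """Check that free space below the low watermark covers min_size plus a margin."""
    if not low_watermark.endswith("%"):
        raise ValueError("this sketch only handles percentage watermarks")
    # Disk usage at which the node stops receiving new shard allocations.
    limit_gb = disk_total_gb * float(low_watermark.rstrip("%")) / 100
    headroom_gb = limit_gb - disk_used_gb
    return headroom_gb >= min_size_gb * (1 + margin)
```

On a 1000 GB node with an 85% low watermark and 700 GB used, a 45 GB min_size passes; at 820 GB used it does not.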