Writing Python scripts for automated ISM rollover triggers

Native OpenSearch Index State Management (ISM) evaluates rollover on a background sweep that defaults to five minutes, so this walkthrough builds a Python script that forces an immediate, deterministic roll the moment a hot index crosses a size threshold. It is the deep implementation behind the deploy-and-verify pattern introduced in Rollover Trigger Configuration, aimed at high-throughput logging pipelines where the sweep latency lets a write shard overshoot past its recovery-safe ceiling before ISM ever acts.

The script does not replace the ISM policy: the policy still owns tiering and deletion under the ISM Policy Implementation & Python Automation model. It runs alongside the policy as an out-of-band trigger that calls the _rollover API against the write alias when your own metric evaluation fires sooner than the next sweep would. As long as the call carries a conditions block, a repeat request against a freshly rolled index is a no-op, so the native sweep and the external trigger can coexist without double-rolling.

Prerequisites

Confirm each of these before running the script — a missing write alias or an over-broad service-account role is the usual reason the first execution fails silently.

A rollover-ready index template is deployed with plugins.index_state_management.rollover_alias set, and the first backing index (logs-000001) is bootstrapped with is_write_index: true — the setup covered in Rollover Trigger Configuration.
The write alias has exactly one backing index flagged is_write_index: true; after the first roll the alias spans several indices, and without a single flagged write index the target is ambiguous and _rollover returns 400.
Hot nodes carry the routing attribute the bootstrapped index targets, per Node Role Allocation, so the newly created index can actually allocate.
A scoped service account exists whose role grants only indices:admin/rollover, cluster:monitor/*, and read on the alias pattern — endpoint scoping is detailed in Security & Access Boundaries.
python >= 3.11 and aiohttp are installed (asyncio.timeout() used below is a 3.11+ context manager).
In a Cross-Cluster Replication (CCR) topology, the follower is confirmed reachable so replication of each new backing index can be checked after the leader rolls (see Verification).
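
The write-alias prerequisite can be checked mechanically before the first run. The sketch below is a minimal illustration, assuming the response shape of GET /<alias>/_alias (each index name mapped to an aliases object); find_write_index and the sample payload are hypothetical and are not part of the script built in the procedure.

```python
def find_write_index(alias_response: dict, alias: str) -> str:
    """Return the single backing index flagged is_write_index for `alias`.

    `alias_response` is the parsed JSON body of GET /<alias>/_alias.
    Raises ValueError when zero or several write indices are found.
    """
    write_indices = [
        index_name
        for index_name, body in alias_response.items()
        if body.get("aliases", {}).get(alias, {}).get("is_write_index") is True
    ]
    if len(write_indices) != 1:
        raise ValueError(
            f"alias {alias!r} has {len(write_indices)} write indices, expected 1"
        )
    return write_indices[0]


# Hypothetical response for an alias that has already rolled once.
sample = {
    "logs-000007": {"aliases": {"logs-write": {"is_write_index": False}}},
    "logs-000008": {"aliases": {"logs-write": {"is_write_index": True}}},
}
print(find_write_index(sample, "logs-write"))  # logs-000008
```

The check is deliberately strict: it insists on an explicit is_write_index flag, which matches the bootstrap described in the prerequisites.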

Step-by-step procedure

The procedure builds the script bottom-up: unit parsing first, then a pooled async client, then the guarded rollover call, then the evaluation loop, and finally the CronJob that runs it. Every block is standard library plus aiohttp.

1. Normalize OpenSearch size units to raw bytes

_cat/indices returns store.size as a human-readable string like 48.7gb. Comparing those strings directly misfires because the comparison is lexicographic ("9gb" sorts above "50gb"), and comparing the numeric part alone ignores the unit. Map each suffix to an explicit integer multiplier and convert everything to bytes.

Python

# Suffixes as printed by _cat APIs. "b" must stay last: every other suffix also ends in "b".
UNIT_MULTIPLIERS = {"kb": 1024, "mb": 1024**2, "gb": 1024**3,
                    "tb": 1024**4, "pb": 1024**5, "b": 1}

def parse_size_to_bytes(size_str: str) -> int:
    """Convert a _cat size string such as '48.7gb' to whole bytes."""
    if not size_str:
        return 0
    size_str = size_str.strip().lower()
    for unit, mult in UNIT_MULTIPLIERS.items():
        if size_str.endswith(unit):
            return int(float(size_str[:-len(unit)]) * mult)
    return int(float(size_str))  # bare number, no suffix

Expected behaviour on representative inputs:

Text

parse_size_to_bytes("48.7gb")  -> 52291226828
parse_size_to_bytes("512mb")   -> 536870912
parse_size_to_bytes("208b")    -> 208
parse_size_to_bytes("900")     -> 900

Gotcha: small indices report their size with a bare b suffix (for example 208b), so a freshly rolled index would crash a parser that only knows kb/mb/gb/tb. Because gb also ends in b, the b entry must be tested after the two-character units; dicts preserve insertion order, so keep it last as shown.
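
If you would rather not reason about float rounding at all, the same conversion can be done with decimal arithmetic. This is a sketch of an alternative, not the parser the rest of the walkthrough uses; parse_size_exact and _UNITS are hypothetical names, and the suffix list assumes the units _cat prints (b through pb).

```python
from decimal import Decimal

# Longest suffixes first so "gb" is tested before the bare "b".
_UNITS = (
    ("pb", 1024**5), ("tb", 1024**4), ("gb", 1024**3),
    ("mb", 1024**2), ("kb", 1024), ("b", 1),
)


def parse_size_exact(size_str: str) -> int:
    """Convert a size string such as '48.7gb' to whole bytes using Decimal."""
    text = (size_str or "").strip().lower()
    if not text:
        return 0
    for suffix, multiplier in _UNITS:
        if text.endswith(suffix):
            # Decimal keeps '48.7' exact; int() truncates the fractional byte.
            return int(Decimal(text[: -len(suffix)]) * multiplier)
    return int(Decimal(text))  # no suffix: already a byte count


print(parse_size_exact("48.7gb"))  # 52291226828
print(parse_size_exact("208b"))    # 208
```

For the one-decimal values _cat prints, this should agree with the float version; the gain is that the result no longer depends on binary rounding.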

2. Build a pooled, timeout-bounded async client

Synchronous calls block when a hop is slow or the OpenSearch cluster is yellow during shard init. Use aiohttp with a bounded connector and an explicit total timeout so the script fails fast instead of hanging a CronJob slot.

Python

import aiohttp

def make_session(auth: aiohttp.BasicAuth) -> aiohttp.ClientSession:
    connector = aiohttp.TCPConnector(limit=100, limit_per_host=10, enable_cleanup_closed=True)
    timeout = aiohttp.ClientTimeout(total=15)
    return aiohttp.ClientSession(connector=connector, timeout=timeout, auth=auth)

Gotcha: limit_per_host=10 keeps the script from exhausting the OpenSearch coordinating thread pool when several evaluation loops run concurrently across a fleet. Do not raise it to match your CronJob parallelism — cap parallelism at the scheduler instead.

3. Fetch live shard metrics for the write alias

Query the lightweight _cat endpoint scoped to the alias pattern and pull only the columns you compare on. This is a cluster:monitor read, not an admin call.

Python

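import asyncio

async def fetch_index_sizes(session, host, alias):
    """Return one dict per index matched by the alias pattern, limited to the compared columns."""
    url = f"{host}/_cat/indices/{alias}*?format=json&h=index,docs.count,store.size,health"
    async with asyncio.timeout(10):  # hard per-request ceiling (Python 3.11+)
        async with session.get(url) as resp:
            resp.raise_for_status()  # surface 4xx/5xx as exceptions
            return await resp.json()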

Expected response shape:

JSON

[
  { "index": "logs-000007", "docs.count": "94128841",
    "store.size": "48.7gb", "health": "green" }
]

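Gotcha: the <alias>* pattern is resolved against alias names as well as index names, so the response lists every backing index attached to the alias, including generations that have already rolled, not just the current write index. An alias with no backing indices returns [] and the loop below does nothing. The numeric suffix convention (logs-00000N) still matters, because _rollover derives the next index name from it.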

4. Roll over with exponential backoff and an idempotency guard

A conditional _rollover call is effectively idempotent against a write alias: if conditions are not met it returns 200 with "rolled_over": false and changes nothing, so safe re-polling never over-rotates. Inspect the rolled_over flag rather than the status code alone, and treat a non-200 response, such as the error returned when the target index already exists, as a real failure.

Python

import asyncio
import logging

import aiohttp

async def rollover_with_retry(session, url, max_retries=3, base_delay=2.0):
    """POST a conditional _rollover; return True once the server answers 200, rolled or not."""
    payload = {"conditions": {"max_size": "50gb", "max_age": "1d"}}
    for attempt in range(max_retries):
        try:
            async with asyncio.timeout(10):
                async with session.post(url, json=payload) as resp:
                    if resp.status == 200:
                        result = await resp.json()
                        # 200 only means the request was evaluated; rolled_over says what happened.
                        if result.get("rolled_over"):
                            logging.info("rolled %s -> %s",
                                         result["old_index"], result["new_index"])
                        else:
                            logging.info("conditions not yet met; no action")
                        return True
                    body = await resp.text()
                    logging.warning("attempt %d: %s - %s", attempt + 1, resp.status, body)
        except (aiohttp.ClientError, TimeoutError) as exc:
            logging.error("network error attempt %d: %s", attempt + 1, exc)
        # Exponential backoff between attempts: 2 s, then 4 s with the defaults.
        if attempt < max_retries - 1:
            await asyncio.sleep(base_delay * (2 ** attempt))
    return False

Expected log line on a successful roll:

Text

2026-07-04 09:14:02 [INFO] rolled logs-000007 -> logs-000008
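
To see how long a failing call can hold a CronJob slot, it helps to list the sleeps the retry loop performs. The helper below is a small sketch that mirrors the loop's arithmetic (base_delay * 2**attempt, with no sleep after the final attempt); backoff_delays is a hypothetical name introduced only for this illustration.

```python
def backoff_delays(max_retries: int = 3, base_delay: float = 2.0) -> list[float]:
    """Delays slept between attempts; nothing is slept after the last attempt."""
    return [base_delay * (2 ** attempt) for attempt in range(max_retries - 1)]


print(backoff_delays())       # [2.0, 4.0]
print(sum(backoff_delays()))  # 6.0
```

With the 10-second per-attempt timeout used in rollover_with_retry, three attempts plus 6 seconds of sleep bound a failing call at roughly 36 seconds, which still fits inside a one-minute schedule.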

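Gotcha: because the server re-checks any conditions block it receives, sending the same conditions the ISM policy already enforces makes the external call a guarded request rather than a forced roll. Drop conditions entirely (POST an empty body) if you want your Python threshold to be the sole authority. An unconditional call rolls every time it is sent, so it is no longer idempotent: make sure the script fires it only for the current write index and only once per breach.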

5. Assemble the evaluation loop

Fetch metrics, convert to bytes, and trigger when the current write index breaches the threshold. Earlier generations stay attached to the alias and typically still exceed the threshold, so they are skipped. Break after one successful call so a single evaluation cycle never fires twice.

Python

import asyncio, os, logging

import aiohttp

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)s] %(message)s")

HOST = os.environ["OPENSEARCH_HOST"]
ALIAS = os.environ.get("WRITE_ALIAS", "logs-write")
MAX_SIZE_BYTES = 50 * 1024**3  # 50 GiB

async def evaluate_and_trigger():
    auth = aiohttp.BasicAuth(os.environ["OS_USER"], os.environ["OS_PASS"])
    async with make_session(auth) as session:
        try:
            indices = await fetch_index_sizes(session, HOST, ALIAS)
        except Exception as exc:
            logging.error("metric fetch failed: %s", exc)
            return
        # Assumes zero-padded generation suffixes (logs-000007), so the highest
        # name is the newest generation, i.e. the current write index.
        newest = max((idx["index"] for idx in indices), default=None)
        for idx in indices:
            if idx["index"] != newest:
                continue  # already rolled; re-evaluating it would re-trigger every run
            size = parse_size_to_bytes(idx.get("store.size", "0"))
            if size >= MAX_SIZE_BYTES:
                logging.info("%s at %d bytes >= threshold; triggering", idx["index"], size)
                if await rollover_with_retry(session, f"{HOST}/{ALIAS}/_rollover"):
                    break

if __name__ == "__main__":
    asyncio.run(evaluate_and_trigger())

The maximum an index can overshoot the threshold before this loop catches it is bounded by the ingest rate $R$ and the interval $\Delta t$ between CronJob runs:

$$
\text{overshoot}_{\text{max}} = R \times \Delta t
$$

Set $\Delta t$ so that the overshoot stays well inside your hot-node disk headroom; a one-minute schedule against a 200 MB/s stream allows up to about 12 GB of overshoot (200 MB/s × 60 s). Calibrating that headroom against real ingest is the subject of Configuring index size and age thresholds for rollover.
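
The bound can be turned around to pick a schedule from a headroom budget. The sketch below is illustrative arithmetic only: the function names are hypothetical, units are decimal megabytes and gigabytes, and the 30 GB headroom figure is an assumed example rather than a recommendation.

```python
MB = 10**6
GB = 10**9


def max_overshoot_bytes(ingest_bytes_per_s: float, interval_s: float) -> float:
    """Worst-case growth between two evaluations: R * delta_t."""
    return ingest_bytes_per_s * interval_s


def max_interval_s(headroom_bytes: float, ingest_bytes_per_s: float) -> float:
    """Longest evaluation interval that keeps overshoot within the headroom."""
    return headroom_bytes / ingest_bytes_per_s


print(max_overshoot_bytes(200 * MB, 60) / GB)  # 12.0
print(max_interval_s(30 * GB, 200 * MB))       # 150.0
```

At 200 MB/s, an assumed 30 GB of headroom would tolerate evaluations up to 150 seconds apart, so a one-minute CronJob leaves a comfortable margin.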

6. Schedule it as a Kubernetes CronJob

Run the script on a fixed cadence with overlap prevention, and inject the service-account credentials from a Secret rather than baking them into the image.

YAML

apiVersion: batch/v1
kind: CronJob
metadata:
  name: ism-rollover-trigger
spec:
  schedule: "*/1 * * * *"        # every minute; tighter than the 5m ISM sweep
  concurrencyPolicy: Forbid       # never overlap evaluation cycles
  jobTemplate:
    spec:
      backoffLimit: 1
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trigger
              image: registry.internal/ism-rollover:1.4.0
              envFrom:
                - secretRef: { name: opensearch-rollover-creds }
              env:
                - { name: OPENSEARCH_HOST, value: "https://opensearch.data.svc:9200" }
                - { name: WRITE_ALIAS, value: "logs-write" }

Gotcha: concurrencyPolicy: Forbid keeps scheduled runs from overlapping, and the break in step 5 keeps a single run from firing twice; together they stop evaluation cycles from racing to roll the same alias, so do not rely on either one alone. For the CI/CD packaging of this image, see Python Orchestration Frameworks.

Verification

After the first scheduled run, confirm the trigger is firing against the right alias and that ISM still sees the index as managed. Run these three checks.

Shell

# 1. Did the generation advance past the bootstrap index?
GET _cat/indices/logs-*?v&s=index
# health status index        docs.count store.size
# green  open   logs-000008  102          6.1mb        <- new write index exists

Shell

# 2. Is the freshly created index still managed by the ISM policy?
GET _plugins/_ism/explain/logs-write
# "logs-000008": { "state": { "name": "hot" }, "action": { "failed": false } }

Shell

# 3. In a CCR topology, has the follower replicated the new backing index?
GET _plugins/_replication/logs-000008/_status
# "status": "SYNCING"   -> follower is catching up; "BOOTSTRAPPING" briefly is normal

A healthy result shows the highest-numbered index as the write target in hot state with "failed": false, and — on CCR — a follower status that is SYNCING or SYNCED, never FAILED. If the leader rolled but the follower lags, do not force it forward; let replication settle before scaling down hot-tier capacity, and pair this trigger with Handling async ISM policy execution failures so a stalled follower does not strand the write path.
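
The second check can be automated by scanning the explain output for entries that lack a state or report a failed action. This is a minimal sketch that assumes the response shape shown in the snippet above (index name mapped to state and action objects, possibly alongside scalar summary fields); unhealthy_indices and the sample data are hypothetical.

```python
def unhealthy_indices(explain_response: dict) -> list[str]:
    """Return index names that have no ISM state or whose current action failed."""
    bad = []
    for name, entry in explain_response.items():
        if not isinstance(entry, dict):
            continue  # skip scalar summary fields in the response
        state = (entry.get("state") or {}).get("name")
        failed = (entry.get("action") or {}).get("failed", False)
        if state is None or failed:
            bad.append(name)
    return sorted(bad)


# Hypothetical explain output: one healthy index, one with a failed action.
sample = {
    "logs-000008": {"state": {"name": "hot"}, "action": {"failed": False}},
    "logs-000007": {"state": {"name": "hot"}, "action": {"failed": True}},
    "total_managed_indices": 2,
}
print(unhealthy_indices(sample))  # ['logs-000007']
```

An empty list corresponds to the healthy result described above; anything else is a candidate for the recovery steps in Handling async ISM policy execution failures.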

Common failures

Symptom	Root cause	Fix command
`_rollover` returns `400 illegal_argument`	Alias spans several indices but none is flagged as the write index, so the write target is ambiguous	`GET logs-write/_alias`, then set `is_write_index: true` on a single index via `POST /_aliases`
Script logs “conditions not yet met” while the shard is clearly oversized	The `conditions` block sent to `_rollover` is re-checked server-side and disagrees with your Python threshold; `max_size` counts primary shards only, while `store.size` includes replicas	Send an empty `_rollover` body so the Python threshold is the sole authority, or compare on `pri.store.size`
`403 security_exception` on `_rollover`	Service-account role lacks `indices:admin/rollover`	Grant the action per Security & Access Boundaries; re-run
Generation never advances; loop finds `[]` indices	No backing index carries the alias, so the `_cat/indices` pattern matches nothing	Bootstrap `PUT logs-000001` with `"aliases": {"logs-write": {"is_write_index": true}}`
Roll succeeds but the new index sits `UNASSIGNED`	Hot nodes lack the routing attribute the bootstrapped index requires	Reconcile node attrs per Node Role Allocation; check `_cluster/allocation/explain`

Observability

Route stdout to your central logging pipeline and emit two counters so a silent misfire is visible: opensearch_rollover_triggered_total (incremented only when rolled_over is true) and opensearch_rollover_evaluation_duration_seconds. Alert when the trigger count is zero across a window in which _cat/indices shows the write index exceeding the threshold — that gap means the script is running but not rolling, usually the ambiguous-alias or conditions-mismatch case above. Correlate the script’s timestamps against _plugins/_ism/explain output to separate network latency from threshold misalignment.
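
One way to emit the two counters from a short-lived CronJob pod is to print them in a plain "name value" text form at the end of each run. The sketch below uses only the standard library; TriggerMetrics is a hypothetical helper, and how the lines reach your metrics backend (log-derived metrics, a push gateway, or something else) is left open as an assumption about your pipeline.

```python
import time
from dataclasses import dataclass


@dataclass
class TriggerMetrics:
    triggered_total: int = 0
    evaluation_seconds: float = 0.0

    def record_roll(self, rolled_over: bool) -> None:
        # Count only real rolls, not "conditions not yet met" responses.
        if rolled_over:
            self.triggered_total += 1

    def render(self) -> str:
        return (
            f"opensearch_rollover_triggered_total {self.triggered_total}\n"
            f"opensearch_rollover_evaluation_duration_seconds "
            f"{self.evaluation_seconds:.3f}\n"
        )


metrics = TriggerMetrics()
started = time.monotonic()
metrics.record_roll(rolled_over=True)   # a roll happened
metrics.record_roll(rolled_over=False)  # evaluated, nothing rolled
metrics.evaluation_seconds = time.monotonic() - started
print(metrics.render(), end="")
```

Wiring record_roll to the rolled_over flag, rather than to the HTTP status, is what makes the zero-count alert above meaningful.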

Frequently asked questions

Will the external trigger and the native ISM sweep double-roll the index?

No, provided the call carries a conditions block. Whichever caller fires first creates the new backing index and re-points the alias; the next caller, script or sweep, evaluates its conditions against the fresh, near-empty write index, finds them unmet, and returns "rolled_over": false. The two mechanisms converge on the same alias safely. An unconditional call (empty body) has no such guard and rolls every time it is sent.

Should I send a `conditions` block or an empty body to `_rollover`?

Send an empty body when you want your Python threshold to be the authority — the roll then fires unconditionally. Send a conditions block when you want the script to act only as an early nudge and let the server veto rolls that do not meet the policy’s own thresholds. The script above ships with conditions for safety; drop them for a hard forced roll.
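
The choice can be made explicit in code with a single switch. This is a sketch under stated assumptions: build_rollover_body is a hypothetical helper, the default thresholds simply repeat the values used in step 4, and an empty JSON object is assumed to be treated by the server the same way as an empty body.

```python
def build_rollover_body(force: bool, max_size: str = "50gb", max_age: str = "1d") -> dict:
    """Return the JSON body for _rollover: empty to force, conditions to guard."""
    if force:
        return {}  # no conditions: the server rolls unconditionally
    return {"conditions": {"max_size": max_size, "max_age": max_age}}


print(build_rollover_body(force=False))  # {'conditions': {'max_size': '50gb', 'max_age': '1d'}}
print(build_rollover_body(force=True))   # {}
```

Keeping the switch in one place makes it harder to drop the guard by accident when the script is reused for another alias.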

Why not just lower `plugins.index_state_management.job_interval` instead of scripting this?

Tightening the sweep interval shortens overshoot but applies cluster-wide and adds evaluation load for every managed index, not just the hot write index. An external trigger scoped to one alias gives a one-minute cadence (the finest a CronJob schedule allows) on the index that needs it without taxing the scheduler for the whole cluster.

Related

Configuring index size and age thresholds for rollover: calibrating the byte and age values this script fires on.
Handling async ISM policy execution failures: recovering when a rolled index stalls behind CCR replication.
Python automation for dynamic ISM policy updates: packaging and shipping this trigger through CI/CD.
