
Guide · Cron monitoring

Cron heartbeat patterns that catch silent failures

Exit-code monitoring is fine for the loud failures. The two failure modes that take down your business — jobs that ran but did nothing, and jobs that didn't run at all — need a different shape of probe. This is the cron job monitoring pattern that catches both.

Published 2026-05-22 · ~9 min read · StatusPulse Team

The three failure modes of a cron job

Every team that has been running cron for more than a year has a war story that ends with "and we only found out three weeks later". They almost always trace back to one of three failure modes, and they are not equally easy to catch:

  • (a) Ran and failed loudly. The script exits non-zero, cron emails root, your log aggregator sees a stack trace, your APM fires an alert. This is the easy case. Any half-decent monitoring picks it up.
  • (b) Ran but did nothing useful. The script thinks it succeeded. It exited zero, no exceptions were thrown — but a misconfigured filter selected zero rows, an empty input bucket returned an empty list, an upstream API silently returned a 200 with no data. The backup ran for 0.4 seconds and uploaded an empty tarball. This is the partial-success failure, and it is the most common cause of "we discovered the broken pipeline three weeks later".
  • (c) Didn't run at all. The cron daemon wasn't restarted after the package upgrade. The Kubernetes CronJob's imagePullPolicy picked up a broken image and the pod has been crash-looping for a month. The GitHub Actions runner pool ran out of minutes. The systemd timer never started because the unit file had a typo. This is the killer. The job is just gone and the silence is indistinguishable from "nothing interesting happened".

Mode (a) is the easy 20% of incidents. Modes (b) and (c) are the 80% — and they are exactly what exit-code monitoring misses.

Why exit-code monitoring misses two of them

The default mental model is: "my job exits zero on success, non-zero on failure, and I alert on the non-zeros." That model rests on two unstated assumptions: that a zero exit means the work actually got done, and that the job ran in the first place.

For mode (b): exit-zero only tells you the process didn't throw. It says nothing about whether the process did the thing it was supposed to do. A backup script can cheerfully exit 0 after the find in front of tar returned no files. A reconciliation job can exit 0 after processing zero rows because the cursor was already at the end of the table. An export script can exit 0 after writing an empty CSV because the SQL filter was wrong. Every one of those is an outage and every one of those passes exit-code monitoring.

For mode (c): there is no exit code to monitor when the process never started. If you are watching for non-zero exits, the absence of any exit at all is the loudest possible failure and the quietest possible signal — there is literally nothing to alert on. The job stops running, the inbox stops receiving cron mail, and everything looks fine.

What you need is a probe that flips the polarity: instead of alerting when something bad happened, alert when something good stopped happening. That's a heartbeat.

The heartbeat pattern: absence is the signal

Every probe in a typical monitoring tool — HTTP, TCP, DNS, certificate expiry, database SELECT 1 — is pull-based. The monitor reaches out to your service and checks the response. That's the wrong shape for cron jobs, because the job has no inbound network surface to poll. The heartbeat pattern inverts the flow.

Each cron job gets a unique URL. At the end of every successful run, the job POSTs to that URL. The monitor knows the expected schedule, and it tracks the time since the last successful ping. If the ping doesn't arrive within schedule + grace_period, the monitor flips Down and pages your on-call.

The mechanics fall out of one observation: the absence of a signal is the signal. There is no "I'm failing!" message to send when the cron daemon dies, no exit code to capture when the pod was never scheduled. The only piece of information that survives every failure mode is "a ping we expected didn't arrive", and that's what the heartbeat probe is built around.
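To make the timing rule concrete, here is a minimal monitor-side sketch in Python. It assumes a simple fixed interval rather than a cron expression, and the HeartbeatProbe class and its field names are illustrative, not StatusPulse's implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class HeartbeatProbe:
    interval: timedelta   # expected time between successful pings
    grace: timedelta      # extra slack before the probe flips Down
    last_ping: datetime   # time of the most recent successful ping

    def deadline(self) -> datetime:
        """Latest moment the next ping may arrive before the probe is Down."""
        return self.last_ping + self.interval + self.grace

    def record_ping(self, at: datetime) -> None:
        """A ping resets the clock; an older or repeated ping changes nothing."""
        self.last_ping = max(self.last_ping, at)

    def status(self, now: datetime) -> str:
        """Down is decided purely by silence: no ping before the deadline."""
        return "Down" if now > self.deadline() else "Up"


# Daily job, 30 minutes of grace, last ping at 03:05 UTC the previous day.
probe = HeartbeatProbe(
    interval=timedelta(hours=24),
    grace=timedelta(minutes=30),
    last_ping=datetime(2026, 5, 21, 3, 5, tzinfo=timezone.utc),
)
on_time = probe.status(datetime(2026, 5, 22, 3, 30, tzinfo=timezone.utc))   # before 03:35
too_late = probe.status(datetime(2026, 5, 22, 3, 40, tzinfo=timezone.utc))  # after 03:35
```

Nothing in the sketch depends on the job reporting a failure. The only input is the time of the last successful ping, which is why the same check covers a dead cron daemon and a pod that was never scheduled.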

StatusPulse's Heartbeat probe implements this directly: a 32-byte CSPRNG token per probe, a unique URL of the form https://api.statuspulse.ai/api/heartbeat/<token>, a configurable expected schedule (simple interval or 5-field cron), and a configurable grace period. Each successful POST resets the "next expected" clock. The Heartbeat probe is on the Starter plan and above ($5/mo) — not on Free, where the probe budget is reserved for outbound HTTP / SSL / TCP. If you've used Cronitor or Healthchecks.io, this is the same shape; if you're comparing tools, our UptimeRobot comparison covers how heartbeat support differs across uptime vendors (UptimeRobot has it, but capped and without partial-success payloads).

Concrete recipes

The pattern is uniform: run the job, on success call the heartbeat URL, on failure either skip the call (the absence will alert) or explicitly POST a failure payload. The syntax varies by platform.

Linux cron + curl

The classic case: a daily backup at 03:00 UTC (assuming the host clock is set to UTC, since cron uses the system time zone). The && is load-bearing: curl runs only if the backup exited zero. A semicolon would ping unconditionally, so a backup that failed would still be reported as healthy.

0 3 * * * /usr/local/bin/run-backup && curl -fsS --retry 3 "https://api.statuspulse.ai/api/heartbeat/<token>"

The -fsS flag combo is non-negotiable. -f makes curl exit non-zero on an HTTP error response, 4xx as well as 5xx (without it, curl exits 0 on a server error and the cron line looks fine even though StatusPulse rejected the ping). -s silences the progress meter so cron doesn't email you every night. -S re-enables error output so when something does break, you see it. --retry 3 absorbs transient failures on the path between your server and the receiver: curl retries on timeouts and on HTTP 408, 429, 500, 502, 503, and 504.
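For jobs written in Python rather than shell, the same behavior can be approximated with the standard library. This is a sketch, not an official client: send_ping and ping_with_retry are hypothetical helper names, and unlike curl's --retry (which only retries transient errors) this version retries on any network or HTTP error.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def send_ping(url: str, timeout: float = 10.0) -> None:
    """POST an empty body; urlopen raises HTTPError on 4xx/5xx, like curl -f."""
    request = urllib.request.Request(url, data=b"", method="POST")
    with urllib.request.urlopen(request, timeout=timeout):
        pass


def ping_with_retry(
    url: str,
    attempts: int = 4,               # one try plus three retries, like --retry 3
    backoff_seconds: float = 1.0,
    send: Callable[[str], None] = send_ping,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once a ping succeeds, False if every attempt failed."""
    for attempt in range(attempts):
        try:
            send(url)
            return True
        except (urllib.error.URLError, TimeoutError, ConnectionError):
            if attempt < attempts - 1:
                sleep(backoff_seconds * 2 ** attempt)   # waits 1s, 2s, 4s
    return False
```

Call ping_with_retry only after the job's real work has succeeded, which is the Python counterpart of the && in the crontab line.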

Kubernetes CronJob

Pod-restart loops and bad image pulls can break a CronJob without producing any signal in your APM. Heartbeat catches the silence. Two patterns work — inline curl, or a sidecar that runs after the main container. The inline version is simpler and almost always enough:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: scrape-upstream
spec:
  schedule: "*/15 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: scrape
              image: ghcr.io/acme/scrape:1.4.2
              command: ["/bin/sh", "-c"]
              args:
                - |
                  /app/scrape.sh && \
                  curl -fsS --retry 3 "$HEARTBEAT_URL"
              env:
                - name: HEARTBEAT_URL
                  valueFrom:
                    secretKeyRef:
                      name: statuspulse-heartbeat
                      key: url

Store the URL as a Secret — it's a bearer credential. Anyone with the URL can ping it and keep your probe falsely Up. Treat it like an API key.

GitHub Actions scheduled workflow

Cron monitoring for GitHub Actions is its own sub-problem: Actions has no built-in alerting for a workflow that simply stops running. If GitHub disables your scheduled workflow after 60 days without repository activity (it does this automatically for public repositories), or the account runs out of Actions minutes, your workflow is gone and you find out in the next quarterly review.

name: nightly-export
on:
  schedule:
    - cron: '15 2 * * *'
jobs:
  export:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run export
        run: ./scripts/export.sh
      - name: Ping StatusPulse on success
        if: success()
        run: curl -fsS --retry 3 "$STATUSPULSE_HEARTBEAT_URL"
        env:
          STATUSPULSE_HEARTBEAT_URL: ${{ secrets.STATUSPULSE_HEARTBEAT_URL }}

The if: success() guard is the Actions equivalent of cron's && — the step runs only when every previous step in the job succeeded. Don't use if: always() here unless you also POST an explicit failure payload (see partial-success below), because always() turns the heartbeat into "the workflow file parsed", which is not what you want to assert.

Windows Task Scheduler + PowerShell

Same pattern, different syntax. Register the script below as the action of a scheduled task; Task Scheduler hands stdout and stderr off like any other task.

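$ErrorActionPreference = "Stop"   # turn non-terminating errors into exceptions
try {
    & "C:\Scripts\cleanup.ps1"
} catch {
    $jobError = $_
    # Report the failure explicitly; a failed report must not mask the job's own error.
    try {
        Invoke-WebRequest -Uri $env:HEARTBEAT_URL `
                          -Method POST -ContentType "application/json" `
                          -Body '{"success":false,"message":"cleanup threw"}' `
                          -UseBasicParsing -TimeoutSec 10 | Out-Null
    } catch {
        Write-Warning "Could not report failure: $_"
    }
    throw $jobError
}
# Reached only when cleanup.ps1 succeeded, so a failed ping is never reported as "cleanup threw".
Invoke-WebRequest -Uri $env:HEARTBEAT_URL `
                  -Method POST -UseBasicParsing -TimeoutSec 10 | Out-Null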

Airflow DAGs

Two options. The simplest: a final BashOperator task that fires curl, with trigger_rule=TriggerRule.ALL_SUCCESS so it only runs when every upstream task in the DAG succeeded.

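from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.trigger_rule import TriggerRule

with DAG(
    "nightly_export",
    schedule="15 2 * * *",
    start_date=datetime(2026, 1, 1),  # placeholder date; Airflow 2 needs a start_date to schedule runs
    catchup=False,
) as dag:
    # The trailing space stops Airflow from treating a command ending in ".sh" as a Jinja template file.
    export = BashOperator(task_id="export", bash_command="./export.sh ")
    heartbeat = BashOperator(
        task_id="heartbeat",
        bash_command='curl -fsS --retry 3 "$HEARTBEAT_URL"',
        trigger_rule=TriggerRule.ALL_SUCCESS,  # the default, spelled out for clarity
    )
    export >> heartbeat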

The cleaner option: a DAG-level on_success_callback that hits the URL, with a matching on_failure_callback that POSTs the failure payload. That keeps the DAG graph tidy and gets you the partial-success payload pattern for free.
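A rough sketch of the callback variant follows. It assumes the heartbeat URL is exposed to the Airflow process as a HEARTBEAT_URL environment variable and reuses the {"success": false, "message": ...} failure body from the PowerShell recipe. The function names are hypothetical. The DAG wiring is shown as a comment so the helpers can be read and run without Airflow installed.

```python
import json
import os
import urllib.request
from typing import Optional


def build_ping(url: str, payload: Optional[dict] = None) -> urllib.request.Request:
    """Build a POST request: empty body for success, JSON body otherwise."""
    body = b"" if payload is None else json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(url, data=body, method="POST")
    if payload is not None:
        request.add_header("Content-Type", "application/json")
    return request


def heartbeat_on_success(context: dict) -> None:
    """DAG-level on_success_callback: the whole run succeeded, so ping."""
    url = os.environ["HEARTBEAT_URL"]
    urllib.request.urlopen(build_ping(url), timeout=10).close()


def heartbeat_on_failure(context: dict) -> None:
    """DAG-level on_failure_callback: report the failed run explicitly."""
    url = os.environ["HEARTBEAT_URL"]
    run_id = context.get("run_id", "unknown")
    payload = {"success": False, "message": f"run {run_id} failed"}
    urllib.request.urlopen(build_ping(url, payload), timeout=10).close()


# Wiring, inside the DAG file:
# with DAG("nightly_export", schedule="15 2 * * *", catchup=False,
#          on_success_callback=heartbeat_on_success,
#          on_failure_callback=heartbeat_on_failure) as dag:
#     ...
```

If the failure callback itself cannot reach the receiver, nothing is lost: the missing success ping still trips the grace period.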

Partial-success POSTs for "ran but processed nothing"

The recipes above catch mode (c) — the job didn't run. Mode (b) — the job ran but processed zero of the thousand items it should have — needs one more piece. Empty POST bodies say "I'm alive", but they say nothing about whether the run was actually useful. The fix is to send a tiny JSON payload with the run's outcome counts.

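#!/usr/bin/env bash
set -euo pipefail   # a failed import aborts the script before any ping is sent

RESULT=$(/usr/local/bin/run-import)   # prints e.g. "processed=842 skipped=12"
# Extract the count (needs GNU grep for -P); fall back to 0 if the field is missing.
PROCESSED=$(echo "$RESULT" | grep -oP 'processed=\K\d+' || echo 0)

curl -fsS --retry 3 \
  -H "Content-Type: application/json" \
  -d "{\"processed\":$PROCESSED}" \
  "https://api.statuspulse.ai/api/heartbeat/<token>"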

With Capture payload enabled on the probe (it's off by default — turn it on intentionally because the body is stored up to 4 KB, and you don't want PII in there), the receiver keeps the JSON. You can then either eyeball the recent payloads in the probe's history, or wire a downstream rule: if the processed count is below a threshold for N consecutive runs, POST to …/<token>/fail from the job itself and the probe flips Down with the message attached.

The pattern that works in production: every job ends with either a success payload (with counts) or, if the counts are implausibly low, an explicit failure payload. The /fail endpoint flips the probe Down immediately with the supplied message surfaced in the alert. That covers mode (b) properly — a job that ran and processed zero rows tells you so, in the same channel as a job that crashed.
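The success-or-fail decision can be sketched as a small helper. This is an illustration under assumptions: the failure endpoint is taken to be the heartbeat URL with /fail appended, the "message" field name is borrowed from the PowerShell recipe, and choose_heartbeat and its minimum threshold are hypothetical names.

```python
import json
from typing import Tuple


def choose_heartbeat(base_url: str, processed: int, minimum: int) -> Tuple[str, bytes]:
    """Pick the endpoint and JSON body that match this run's outcome."""
    if processed < minimum:
        # Implausibly low count: report an explicit failure with a reason.
        body = {
            "processed": processed,
            "message": f"processed {processed}, expected at least {minimum}",
        }
        return base_url.rstrip("/") + "/fail", json.dumps(body).encode("utf-8")
    # Plausible count: normal success ping carrying the count.
    return base_url, json.dumps({"processed": processed}).encode("utf-8")
```

The job then POSTs the returned body to the returned URL with Content-Type: application/json. Keeping the threshold in the job, next to the code that knows what a normal run looks like, means a zero-row run is reported the same night rather than discovered in the payload history later.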

Grace windows and clock drift

The grace period is the window between "expected ping time" and "we flip Down". Setting it too tight is the most common way teams sour on heartbeat monitoring — they get false-Down alerts every other week, learn to ignore the alerts, and miss the real one.

Real-world cron has jitter. A backup that takes 5 minutes most nights can take 25 the night a large customer onboarded. Kubernetes CronJobs add pod-scheduling, image-pull, and node-pressure delays that can run to minutes on a cold node. GitHub Actions runners can take 2-5 minutes to pick up a job under load. The cron daemon itself drifts — clock-skewed VMs on a busy hypervisor can be a minute off, and DST shifts local-time crons by an hour twice a year.

Useful defaults that don't false-alarm:

  • Daily / nightly jobs (backups, ETLs): 10-60 minutes of grace. The job's actual runtime variance is the floor; add a comfortable margin for runtime regressions and infrastructure jitter.
  • Hourly jobs: 5-10 minutes.
  • Every-15-minutes jobs: 2-5 minutes.
  • High-frequency (every 1-5 minutes): 60-120 seconds. Looser than that is fine if you also POST partial-success counts: the payload catches runs that did nothing, so the timing tolerance only has to catch runs that never happened.
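One way to turn "runtime variance is the floor" into a number is to derive the grace period from recent run durations. The heuristic below is an assumption for illustration, not a StatusPulse feature: it takes the gap between the slowest and the typical run and multiplies it by a safety margin.

```python
from statistics import median
from typing import Sequence


def suggest_grace_minutes(
    runtimes_min: Sequence[float],
    margin: float = 2.0,      # multiplier for regressions and infrastructure jitter
    floor_min: float = 2.0,   # never suggest less than this
) -> float:
    """Grace = (slowest run - typical run) * margin, with a lower bound."""
    if not runtimes_min:
        raise ValueError("need at least one observed runtime")
    spread = max(runtimes_min) - median(runtimes_min)
    return max(floor_min, spread * margin)


# A week of nightly backups in minutes, including one slow 25-minute night.
nightly = [5, 6, 5, 7, 25, 5, 6]
grace = suggest_grace_minutes(nightly)   # (25 - 6) * 2.0 = 38.0 minutes
```

For this sample the suggestion lands inside the 10-60 minute range given for nightly jobs. Treat the output as a starting point and widen it if the probe still false-alarms.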

Schedule the cron expression itself in UTC and stop thinking about local time for infrastructure. DST shifts have caused more 4am pages than they have caught real incidents.

What to alert on, what to ignore

Three rules that survive contact with a real on-call rotation:

  • Alert on grace-period silence. This is the whole point of the probe. The ping didn't arrive within expected + grace — page someone. The probe goes Down, your watchers get the usual email / Slack / Teams / SMS notification, and the incident exists in the status page record.
  • Ignore single failed POSTs unless N in a row. A transient network blip between your runner and the receiver should not page you. The retry flag (--retry 3 on curl, equivalent on every other client) absorbs the blip. Set the alert threshold on missed schedules, not on individual HTTP failures from the job to the receiver.
  • Dedup duplicate runs from retries. If your workflow retries on failure (Actions lets you re-run failed jobs, Kubernetes Jobs have backoffLimit, Airflow tasks take a retries parameter), several pings can arrive within the same scheduled window. That's fine: the receiver treats them as idempotent, and the probe stays Up after the first one. The rate limit (200 requests per 10 minutes per token-hash) is the only ceiling, and you should not be anywhere near it.

One pattern to avoid: alerting the on-call on every individual job failure (the /fail POST) and also on missed schedules. That doubles the noise. Pick one channel for "the job tried and failed" and one for "the job didn't try" — most teams route the two to different severities, with missed schedules being the higher one because mode (c) is harder to recover from than mode (a).

Wrap-up

Exit-code monitoring is fine for the loud failures. The silent ones (the empty backup, the dead cron daemon, the disabled GitHub Actions workflow, the crash-looping CronJob) need a probe that asserts "something good happened recently" rather than "nothing bad happened". Heartbeats are that probe, and the payload-extended version (partial-success POSTs) also covers mode (b), the job that ran but did nothing useful.

The recipe is uniform across platforms: run the job, on success call a unique URL, on failure either skip the call or POST an explicit failure. The differences across cron, Kubernetes, Actions, Task Scheduler, and Airflow are syntactic. The discipline is in the details: && not ;, -fsS on every curl, a grace period that respects real-world jitter, and treating the URL like the bearer credential it is. Once it's wired, the silent-failure class of incident collapses — you stop discovering broken pipelines three weeks late, and the discovery moves to the next morning, where it belongs.

Try StatusPulse's Heartbeat probe

5 probes free; Heartbeat probe from Starter ($5/mo). US or EU host — you choose.