Guide · HTTP probe
How to monitor a REST API the right way
curl -I https://api.example.com/health returns 200 OK
and your monitor stays green for hours while the database is
unreachable, the checkout endpoint times out, and the build that
went out last night still says it's the previous version. Real
REST API monitoring assumes the status code is the easiest thing
to fake — and asserts on everything else too.
Why status-code-only checks miss half the failures
The default uptime monitor (anybody's uptime monitor) does one
thing: it sends a GET, reads the response code, and
calls anything starting with a 2 healthy. That
covers the case where the server is hard down. It covers only
about half of the real outages that page people in the middle of
the night.
The other half look like this:
- 200 OK with broken JSON. A serializer change shipped, and your API now returns {"status:"ok"} (missing quote) on the health endpoint. The status code is fine. Every consumer that tries to parse the body explodes.
- 200 OK in degraded mode. The health endpoint catches the database exception, logs it, and returns {"status":"degraded","db":"down"} with a 200 so the load balancer keeps the pod in rotation. Customers see 503s on every other endpoint. Your monitor doesn't.
- Stale deploy. The rollout claimed to succeed, but one replica is still serving last week's build. Half your requests succeed, half fail with "unknown field". The health endpoint, served by both replicas, returns 200 either way.
- Slow drift. p50 latency has crept from 80 ms to 1.4 s over a week because an index is missing. Nothing is "down". Customers churn anyway.
- Partial outage. /health returns 200 because it doesn't touch the database. /api/orders times out because it does. The status-code-only probe will cheerfully report "all green" through the entire incident.
All five of these are everyday failures. None of them trip a status-code check. Treat the response code as the floor of what you assert on, not the ceiling.
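As a rough illustration of treating the status code as the floor, the Python sketch below accepts a response only if the code is 2xx, the body parses as JSON, and the status field says ok. The function name `body_is_healthy` is made up for this example and is not part of any StatusPulse API.

```python
import json


def body_is_healthy(status_code: int, body: str) -> bool:
    """Return True only when the status code AND the body both look healthy."""
    if not 200 <= status_code < 300:
        return False  # the floor: a non-2xx code is always a failure
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False  # the "200 OK with broken JSON" case
    if not isinstance(payload, dict):
        return False  # valid JSON, but not the object we expect
    # The "200 OK in degraded mode" case fails here.
    return payload.get("status") == "ok"
```

With this check, the broken-JSON and degraded-mode responses from the list above are both reported as failures even though each arrives with a 200.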
The five layers of an HTTP check
Every HTTP request to your API moves through five layers before the response code even exists. Any one of them can fail independently, and the failure mode tells you what to look at.
- DNS resolves. api.example.com turns into an IP. A misconfigured DNS record, an expired domain, a Route 53 outage, or a typo in a CNAME all stop the request before a single byte hits the wire.
- TCP connects. The probe opens a socket to that IP on port 443. Firewalls, load-balancer health failures, kernel SYN_RECV queue overflows, and the classic "security group forgot to allow the new region" all live here.
- TLS handshakes. Certificate validity, hostname match, chain trust, protocol version, cipher overlap. A cert that expired at midnight UTC takes down every probe worldwide at the same instant.
- HTTP responds. The server returns a status line and headers. Anything in 5xx is a server failure; anything in 4xx is usually a client / contract failure; 408 and 504 are timeouts in disguise.
- Body asserts. The body actually contains what you expect. This is the only layer that catches semantic failures: the 200 OK that lies.
A useful probe distinguishes between them. If your monitor just says "Down", you have to dig. If it says "TLS handshake failed: certificate expired 2 hours ago", you know exactly which engineer to wake up. StatusPulse stores the failure category per check so you can chart "DNS failures this week" separately from "5xx spikes this week" — they're usually different problems.
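One way to get per-layer failure categories is to walk the layers one at a time instead of making a single opaque request. The sketch below uses only the Python standard library. The function names and the category strings ("dns", "tcp", "tls", "timeout", "server", "client") are illustrative assumptions, not StatusPulse's internal categories.

```python
import socket
import ssl
from typing import Optional


def first_failing_layer(host: str, port: int = 443, timeout: float = 5.0) -> Optional[str]:
    """Walk DNS -> TCP -> TLS and name the first layer that fails (None if all pass)."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns"  # the name never turned into an IP
    ip_and_port = infos[0][4][:2]
    try:
        sock = socket.create_connection(ip_and_port, timeout=timeout)
    except OSError:
        return "tcp"  # refused, filtered, or timed out before a socket opened
    try:
        context = ssl.create_default_context()  # verifies chain and hostname
        with context.wrap_socket(sock, server_hostname=host):
            return None  # handshake completed
    except (ssl.SSLError, OSError):
        return "tls"  # expired cert, hostname mismatch, protocol mismatch, ...
    finally:
        sock.close()


def http_failure_kind(status: int) -> Optional[str]:
    """Classify the HTTP layer the way the list above describes it."""
    if status in (408, 504):
        return "timeout"
    if 500 <= status <= 599:
        return "server"
    if 400 <= status <= 499:
        return "client"
    return None
```

Only the first resolved address is tried here, which keeps the sketch short. A production probe would likely try every address and record the TLS error detail as well.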
What to assert on the body
The body assertion is where status-code-only monitoring stops and real REST API monitoring starts. Four patterns cover most production needs.
JSON value matching
The default and most-used pattern: your health endpoint returns a small JSON document, and you assert on the value of a specific field.
$ curl -s https://api.example.com/health
{"status":"ok","db":"ok","cache":"ok","build":"v1.42.0"}
Assert that the body contains the literal substring
"status":"ok". When the API switches to
"status":"degraded" on a database failure — which is
the right thing for the API to do, by the way, so the load
balancer can keep serving cached reads — your probe flips Down.
A status-code-only probe would not.
Build hash for deploy verification
Embed the build hash in the health response and assert on it
after a deploy. The classic use case: you cut a release, want to
confirm every region is actually serving the new code, and don't
want to manually curl seventeen edges.
$ curl -s https://api.example.com/health | jq .build
"sha-7c3a9f1"
Add the new SHA to the body assertion. Any region still serving the old build trips Down within one probe interval. Reverse the pattern for a rollback: assert on the previous hash until the rollback is confirmed everywhere.
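The same check can be expressed as a small Python sketch that takes one health body per region and lists the regions not yet serving the expected build. The region names and the second hash in the usage note are invented for illustration; only sha-7c3a9f1 comes from the example above.

```python
import json


def stale_regions(bodies_by_region: dict[str, str], expected_build: str) -> list[str]:
    """Return the regions whose health body does not report expected_build."""
    stale = []
    for region, body in sorted(bodies_by_region.items()):
        try:
            build = json.loads(body).get("build")
        except (json.JSONDecodeError, AttributeError):
            build = None  # unparseable or non-object body counts as stale
        if build != expected_build:
            stale.append(region)
    return stale
```

Called with {"eu-west": '{"build":"sha-7c3a9f1"}', "us-east": '{"build":"sha-0000000"}'} and the expected hash sha-7c3a9f1, it returns ["us-east"].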
Version string for rollback detection
Same idea, lower resolution. "version":"1.42.0" in
the body, assertion on the substring, alert fires the moment a
replica drops back to 1.41.x. Useful when you want
human-readable versions rather than commit SHAs on your status
page.
Regex for warning markers
Some services don't switch status:ok to
status:degraded; they add a warnings
array instead. Treat the presence of any element in that array as a
Degraded signal even if the overall status is still ok:
{"status":"ok","warnings":["queue_lag_high"]}
A negative assertion ("body must not contain
warnings") catches the moment that array appears,
provided the service omits the key when there is nothing to
report. If it always emits "warnings":[], use a regex that
matches only a non-empty array. Either way, the probe stays green
when warnings clear and flips to Degraded the moment they appear.
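A regex that matches only a non-empty warnings array might look like the sketch below. The pattern is an assumption about how the JSON is serialized (a warnings key followed directly by an array), so adjust it to your own payload.

```python
import re

# Matches "warnings": [ followed by anything other than whitespace or "]",
# i.e. an array with at least one element.
NON_EMPTY_WARNINGS = re.compile(r'"warnings"\s*:\s*\[\s*[^\]\s]')


def has_warnings(body: str) -> bool:
    """True when the body carries a non-empty warnings array."""
    return NON_EMPTY_WARNINGS.search(body) is not None
```

Used as a negative assertion, the probe is Degraded whenever has_warnings returns True and green when the array is empty or absent.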
Whatever you assert on, pick a stable marker. Don't assert on timestamps, session IDs, CSRF tokens, or anything that varies per request. The HTTP probe documentation covers the body-size limit (64 KB) and the case-sensitivity rule — if your marker lives past 64 KB into a response, the probe never sees it.
Headers matter, both ways
Headers are where API monitoring gets opinionated. Four header categories deserve attention.
Auth tokens for protected endpoints
Plenty of teams expose /health publicly. Plenty of
others — for very good reasons — don't. If your health endpoint
is behind a bearer token, the probe needs to send it:
$ curl -H "Authorization: Bearer eyJhbGciOi…" \
https://api.example.com/health
Use a long-lived, narrowly-scoped probe credential rather than a user JWT — the probe will replay this header every five minutes forever, and you don't want it tied to a person who might leave the company. Store the value in the probe's encrypted-headers field, never in the URL.
X-Probe for upstream identification
Send a constant header identifying the request as a monitor, not a real client:
X-Probe: statuspulse
X-Request-Source: monitoring
Your APM can filter probe traffic out of latency percentiles. Your access logs let you grep for probe-only failures. Rate limiters can exempt the probe so a transient 429 doesn't page on-call. A one-line header is the cheapest distinguishing signal you will ever ship.
Correlation IDs for log search
Send a header like X-Probe-Run: $RUN_ID so the
server can echo it into its logs. When a check fails, the probe
response error message contains the ID; pasting it into your log
search jumps straight to the request that broke. This is the
difference between "the monitor went red at 03:14" and "the
monitor went red at 03:14 because of this request".
Content negotiation
Send Accept: application/json explicitly if you
assert on JSON. A server that defaults to HTML when no
Accept is supplied (older Rails, default Spring) will
happily return <!doctype html> to a probe that
forgot to ask for JSON, and your "status":"ok"
assertion will always fail, in a confusing way.
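Putting the four header categories together, a probe request could be assembled as in the Python sketch below. The helper name `build_probe_request` is hypothetical, and the token and run ID are passed in by the caller so that no secret is hard-coded.

```python
import urllib.request


def build_probe_request(url: str, token: str, run_id: str) -> urllib.request.Request:
    """Assemble a GET request carrying auth, probe identity, correlation ID, and Accept."""
    headers = {
        "Authorization": f"Bearer {token}",  # probe credential, never in the URL
        "X-Probe": "statuspulse",            # identifies monitor traffic
        "X-Request-Source": "monitoring",
        "X-Probe-Run": run_id,               # echoed into server logs for search
        "Accept": "application/json",        # ask for JSON explicitly
    }
    # Note: urllib stores header names via str.capitalize(), e.g. "X-probe".
    # HTTP header names are case-insensitive, so servers treat them the same.
    return urllib.request.Request(url, headers=headers, method="GET")
```

A caller might read the token from an environment variable and generate the run ID with uuid.uuid4().hex, then pass the request to urllib.request.urlopen with a timeout.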
Pick the right method
Almost every probe should be a GET. The exceptions
are interesting enough to know about.
- GET for idempotent health. The default. Cheap, cacheable (which you usually want to disable; see pitfalls), safe to call from anywhere.
- POST to exercise the write path. A read-only health check confirms the API is up; it doesn't confirm writes work. A POST /probe endpoint that runs a one-row insert into a probe table, then deletes it, will catch the read-only-replica-promoted-as-primary case a GET will not. Pair it with a body assertion on the inserted row's ID echo.
- HEAD for cheap availability. When the body is large (download endpoints, file redirectors), HEAD gets you the status, the content-length, and the headers without the bandwidth. Useful for monitoring static asset reachability without downloading a gigabyte every five minutes.
- OPTIONS for CORS sanity. If your frontend talks to api.example.com from app.example.com, a CORS preflight failure breaks the product even though direct curl works fine. An OPTIONS probe with an Origin: https://app.example.com header and an assertion on the Access-Control-Allow-Origin header in the response catches that.
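For the OPTIONS case, the check on the preflight response reduces to a comparison on one response header. A minimal Python sketch, assuming the response headers are already available as a dict (the function name is illustrative):

```python
def preflight_allows(response_headers: dict[str, str], origin: str) -> bool:
    """True when Access-Control-Allow-Origin admits the given origin."""
    # Header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value.strip() for name, value in response_headers.items()}
    allowed = lowered.get("access-control-allow-origin")
    # "*" admits any origin for non-credentialed requests; a credentialed
    # frontend needs the exact origin echoed back.
    return allowed in (origin, "*")
```

A missing header returns False, which is the failure the frontend would see as a blocked request.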
For services that don't speak HTTP at all — gRPC backends in
particular — an HTTP /healthz in front of the gRPC
process lies systematically: the HTTP layer can be healthy while
the gRPC service is rejecting RPCs. Use the
gRPC Health probe guide
instead. HTTP probes are for HTTP services, full stop.
Latency thresholds — not just up/down
Binary up/down monitoring misses the most common production failure mode: things slowing down. A REST API that took 120 ms last week and 1.8 s this week is in trouble, even if every response is technically 200 OK.
Three habits worth adopting.
- Use a Degraded state, not just Down. StatusPulse fires Degraded when the response is fine but the latency exceeded your threshold. Treat Degraded as "look at this, don't wake anyone yet" and Down as "page someone". A 15-minute Degraded streak that doesn't recover is almost always the start of an incident.
- Tolerate cold-start spikes. Serverless APIs and autoscaled fleets will hit cold paths occasionally. A single 4-second response after 10 minutes of idle isn't an outage. Use a consecutive-checks rule (StatusPulse's default: 2 consecutive bad checks before flipping state) so one cold start doesn't page anyone.
- Watch p50 and p99 separately. p50 climbing from 80 ms to 200 ms means everyone is slower. p99 climbing from 400 ms to 4 s with stable p50 means a fraction of requests hit something pathological — a missing index on a rare code path, a noisy neighbour, a tail latency from a downstream. Both matter. Both deserve their own threshold.
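The consecutive-checks rule can be sketched as a small state holder that changes state only after the same new observation has been seen a set number of times in a row. This version applies the threshold to recovery as well as to failure, which is an assumption of the sketch and not a documented StatusPulse behaviour. The class name is made up for the example.

```python
class CheckState:
    """Flip state only after `threshold` consecutive identical observations."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold
        self.state = "up"
        self._candidate = "up"
        self._streak = 0

    def record(self, observed: str) -> str:
        """Feed one check result ("up", "degraded", "down"); return the current state."""
        if observed == self.state:
            # Back to the current state: any pending streak is cancelled.
            self._candidate = observed
            self._streak = 0
            return self.state
        if observed == self._candidate:
            self._streak += 1
        else:
            self._candidate = observed
            self._streak = 1
        if self._streak >= self.threshold:
            self.state = observed
            self._streak = 0
        return self.state
```

Feeding it "degraded", "up", "degraded", "degraded" yields "up", "up", "up", "degraded": the single cold-start spike is ignored and only the second consecutive slow check flips the state.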
A 1-minute probe interval gives you 1,440 samples per day per region, plenty for a stable p99 over a 24-hour window. With fewer samples than that, percentiles get noisy.
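To track the two percentiles separately, Python's statistics.quantiles is enough for a day of samples. The sketch below uses the "inclusive" method, which is one of several reasonable percentile definitions; the function name is illustrative.

```python
import statistics


def p50_p99(latencies_ms: list[float]) -> tuple[float, float]:
    """Return (p50, p99) for a window of latency samples in milliseconds."""
    # n=100 yields 99 cut points; index 49 is the 50th percentile,
    # index 98 is the 99th.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return cuts[49], cuts[98]
```

For a window of 1,440 samples where 1,400 take 80 ms and 40 take 4 s, this returns a p50 of 80.0 and a p99 of 4000.0: the median looks healthy while the tail does not, which is why each needs its own threshold.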
Common pitfalls that bite in production
Five mistakes that cost teams real incidents. Read them once and you'll avoid most of the embarrassment.
Probing through a CDN cache
If api.example.com sits behind Cloudflare or
CloudFront and the health endpoint isn't explicitly excluded
from caching, your probe is monitoring the CDN cache, not the
origin. The origin can be on fire for 30 minutes and the cached
200 OK {"status":"ok"} keeps your monitor green
until the TTL expires.
Two fixes. Either send Cache-Control: no-cache on
the probe request and configure the CDN to respect it; or expose
a separate /health/origin path with a CDN rule that
bypasses the cache for that exact URL. The probe's job is to
tell you about the origin. Don't let a middle layer silence it.
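A probe can also flag responses that were served from a cache. The sketch below looks at three response headers: the standard Age header, Cloudflare's CF-Cache-Status, and CloudFront's X-Cache. Treat the exact header values as assumptions to verify against your own CDN's responses; the function name is illustrative.

```python
def looks_cached(response_headers: dict[str, str]) -> bool:
    """Heuristic: True when the response appears to come from a CDN cache."""
    h = {name.lower(): value.strip().lower() for name, value in response_headers.items()}
    if h.get("cf-cache-status") == "hit":
        return True  # Cloudflare served it from cache
    if h.get("x-cache", "").startswith("hit"):
        return True  # e.g. "Hit from cloudfront"
    age = h.get("age", "0")
    # A positive Age means an intermediary has held the response that long.
    return age.isdigit() and int(age) > 0
```

If a health probe ever sees this return True, the cache bypass rule is not doing its job and the probe is no longer measuring the origin.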
Health endpoints that don't touch the database
A startling number of production /healthz endpoints
return {"status":"ok"} from a hard-coded handler
that doesn't read or write anything. The pod responds, so
Kubernetes leaves it in rotation. The database is unreachable,
so every real endpoint 500s. The monitor stays green.
A healthcheck should touch every dependency that, if broken,
breaks the product. At minimum: a SELECT 1 against
the primary database. Adding the same query against the cache,
the queue, and any downstream API is worth the extra few
milliseconds. The
Postgres monitoring
guide covers what to assert on the database side once you've
got that wired up.
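A health handler that actually touches its dependencies can be as small as the Python sketch below. It runs one callable per dependency and reports each result in the body. SQLite stands in for the primary database so the example runs anywhere, and the names `health_body` and `db_check` are hypothetical.

```python
import json
import sqlite3
from typing import Callable


def health_body(checks: dict[str, Callable[[], None]]) -> str:
    """Run every dependency check and build the JSON health body."""
    results = {}
    for name, check in checks.items():
        try:
            check()  # a check signals failure by raising
            results[name] = "ok"
        except Exception:
            results[name] = "down"
    healthy = all(value == "ok" for value in results.values())
    body = {"status": "ok" if healthy else "degraded", **results}
    return json.dumps(body, separators=(",", ":"))


def db_check(conn: sqlite3.Connection) -> Callable[[], None]:
    """Build a check that runs SELECT 1 against the given connection."""
    def check() -> None:
        conn.execute("SELECT 1").fetchone()
    return check
```

Which status code to return alongside a degraded body is left to the caller, since that depends on whether the load balancer should keep the pod in rotation.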
Not following redirects when you should
You probe http://example.com. The server redirects
to https://example.com. Your probe records a 301,
calls it healthy, and never notices when the HTTPS endpoint
itself starts returning 500. Either probe the final HTTPS URL
directly, or enable follow-redirects so the probe asserts on
the actual destination. Don't measure the redirect step.
The mirror failure: probing the HTTPS URL but having a body assertion that matches the redirect page's body. The redirect keeps working forever, the assertion keeps passing forever, and you never notice the real endpoint is broken. Always assert on content from the page you actually care about.
Rate-limit collisions at high frequency
A 30-second probe interval across 4 regions is 11,520 hits per day (2,880 per region). If those hits count against a single 10,000-request-per-day rate limit, the API will start returning 429s before the day is over, your monitor flips Down, and the on-call engineer wakes up to discover the outage is the probe itself.
Either raise the API's rate limit for the probe's source IPs,
exempt the X-Probe header from the limiter
entirely, or lengthen the probe interval to a sustainable number.
Don't argue with the rate limiter at 03:00.
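The arithmetic is worth running before choosing an interval. The sketch below assumes every region's requests count against one shared daily limit; the function names are illustrative.

```python
import math

SECONDS_PER_DAY = 86_400


def probe_hits_per_day(interval_seconds: int, regions: int) -> int:
    """Total probe requests per day across all regions."""
    return (SECONDS_PER_DAY // interval_seconds) * regions


def min_interval_seconds(daily_limit: int, regions: int) -> int:
    """Shortest whole-second interval that keeps total hits within daily_limit."""
    per_region_budget = daily_limit // regions
    return math.ceil(SECONDS_PER_DAY / per_region_budget)
```

For the numbers above, probe_hits_per_day(30, 4) is 11,520, and min_interval_seconds(10_000, 4) is 35: a 35-second interval lands at 9,872 hits per day, just under the limit, with no headroom for retries.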
Probing the wrong layer when the user-visible failure is elsewhere
/health on the API is necessary but not sufficient.
Add probes for the workflows users actually care about —
login, checkout, the search endpoint. A single probe on
/api/orders with a body assertion on the expected
shape will tell you about the partial outages a /health
probe is structurally blind to. StatusPulse's free plan
includes five probes — that's enough to cover health, login,
and three critical read endpoints on the same status page.
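An assertion on the expected shape can be sketched as a check that the body is a JSON object with the right keys and value types. The orders and total fields in the usage note are invented for illustration; substitute whatever your /api/orders response actually contains.

```python
import json


def has_shape(body: str, required: dict[str, type]) -> bool:
    """True when body is a JSON object whose required keys have the given types."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False  # HTML error page, truncated body, and so on
    if not isinstance(payload, dict):
        return False
    return all(isinstance(payload.get(key), kind) for key, kind in required.items())
```

For example, has_shape(body, {"orders": list, "total": int}) passes for an empty but well-formed result and fails for an error payload or an HTML error page.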
Wrap-up
Four things to take away:
- Status code 200 is the easiest signal to fake, accidentally or otherwise. Assert on the body. Pick a stable marker ("status":"ok", a build hash, a version string, a warnings-absent regex) and let the body do the talking.
- Every HTTP check has five layers: DNS, TCP, TLS, HTTP, body. Failure looks different at each one. A probe that distinguishes between them saves you the first ten minutes of every incident.
- Pick the right method. GET for most things, POST when you want to exercise the write path, OPTIONS for CORS, HEAD for big payloads. Send headers that identify the probe and let your logs filter it out.
- Don't probe through a cache. Don't probe a health endpoint that doesn't touch the database. Don't argue with the rate limiter. The pitfalls list is short, and every item on it has cost someone real downtime.
StatusPulse's HTTP probe ships with all of this baked in: five-layer failure categorisation, optional body assertions with 64 KB scan budget, custom headers stored encrypted at rest, Degraded latency thresholds, multi-region quorum, and a public status page in front of it. Free tier covers five probes, which is enough to monitor a small REST API the right way before you spend a dollar.