{"id":254560,"date":"2025-12-27T19:57:09","date_gmt":"2025-12-27T19:57:09","guid":{"rendered":"https:\/\/www.europesays.com\/ie\/254560\/"},"modified":"2025-12-27T19:57:09","modified_gmt":"2025-12-27T19:57:09","slug":"the-3-a-m-call-that-changed-the-way-i-design-apis","status":"publish","type":"post","link":"https:\/\/www.europesays.com\/ie\/254560\/","title":{"rendered":"The 3 a.m. Call That Changed The Way I Design APIs"},"content":{"rendered":"<p>At 3:17 a.m. on a Tuesday, my phone buzzed with the alert that would reshape the way I think about API design.<\/p>\n<p>Our customer-facing API had stopped responding. Not slowly degrading; it was completely dead. Three downstream services went with it. By the time I got to my laptop, customer support tickets were flooding in.<\/p>\n<p>The root cause? A single database replica had gone down, and our API had no fallback. One failure cascaded into total unavailability. I spent the next four hours manually rerouting traffic while our customers waited.<\/p>\n<p>That night cost us $14,000 in service-level agreement (SLA) credits and a lot of trust. But it taught me something I now apply to <a href=\"https:\/\/thenewstack.io\/why-api-first-matters-in-an-ai-driven-world\/\" class=\"local-link\" rel=\"nofollow noopener\" target=\"_blank\">every API I build<\/a>: Every design decision should pass what I call \u201cThe 3 a.m. Test.\u201d<\/p>\n<p><strong>The 3 a.m. Test<\/strong><\/p>\n<p>The test is simple: When this system breaks at 3 a.m., will the <a href=\"https:\/\/thenewstack.io\/holiday-on-call-duty-a-present-or-punishment\/\" class=\"local-link\" rel=\"nofollow noopener\" target=\"_blank\">on-call engineer<\/a> be able to diagnose and fix it quickly?<\/p>\n<p>This single question has eliminated a surprising number of \u201cclever\u201d design choices from my architectures:<\/p>\n<ul>\n<li aria-level=\"1\">Clever error codes that require documentation lookup? 
Fail.<\/li>\n<li aria-level=\"1\">Implicit state that depends on previous requests? Fail.<\/li>\n<li aria-level=\"1\">Cascading failures that take down unrelated features? Fail.<\/li>\n<\/ul>\n<p>After that incident, I rebuilt our API infrastructure from the ground up. Over the next three years, handling 50 million daily requests, I developed five principles that transformed our reliability from 99.2% to 99.95% and let me sleep through the night.<\/p>\n<p><strong>Principle 1: Design for Partial Failure<\/strong><\/p>\n<p>Six months after the initial incident, we had another outage. This time, a downstream payment processor went unresponsive. Our API dutifully waited for responses that never came, and request threads piled up until we crashed.<\/p>\n<p>I realized we\u2019d solved one problem but created another. We needed systems that degraded gracefully instead of failing catastrophically.<\/p>\n<p>Here\u2019s what we built:<\/p>\n<p>\nclass ResilientServiceClient:&#13;<br \/>\n    def __init__(self, primary_url, fallback_url):&#13;<br \/>\n        self.primary = primary_url&#13;<br \/>\n        self.fallback = fallback_url&#13;<br \/>\n        self.circuit_breaker = CircuitBreaker(&#13;<br \/>\n            failure_threshold=5,&#13;<br \/>\n            recovery_timeout=30&#13;<br \/>\n        )&#13;<br \/>\n    &#13;<br \/>\n    async def fetch(self, request):&#13;<br \/>\n        # Try primary with circuit breaker protection&#13;<br \/>\n        if self.circuit_breaker.is_closed():&#13;<br \/>\n            try:&#13;<br \/>\n                response = await self.call_with_timeout(&#13;<br \/>\n                    self.primary, request, timeout_ms=500&#13;<br \/>\n                )&#13;<br \/>\n                self.circuit_breaker.record_success()&#13;<br \/>\n                return response&#13;<br \/>\n            except (TimeoutError, ConnectionError):&#13;<br \/>\n                self.circuit_breaker.record_failure()&#13;<br \/>\n        &#13;<br \/>\n        # 
Fall back to secondary&#13;<br \/>\n        try:&#13;<br \/>\n            return await self.call_with_timeout(&#13;<br \/>\n                self.fallback, request, timeout_ms=1000&#13;<br \/>\n            )&#13;<br \/>\n        except Exception:&#13;<br \/>\n            # Return degraded response rather than error&#13;<br \/>\n            return self.degraded_response(request)<\/p>\n<p>\t\t\t\t&#13;<\/p>\n<tr class=\"crayon-row\">&#13;<\/p>\n<td class=\"crayon-nums \" data-settings=\"show\">&#13;<\/p>\n<p>1<\/p>\n<p>2<\/p>\n<p>3<\/p>\n<p>4<\/p>\n<p>5<\/p>\n<p>6<\/p>\n<p>7<\/p>\n<p>8<\/p>\n<p>9<\/p>\n<p>10<\/p>\n<p>11<\/p>\n<p>12<\/p>\n<p>13<\/p>\n<p>14<\/p>\n<p>15<\/p>\n<p>16<\/p>\n<p>17<\/p>\n<p>18<\/p>\n<p>19<\/p>\n<p>20<\/p>\n<p>21<\/p>\n<p>22<\/p>\n<p>23<\/p>\n<p>24<\/p>\n<p>25<\/p>\n<p>26<\/p>\n<p>27<\/p>\n<p>28<\/p>\n<p>29<\/p>\n<p>&#13;\n\t\t\t\t<\/td>\n<p>&#13;<\/p>\n<td class=\"crayon-code\">\n<p>class ResilientServiceClient:<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0def __init__(self, primary_url, fallback_url):<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0self.primary = primary_url<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0self.fallback = fallback_url<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0self.circuit_breaker = CircuitBreaker(<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0failure_threshold=5,<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0recovery_timeout=30<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0)<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0async def fetch(self, request):<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0# Try primary with circuit breaker protection<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0if 
self.circuit_breaker.is_closed():<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0try:<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0response = await self.call_with_timeout(<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0self.primary, request, timeout_ms=500<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0)<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0self.circuit_breaker.record_success()<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0return response<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0except (TimeoutError, ConnectionError):<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0self.circuit_breaker.record_failure()<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0# Fall back to secondary<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0try:<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0return await self.call_with_timeout(<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0self.fallback, request, timeout_ms=1000<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0)<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0except Exception:<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0# Return degraded response rather than error<\/p>\n<p>\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0return self.degraded_response(request)<\/p>\n<\/td>\n<p>&#13;<br 
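The `CircuitBreaker` used above is assumed rather than shown. A minimal sketch of what one might look like (the state fields and the "let a probe through after the recovery timeout" behavior are my assumptions, not from the original):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: closes the circuit after repeated
    failures, then allows requests again once a recovery timeout passes."""

    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.opened_at = None  # wall-clock time when the breaker opened

    def is_closed(self):
        if self.opened_at is None:
            return True
        # After the recovery timeout, allow a probe request through
        return time.time() - self.opened_at >= self.recovery_timeout

    def record_success(self):
        self.failure_count = 0
        self.opened_at = None

    def record_failure(self):
        self.failure_count += 1
        if self.failure_count >= self.failure_threshold:
            self.opened_at = time.time()
```

A production implementation would normally add a half-open state that limits probe concurrency; this sketch simply reopens the gate once the timeout elapses.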
The key insight: A degraded response is almost always better than an error. Users can work with stale data or reduced functionality. They can't work with a 500 error.

After [implementing this pattern](https://thenewstack.io/devs-dont-just-read-about-design-patterns-implement-them/) across our services, we stopped having cascading failures. When the payment processor went down again (it did, three more times that year), our API returned cached pricing and queued transactions for later processing. Customers barely noticed.

**Principle 2: Make Idempotency Non-Negotiable**

This lesson came from a $27,000 mistake.

A mobile client had a bug that caused it to retry failed requests aggressively. One of those requests was a payment. The retry logic didn't include idempotency keys. You can guess what happened next.

A single customer got charged 23 times for the same order. By the time we noticed, we'd processed duplicate charges across hundreds of accounts. The refunds, the customer service hours, the [engineering time to fix it](https://thenewstack.io/fixing-engineerings-biggest-time-suck-finding-information/) cost $27,000.

Now, every mutating endpoint requires an idempotency key. No exceptions.

```python
class IdempotentEndpoint:
    def __init__(self):
        self.idempotency_store = RedisStore(ttl_hours=24)

    async def handle_request(self, request, idempotency_key):
        # Check if we've already processed this request
        existing = await self.idempotency_store.get(idempotency_key)

        if existing:
            # Return cached response; don't re-execute
            return Response(
                data=existing['response'],
                headers={'X-Idempotent-Replay': 'true'}
            )

        # Process the request
        result = await self.execute_operation(request)

        # Cache for future retries
        await self.idempotency_store.set(
            idempotency_key,
            {'response': result, 'timestamp': now()}
        )

        return Response(data=result)
```
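The client side of this contract is just as important: mint one key per logical operation and reuse it on every retry, rather than generating a fresh key per attempt. A sketch (the `send` callable, header name, and retry shape are illustrative assumptions):

```python
import uuid

def make_payment_request(order_id, amount, send, max_retries=3):
    """Retry-safe client call: one idempotency key per logical
    operation, reused across every retry attempt."""
    idempotency_key = str(uuid.uuid4())  # generated once, not per attempt
    last_error = None
    for _ in range(max_retries):
        try:
            return send(
                {"order_id": order_id, "amount": amount},
                headers={"Idempotency-Key": idempotency_key},
            )
        except ConnectionError as e:
            last_error = e  # same key on the next attempt
    raise last_error
```

With this shape, the mobile client's aggressive retries would have replayed the cached response instead of charging the customer 23 times.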
We also started rejecting requests without idempotency keys for any POST, PUT or DELETE operation. Some client developers complained initially. Then they thanked us when their retry bugs didn't cause data corruption.

**Principle 3: Version in the URL, Not the Header**

I learned this one by watching a junior engineer debug an issue for six hours.

We'd been versioning our API through a custom header: `X-API-Version: 2`. It seemed clean. Kept the URLs tidy.

But when something went wrong, our logs showed the URL and response code, not the headers. The engineer was looking at logs for `/users/123` and couldn't figure out why the behavior was different between two clients. Six hours later, he finally thought to check the version header.

We moved versioning to the URL path that week:

```
/v1/users/123
/v2/users/123
```

Now version information shows up in:

- Every log entry
- Every trace
- Every error report
- Every monitoring dashboard

The debugging time savings alone justified the migration.
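Because the version is part of the path, it can be pulled into structured log fields with nothing more than a pattern match; a small illustrative helper (not from the original):

```python
import re

VERSION_RE = re.compile(r"^/v(\d+)/")

def log_fields(path, status):
    """The version lives in the path, so it falls out of the URL that
    is already in every log line; no header lookup required."""
    m = VERSION_RE.match(path)
    return {
        "path": path,
        "status": status,
        "api_version": int(m.group(1)) if m else None,
    }
```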
We also established versioning rules that prevented future pain:

- Breaking changes require a new version
- Additive changes (new optional fields) don't require a new version
- We support at least two versions simultaneously
- 12-month deprecation notice before sunsetting any version

When we do deprecate a version, clients get a `Deprecation` header warning them for months before we actually turn it off.

**Principle 4: Rate Limit Before You Need To**

We almost learned this lesson the hard way.

A partner company integrated with our API. Their team's implementation had a bug: When they got a timeout, they'd retry immediately. Infinitely. With exponential parallelism.

At 2 p.m. on a Thursday, their system started sending 50,000 requests per second. We didn't have rate limiting. We'd always planned to add it "when we needed it."

We needed it.

Fortunately, our load balancer had basic protection that kicked in and started dropping requests. But legitimate traffic got dropped too.
For 47 minutes, our API was essentially a lottery: maybe your request would get through, maybe it wouldn't.

The next week, we implemented tiered rate limiting:

```python
class TieredRateLimiter:
    def __init__(self):
        self.limiters = {
            'per_client': TokenBucket(rate=100, burst=200),
            'per_endpoint': TokenBucket(rate=1000, burst=2000),
            'global': TokenBucket(rate=10000, burst=15000)
        }

    async def check_limit(self, client_id, endpoint):
        # Check all tiers, return first failure
        for tier_name, limiter in self.limiters.items():
            if tier_name == 'per_client':
                key = client_id
            elif tier_name == 'per_endpoint':
                key = endpoint
            else:
                key = 'global'  # one shared bucket for all traffic
            result = await limiter.check(key)

            if not result.allowed:
                return RateLimitResponse(
                    allowed=False,
                    retry_after=result.retry_after,
                    limit_type=tier_name
                )

        return RateLimitResponse(allowed=True)
```
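The `TokenBucket` above is assumed rather than shown. A synchronous sketch of one possible implementation (the `CheckResult` shape mirrors the `allowed`/`retry_after` fields the limiter reads; the async and shared-storage details are elided):

```python
import time
from dataclasses import dataclass

@dataclass
class CheckResult:
    allowed: bool
    retry_after: float = 0.0

class TokenBucket:
    """Per-key token bucket: `burst` tokens of capacity, refilled
    continuously at `rate` tokens per second."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self._state = {}  # key -> (tokens, last_refill_time)

    def check(self, key, now=None):
        now = time.time() if now is None else now
        tokens, last = self._state.get(key, (float(self.burst), now))
        # Refill proportionally to elapsed time, capped at capacity
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self._state[key] = (tokens - 1.0, now)
            return CheckResult(allowed=True)
        self._state[key] = (tokens, now)
        # Time until one full token is available again
        return CheckResult(allowed=False, retry_after=(1.0 - tokens) / self.rate)
```

In a multi-node deployment the bucket state would live somewhere shared (Redis is a common choice); the math stays the same.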
The key details that made this actually useful:

- Always return `Retry-After` headers so clients know when to try again.
- Include `X-RateLimit-Remaining` so clients can see their budget.
- Use different limits for different client tiers (partners get more than anonymous users).
- Separate limits per endpoint (the search endpoint can handle more than the payment endpoint).

That partner's bug happened again six months later. This time, their requests got rate-limited, our other clients were unaffected, and I didn't even find out until I checked the metrics the next morning.

**Principle 5: If You Can't See It, You Can't Fix It**

The scariest outages aren't the ones where everything breaks. They're the ones where something is subtly wrong and you don't notice for days.

We had an issue where 3% of requests were failing with a specific error code. Not enough to trigger our availability alerts (we'd set those at 5%). Not enough for customers to flood support.
But enough that hundreds of users per day were having a bad experience.

It took us two weeks to notice. Two weeks of a broken experience for real users.

After that, we built [observability into every endpoint](https://thenewstack.io/observability-every-engineers-job-not-just-ops-problem/):

```python
class ObservableEndpoint:
    async def handle(self, request):
        trace_id = self.tracer.start_trace()
        start_time = time.time()

        try:
            response = await self.process(request)

            # Record success metrics
            duration_ms = (time.time() - start_time) * 1000
            self.metrics.histogram('request_duration_ms', duration_ms, {
                'endpoint': request.path,
                'status': response.status
            })
            self.metrics.increment('requests_total', {
                'endpoint': request.path,
                'status': response.status
            })

            return response

        except Exception as e:
            # Record failure with context
            self.metrics.increment('requests_errors', {
                'endpoint': request.path,
                'error_type': type(e).__name__
            })
            self.logger.error('request_failed', {
                'trace_id': trace_id,
                'error': str(e)
            })
            raise
```
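With `requests_total` and `requests_errors` counted per endpoint, the alert itself is the trivial part; what matters is thresholding on the error *rate*, not absolute volume. A sketch (the function name and threshold default are illustrative):

```python
def should_alert(total, errors, threshold=0.01):
    """Fire when the error rate crosses the threshold. A 3% failure
    slice trips a 1% alert even when absolute volume looks healthy."""
    if total == 0:
        return False
    return errors / total >= threshold
```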
Our minimum observability requirements now:

- Request count by endpoint, status code and client
- Latency percentiles (p50, p95, p99) by endpoint
- Error rate by endpoint and error type
- Distributed tracing across service boundaries
- Alerts at 1% error rate, not 5%

The 3% failure issue? With our new observability, we would have caught it in minutes, not weeks.

**The Results**

After three years of applying these principles across our API infrastructure, our reliability went from 99.2% to 99.95%, and the metric I care about most improved the most: I went from being woken up twice a week to once every two months.

**What I'd Tell My Past Self**

If I could go back to before that first 3 a.m. call, I'd tell myself:

- **Build for failure from day one.** Every external call will eventually fail. Every database will eventually go down. Design for it before it happens, not after.
- **Make the safe thing the easy thing.** Requiring idempotency keys feels like friction until it saves you from a $27,000 mistake. Rate limiting feels unnecessary until a partner's bug tries to take you down.
- **Invest in observability early.** You can't fix what you can't see. The cost of good monitoring is nothing compared to the cost of not knowing your system is broken.
- **Boring is good.** The clever solution that's hard to debug at 3 a.m. isn't clever. Version in the URL.
Return clear error messages. Make the obvious choice.

APIs don't survive by accident. They survive by design; specifically, by designing for the moment when everything goes wrong.

Now, when my phone buzzes at 3 a.m., it's usually just spam. And that's exactly how I like it.

*Sreenivasa Reddy Hulebeedu Reddy is a lead software engineer and enterprise systems architect with more than 13 years of experience building APIs that handle tens of millions of daily requests. He serves as a peer reviewer for Wiley-IEEE Press and…*