Time: 3:00 PM
Room: OEC A
One Agent Is Easy. Fifty Agents at 2 AM Is Not.
A single AI model behind an API is a solved problem. Monitor the endpoint, track your tokens, watch your latency, and you’re done. But that’s not what production AI looks like at enterprise scale. Production looks like a multi-model serving platform spanning multiple clouds, with safety screening, prompt caching, embedding pipelines, model pools, and async billing all serving dozens of consumers simultaneously. When a user reports slow responses, the root cause could be a safety model bottleneck three hops upstream, a cron job that evicted rate limit keys from a shared Redis, or a cross-cloud network issue that triggered fallback routing and cascaded into a completely different model pool.
These aren’t hypothetical scenarios. They’re the kind of failures that page you at 2 AM, and most of our instincts about where to look are wrong. The inference layer is idle, but the safety screening service is saturated. The model pool is overwhelmed, but only because fallback routing is sending it traffic it was never sized for. A service deploys cleanly with zero errors, but fifteen minutes later it’s writing bad data to a shared database that breaks rate limiting.
In this talk, I’ll walk through a live multi-cloud AI serving platform, tracing actual failures from symptom to root cause in real time. We’ll explore how to instrument AI infrastructure with OpenTelemetry so your traces capture the signals that matter: routing decisions, safety screening latency, model pool contention, and cache behavior, not just HTTP status codes. How to use topology-aware dependency mapping to understand the non-obvious coupling between services that share compute resources, databases, and message queues. How to distinguish “this service is slow” from “this service is starved because something upstream is blocking it” (a critical difference when your AI platform has a deep request chain!). And how causal analysis connects a deployment event to a downstream failure four hops away and fifteen minutes later, when correlation alone gives you only noise. You’ll leave with a practical understanding of what AI observability actually requires at platform scale, and why the hardest bugs are the ones where every dashboard is green.
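To make the OpenTelemetry point concrete, here is a minimal sketch (not taken from the talk) of recording routing and safety-screening signals as span attributes. The attribute names and the route_request/screen_prompt helpers are illustrative assumptions, not standard semantic conventions or the speaker’s actual instrumentation.

```python
# Minimal sketch: attach routing and safety-screening signals to a span so a
# trace carries more than HTTP status codes. Requires the opentelemetry-api
# package; without an SDK configured, the calls are no-ops, so this is safe
# to run as-is. Attribute names below are illustrative placeholders.
import time

from opentelemetry import trace

tracer = trace.get_tracer("inference-gateway")


def screen_prompt(prompt: str) -> bool:
    """Placeholder safety check; a real platform would call a safety model."""
    return "unsafe" not in prompt


def route_request(prompt: str, preferred_pool: str = "gpu-pool-a") -> str:
    with tracer.start_as_current_span("route_request") as span:
        # Record the routing decision so the trace shows *why* traffic landed
        # on a given pool, not just that it did.
        span.set_attribute("ai.routing.preferred_pool", preferred_pool)

        start = time.monotonic()
        allowed = screen_prompt(prompt)
        span.set_attribute("ai.safety.latency_ms", (time.monotonic() - start) * 1000)
        span.set_attribute("ai.safety.allowed", allowed)

        # Mark fallback explicitly so a saturated fallback pool can be traced
        # back to this routing choice.
        pool = preferred_pool if allowed else "quarantine-pool"
        span.set_attribute("ai.routing.selected_pool", pool)
        span.set_attribute("ai.routing.fallback_used", pool != preferred_pool)
        return pool


print(route_request("summarize this document"))
```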

Matt Rein
Lead Solutions Engineer
Dynatrace
Matt Rein is an experienced Lead Solutions Engineer with a strong background in software development, technical sales, and mentoring. With over a decade of experience in the tech industry, he has worked in leadership roles across multiple companies, with a current focus on helping AI‑native and cloud‑scale organizations adopt effective observability practices. He specializes in educating stakeholders, driving product adoption, and solving complex technical challenges.
Currently at Dynatrace, Matt leverages his expertise to help organizations optimize their software intelligence and observability strategies at scale. He previously held key roles at Ionic and ManageAmerica, where he led development teams, implemented agile practices, and drove cloud and mobile initiatives.
Beyond his professional roles, Matt is passionate about mentoring and teaching, serving as an Instructor at Nucamp Coding Bootcamp, where he helps aspiring developers build their skills in web technologies. He holds a Bachelor of Arts from St. Olaf College and is committed to continuous learning and innovation.
With a strong foundation in software architecture, modern application development, DevOps, and cross‑team collaboration, Matt excels at bridging technical and business stakeholders to drive outcomes.