Sudden latency regressions in distributed systems are almost always due to throughput-driven contention or queueing at some choke point. As a result, the root cause of a slow transaction lies in the other transactions that are gumming up the works. How can we find the root cause of these interference effects explicitly and without guesswork? And how does that scale to microservice architectures, where each transaction crosses hundreds of process boundaries before completing its round trip?
Solving this problem is the “holy grail” of system analysis, and recent advances in distributed tracing technology bring it within reach of software engineering today. Ben Sigelman explains why this workflow could change the way we understand critical-path latency in distributed systems. Ben begins with a quick summary of the approach that Dapper, Google’s distributed tracing system, took in the mid-2000s, discussing the limits of its design and its fundamental inability to find the root cause of most contention-related latency issues. He contrasts this with the new world order, where some monitoring technologies can observe a distributed system with full fidelity, and then leads an audience-participation demo that connects the dots from a high-latency outlier request to the contended resource it’s waiting on. This workflow is direct and clear, and it replaces an entire bevy of other complex and expensive tooling.
Ben Sigelman is the cofounder and CEO of LightStep, where he’s building reliability management for modern systems. An expert in distributed tracing, Ben is the coauthor of the OpenTracing standard, a project within the Linux Foundation’s Cloud Native Computing Foundation (CNCF). Previously, he built Dapper, Google’s production distributed systems tracing infrastructure, and Monarch, Google’s fleet-wide time series collection, storage, analysis, and alerting system. Ben holds a BSc in mathematics and computer science from Brown University.