Tracing
Tracing is a technique used in backend web application development to track the flow of requests and responses between different services and components. The purpose of tracing is to identify performance bottlenecks, detect errors, and troubleshoot issues within the application.
(From the Jaeger docs)
Trace data is generally collected by a distributed tracing system, which allows developers to follow a single request across multiple services and components. These systems collect and correlate trace data from various sources, including application code, servers, and network devices.
Popular open source tracing systems include Zipkin and Jaeger, which collect, store, and visualize traces. OpenTracing was a vendor-neutral API specification for instrumenting applications rather than a tracing application itself; it has since been merged into the OpenTelemetry project, which provides the instrumentation libraries commonly used with these backends.
To instrument code for tracing, developers can use tracing libraries or SDKs that are provided by tracing systems. These libraries allow developers to add trace information to requests and responses as they pass through the application.
Application traces typically include information such as request and response headers, timing information, error messages, and performance metrics. By analyzing this data, developers can identify areas of the application that are causing performance problems or errors.
Performance
Percentiles
In software development and performance analysis, p50, p95, and p99 are often used to measure the performance of a system or application.
p50, also known as the median or 50th percentile, represents the value below which 50% of the data falls. In other words, if a system has a p50 response time of 100ms, it means that 50% of the requests are processed within 100ms and the other 50% take longer than that.
p95, also known as the 95th percentile, represents the value below which 95% of the data falls. If a system has a p95 response time of 200ms, it means that 95% of the requests are processed within 200ms and the other 5% take longer than that.
p99, also known as the 99th percentile, represents the value below which 99% of the data falls. If a system has a p99 response time of 500ms, it means that 99% of the requests are processed within 500ms and the other 1% take longer than that.
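As a small worked example, the sketch below computes these percentiles with the nearest-rank method, which is one of several common definitions (monitoring tools may interpolate or use approximations instead). The `percentile` function and the sample latencies are made up for illustration.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical response times in milliseconds for 100 requests.
latencies_ms = [80] * 50 + [120] * 45 + [300] * 4 + [900]

for p in (50, 95, 99):
    print(f"p{p} = {percentile(latencies_ms, p)} ms")
# p50 = 80 ms
# p95 = 120 ms
# p99 = 300 ms
```

Note that the single 900 ms request does not show up in any of the three values; it only appears at p100, the maximum.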
Long Tail Optimization
We focus on optimizing p95 or p99 requests (the “long tail”) instead of the median request time because tail metrics show how the system behaves for its slowest requests, which the median hides.
The median request time, or p50, describes the typical request, but it says nothing about outliers or worst-case behavior. Note that the median is not the same as the average: a handful of very slow requests can raise the mean considerably while leaving the median unchanged. Optimizing for the median may make typical requests faster, but it may do nothing for the users who experience long response times or errors.
On the other hand, p95 and p99 describe requests that are slower than the large majority but still occur regularly: 1 request in 20 is at or above p95, and 1 in 100 is at or above p99. On a busy system that is a large absolute number of requests. These metrics give a better picture of real-world behavior, where some requests take much longer than others. By optimizing p95 or p99 response times, developers improve the experience for the users who are currently served worst.
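One way to see why tail requests affect many users: a single page view or user session often triggers several backend requests, and it only takes one slow request to make the whole interaction feel slow. The sketch below estimates that probability under the simplifying assumption that request latencies are independent, which real systems only approximate. The function name is hypothetical.

```python
def chance_of_slow_request(request_count, percentile=99):
    """Probability that at least one of request_count requests is slower than the given percentile."""
    fast_fraction = percentile / 100
    return 1 - fast_fraction ** request_count


for n in (1, 10, 50, 100):
    print(f"{n:>3} requests: {chance_of_slow_request(n):.1%} chance of at least one p99-slow request")
```

Under this assumption, a session that issues 100 requests has roughly a 63% chance of including at least one request slower than p99.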
Tracing and Performance
To diagnose performance issues using tracing, developers and operations teams typically use visualization tools that allow them to view trace data in a meaningful way. These tools may display the flow of requests through the system, with each step annotated with information such as response times and error codes.
By analyzing trace data in this way, developers and operations teams can quickly identify the root cause of performance issues and take appropriate action to resolve them. For example, they may need to optimize the code of a particular service or component, scale up or down certain resources, or implement caching or other performance-enhancing strategies.
Keep the following in mind:
Focus on the longest part of the trace when attempting to resolve performance bottlenecks.
Focus on p95 or even p99 traces. Optimizing only p50 traces is often less effective because a small number of long tail requests can account for a disproportionate share of total request time and of user-visible slowness. Reducing tail latency lowers the mean response time directly; it lowers the p50 only when the fix also speeds up typical requests, for example by removing a bottleneck that every request shares.
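The first point can be sketched in code. Given the spans of one slow trace, the example below ranks spans by self time, meaning the time spent in a span excluding its children, so that a parent span is not blamed for time actually spent further down. The span data and names are hypothetical, and the sketch assumes child spans run sequentially and that span names are unique within the trace.

```python
from collections import defaultdict

# Hypothetical spans from one slow trace: (span_id, parent_id, name, duration_ms)
spans = [
    ("a", None, "GET /checkout", 480),
    ("b", "a", "auth.verify", 20),
    ("c", "a", "cart.load", 410),
    ("d", "c", "db.query carts", 35),
    ("e", "c", "pricing.recalculate", 360),
    ("f", "a", "render", 30),
]


def self_times(spans):
    """Return {name: duration minus the total duration of direct children}.

    Assumes children do not overlap in time; parallel calls would need
    interval arithmetic instead of simple subtraction.
    """
    child_total = defaultdict(int)
    for _span_id, parent_id, _name, duration in spans:
        if parent_id is not None:
            child_total[parent_id] += duration
    return {
        name: duration - child_total[span_id]
        for span_id, _parent_id, name, duration in spans
    }


ranked = sorted(self_times(spans).items(), key=lambda item: item[1], reverse=True)
for name, ms in ranked:
    print(f"{name:<22} {ms:>4} ms self time")
```

Here `cart.load` looks slow at 410 ms in total, but almost all of that time is spent in its child `pricing.recalculate`, which is where optimization effort would pay off first.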