Monday, 10 April 2017

Why does my code slow down sometimes (and how to find out who's behind all this)?

Part 2. Understanding


In the previous post of this series we went through the available tools for capturing OS jitter and offered a new alternative with higher resolution, which can be particularly useful when we need to go down to sub-millisecond scales.
So let's use this tool and capture some OS jitter once more. This time we're running on a setup that should offer a better-tuned execution environment, applying a few basic tricks to make compute resources more available to user processes and less prone to being hijacked by the OS to do whatever OSes do these days.
We're passing a few extra boot parameters to the kernel:

isolcpus=6,7 nohz_full=6,7 rcu_nocbs=6,7 processor.max_cstate=0 intel_idle.max_cstate=0 transparent_hugepage=never idle=poll elevator=noop audit=0 nosoftlockup=0

If you are interested in what they do, have a look at Jeremy Eder's great post or just read through the items in the kernel's OS-jitter README.
These kernel boot params should give us a nice, undisturbed stream of CPU cycles so our program can execute without a hiccup. Right?
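Before going any further, it's worth confirming the parameters actually reached the running kernel. The checks below are generic Linux interfaces rather than anything specific to this setup (the sysfs entry only exists on reasonably recent kernels):

$ cat /proc/cmdline                      # should echo back the boot parameters listed above
$ cat /sys/devices/system/cpu/isolated   # on recent kernels, lists the isolated cpus, e.g. 6-7
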
This time we will not only try to capture the jitter profile but also record the underlying activity in an attempt to understand it.

In order to introspect kernel activity we'll need to profile it to identify any sources of jitter.
Most commonly people use ftrace (with trace-cmd and KernelShark) and the function_graph tracer. This is a great toolset used by many professionals who need to dive into the guts of the system. You could also use perf record and then go through the captured profile with perf report, which also offers a zoomed-in view of how many cycles particular kernel functions (or userspace routines) took.
If you search for kernel profiling tools, you will also stumble upon sysdig, SystemTap or DTrace (if you're on Solaris or using its experimental Linux support).
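Just to make that concrete, a minimal perf-based capture of one of the isolated cores could look something like the following. This is purely illustrative - generic perf options rather than the exact commands used for this post:

$ perf record -a -C 6 -g -- sleep 30   # system-wide sampling restricted to cpu6, for 30s, with call graphs
$ perf report                          # then browse where the cycles went, kernel and userspace alike
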

Introducing LTTng

What I personally like to go with is LTTng - a fairly undervalued and relatively little-known profiling stack for Linux. It's been around for quite some time and I think it deserves more recognition than it currently enjoys.
You can read about its rich feature set and how to effectively use it in the user guide.
With that basic introduction out of the way, let's run the jitter sampler and LTTng profiling in parallel:

$ nohup jitter -c 6 -d 300 -r 10 -o influx://localhost:5581 & lttng create jitter-sampling && lttng enable-events -k -a && lttng start && sleep 300 && lttng stop
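
When the run is over, the session can be torn down and the resulting CTF trace inspected straight from the command line. The path below is just lttng's default output location for a session created without an explicit output directory - adjust it if your setup differs:

$ lttng destroy jitter-sampling                # the session goes away, the trace files stay on disk
$ babeltrace ~/lttng-traces/jitter-sampling-*  # dump the recorded events as text for a quick look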

Now, let's have a look at what level of interruptions a userspace thread should experience when running on one of our isolated cores:

[chart: jitter profile captured on the isolated core - spikes reaching a few milliseconds are clearly visible]

Wow, this isn't quite what we were expecting, is it? We were hoping that having one full core isolated (cpu6 and cpu7 on an HT setup) would help us achieve a relatively low level of jitter.
Few-millisecond spikes don't really match those expectations. So why not zoom in on one of these spikes and try to correlate it with what's been recorded by LTTng.
One of the best things about the tool is, in my opinion, the fact that it persists the recorded events in CTF (Common Trace Format), which can be loaded into another neat piece of software: Trace Compass.

[screenshot: the recorded CTF trace loaded into Trace Compass]
Cool, isn't it? I'll leave discovering its wonders to your eager mind. For now let's stick to the relevant core and the work it was performing at the time of the stall (21:40:07.200):

[screenshot: Trace Compass zoomed in on the relevant core around 21:40:07.200, showing the jitter sampler being switched out for a kernel writeback thread]

So now things are becoming clear. Our jitter-sampling thread got switched out by a kthread performing writeback activity as part of processing its associated workqueue.
Workqueues are one of the Linux kernel's mechanisms for managing chunks of deferred work, usually related to filesystem I/O, block devices or other system routines.
Let's fix this by affinitizing the writeback workqueue to cpus 0-5 for now, which excludes our latency-critical set (cpus 6 and 7).

echo 3f > /sys/bus/workqueue/devices/writeback/cpumask
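
The mask 3f is binary 111111, i.e. cpus 0-5, so cpus 6 and 7 are left alone. Reading the same sysfs file back is an easy way to confirm the change took effect (note it does not survive a reboot):

cat /sys/bus/workqueue/devices/writeback/cpumask    # should now report 3f (possibly zero-padded)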

It's time to re-run the jitter sampler and see if there is any improvement.

[chart: jitter profile after affinitizing the writeback workqueue away from the isolated cores]

Much better...
Although affinitizing workqueues, IRQs or RCU callbacks is a pretty fundamental technique that most performance-oriented developers are familiar with, the focus of this post is on something entirely different.
I hope this short material demonstrates that hunting latency spikes doesn't always have to be an intimidating experience or a feat of some dark systems-engineering art. By employing and combining the right tools, it may become a fruitful exercise even for those of us who don't feel very comfortable delving into the insides of an OS.
Stay tuned for the next post in the series, where we'll take a closer look at HW-driven sources of execution stalls and other lesser-known classes of events that contribute to so-called platform noise.

Sunday, 2 April 2017

Why does my code slow down sometimes (and how to find out who's behind all this)?

Part 1 - Seeing is believing

Most of us will face this problem at some point. Our code (in various forms) runs happily in whatever environment we throw it onto, and then one day someone complains about a single request, possibly out of tens of thousands, taking forever and timing out or otherwise missing its SLA.
Or maybe there was a whole batch of short-lived transactions that never went through, or a GC pause that took forever. There are many reasons why code may appear to have suddenly slowed down or even paused for a while.
In part 1 of this blog series I'd like to offer a simple approach to capturing the noise coming from the operating system - that is, anything that prevents a user-space thread from munching away at those yummy CPU cycles in a continuous and undisturbed fashion. We'll leave out some of the possible culprits like NMIs, MCE polls, TLB shootdowns, thermal throttling and the like, as they may not be considered strictly OS-driven events and will be the topic of a separate post in this series.
The following material is useful mainly for Linux-based environments which, in my experience, make up the overwhelming majority of serious production systems.

There are already some very useful tools for measuring OS jitter, but over the years I've learned that they all have restrictions that make them of limited use for relatively detailed analysis:

jHiccup - a great tool written and open-sourced by Gil Tene, the CTO of Azul Systems, focused mainly on capturing jitter in the context of Java applications. This is a very mature and quite widespread tool with lots of interesting options, like various output formats, being able to attach to a running JVM process, executing an idle control process to acquire a baseline to compare against, and a very low footprint (mainly thanks to employing another of Gil Tene's brilliant children - the HdrHistogram). My problem with it, however, is that out of the box jHiccup does not support high-resolution timing (due to how it captures jitter) and its time-series output doesn't contain absolute timestamps (it does contain relative time, though).

sysjitter - a tiny C utility provided and open-sourced by Solarflare as part of their OpenOnload stack. It works quite differently from jHiccup, burning all CPUs for the duration of the measurement and producing a summary of quantiles for the observed stalls in execution per CPU. It obviously lacks time-series output.

MicroJitterSampler - yet another Java utility (class) for capturing jitter experienced by Java code. It's part of Peter Lawrey's famous Java-Thread-Affinity library that some of you are probably already familiar with. It does not employ any generic histogram data structure, so it is unable to give you a nice quantile breakdown of the observed latencies. It also does not offer a detailed sequence of measurements in the form of time-series data.

JLBH - another useful tool from Peter Lawrey and his team, this one part of their Chronicle library. In many ways it's similar to MicroJitterSampler, so it shares the same shortcomings.

Maybe I was expecting too much, but the landscape of OS-jitter measurement tools is still pretty much crickets and tumbleweed. And, for a very good reason, there is a need for a tool that would allow one to pinpoint a particular spike in latency and associate a high-precision timestamp with it.
When you're trying to understand what exactly is causing the slow-downs, you first need to be able to isolate the problematic event from all the others, so you can surgically zoom in on the culprit.
The 99.99th percentile will tell you there's a problem, and also how bad it is, but it definitely won't get you closer to understanding why, or what the possible cause may be.
For that you will need precise, high-granularity measurements, especially if you're on the sub-millisecond scale.

I was experimenting with a Java-based jitter sampler that would allow me to achieve that by combining sysjitter's approach - spinning on a CPU (or CPUs) and capturing any disturbances - with producing a rich, high-precision time series of the observed latency. I eventually concluded that it's easier to guarantee a stable, more reliable execution profile with native code, so I decided to go with some simple C code:
#define _GNU_SOURCE   /* needed for sched_setaffinity() and cpu_set_t from <sched.h> */

#include <stdio.h>
#include <time.h>
#include <string.h>
#include <stdlib.h>
#include <sched.h>

#include "jitter.h"
#include "influx.h"
#include "csv.h"

#define NANOS_IN_SEC 1000000000ul

struct timespec ts;
long long cpu_mask = -1l;   /* default: allow all cpus */

void print_usage() {
    puts("Usage:\n");
    puts("-h\tdisplay this usage message");
    puts("-c\ttarget cpu to run on (default: any)");
    puts("-d\tsampling duration in seconds (default: 60)");
    puts("-r\tjitter reporting interval in milliseconds (default: 1000)");
    puts("-t\tjitter reporting threshold in nanoseconds; delays below this value will not be reported (default: 300)");
    puts("-o\toutput results using an output plugin. Supported plugins:");
    puts("\tinflux://<host:port>\tstores results in influxDb with line protocol over UDP");
    puts("\tcsv://<file>\tstores results in a csv file");
    exit(0);
}

unsigned long long nano_time() {
    clock_gettime(CLOCK_REALTIME, &ts);
    return ts.tv_sec * NANOS_IN_SEC + ts.tv_nsec;
}

unsigned long long capture_jitter(unsigned long long duration, unsigned long long granularity,
                                  unsigned long long threshold, struct jitter *jitter) {
    unsigned long long ts = nano_time();
    unsigned long long deadline = ts + duration;
    unsigned long long next_report = ts + granularity;
    unsigned long long max = 0;
    unsigned long long idx = 0;

    /* pin this thread to the requested cpu; the raw long long is passed as a crude cpu_set_t (cpus 0-63) */
    sched_setaffinity(0, sizeof(long long), (const cpu_set_t *) &cpu_mask);

    while (ts < deadline) {
        unsigned long long now = nano_time();
        unsigned long long latency = now - ts;

        if (latency > max) max = latency;

        /* once per reporting interval, record the worst stall seen (if it exceeds the threshold) */
        if (now > next_report) {
            if (max > threshold) jitter[idx++] = (struct jitter) {.timestamp = now, .delay = max};
            max = 0;
            next_report = now + granularity;
        }

        ts = now;
    }

    return idx;
}

int main(int argc, char* argv[]) {
    unsigned long long duration = 60 * NANOS_IN_SEC;
    unsigned long long granularity = NANOS_IN_SEC;
    unsigned long long threshold = 300;
    process_output out_function = NULL;

    int idx = 1;
    for (; idx < argc; idx++) {
        if (strcmp("-h", argv[idx]) == 0) print_usage();
        else if (strcmp("-c", argv[idx]) == 0) cpu_mask = 1ll << strtol(argv[++idx], (char **)NULL, 10);
        else if (strcmp("-d", argv[idx]) == 0) duration = strtol(argv[++idx], (char **)NULL, 10) * NANOS_IN_SEC;
        else if (strcmp("-r", argv[idx]) == 0) granularity = strtol(argv[++idx], (char **)NULL, 10) * 1000000ul;
        else if (strcmp("-t", argv[idx]) == 0) threshold = strtol(argv[++idx], (char **)NULL, 10);
        else if (strcmp("-o", argv[idx]) == 0) {
            char *output = argv[++idx];
            if (strstr(output, "influx://")) out_function = init_influx(output + 9);
            else if (strstr(output, "csv://")) out_function = init_csv(output + 6);
        }
    }

    printf("duration: %llus\n", duration / NANOS_IN_SEC);
    printf("report interval: %llums\n", granularity / 1000000);
    printf("report threshold: %lluns\n", threshold);

    struct jitter *jitter = calloc(duration / granularity, sizeof(struct jitter));
    long long data_points = capture_jitter(duration, granularity, threshold, jitter);

    if (out_function) out_function(data_points, jitter);   /* skip output if no -o plugin was selected */

    return 0;
}

The InfluxDB and CSV output plugins reside in separate source files. The whole project can be checked out from my GitLab repo.
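
If you'd like to try it out, building it should be nothing fancy - something along these lines, assuming the plugin sources are named jitter.c, influx.c and csv.c (those file names are my guess based on the headers; check the repo for the actual layout and build script):

$ gcc -O2 -o jitter jitter.c influx.c csv.c
$ ./jitter -c 6 -d 60 -r 10 -o csv://jitter.csv    # spin on cpu6 for 60s, reporting the max stall every 10ms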

Unlike jHiccup, it only captures the maximum observed delay per reporting interval, but the idea behind this tool is to offer high-resolution data with classic time-series characteristics. Having a dense data series allows you to compute and transform it as a whole, or partitioned into arbitrary time windows.
Unfortunately, as far as I understand the underlying maths, in most cases you can't do that reliably with a series of consecutive histograms, so having the raw data can be an advantage.
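
As a trivial illustration of that flexibility - and assuming the CSV plugin writes one timestamp,delay pair (in nanoseconds) per line, which you should verify against the actual plugin source - the raw series can be re-bucketed into, say, per-second maxima with a throwaway one-liner over the hypothetical jitter.csv from the example above; that's not something a stream of pre-aggregated histograms would let you do:

$ awk -F, '{ s = int($1/1e9); if ($2 > m[s]) m[s] = $2 } END { for (s in m) print s, m[s] }' jitter.csv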

When we finally have the data, we can start analyzing it with our favourite tools, like Grafana (with InfluxDB as the datasource):

[screenshot: Grafana dashboard charting the jitter time series pulled from InfluxDB]

Or R (with a CSV file):

[plot: the same jitter data charted in R from the CSV output]

In the next post I'll show how we can make this data useful well beyond looking nice on eye-candy graphs, and pinpoint the culprits behind all those interesting spikes.

Please bear in mind that the jitter-sampler code is not polished, lacks error checking and could probably use some optimizations (e.g. mmap with MAP_HUGETLB instead of calloc), so feel free to contribute fixes/improvements.

Part 2 - Understanding