The Case of a Leaky Goroutine

In Go, it’s very easy to build something using high-level concurrency patterns thanks to Goroutines and the channels used to signal between them. A Goroutine is essentially a lightweight, runtime-managed thread (a green thread) that the Go scheduler multiplexes onto native OS threads in an M:N fashion. The simple go func() prefix syntax makes fire-and-forget Goroutines for running small tasks concurrently trivial.

Or is it? If we are to believe Katherine Cox-Buday, the author of O’Reilly’s Concurrency In Go, it’s not:

Concurrency can be notoriously difficult to get right, but fortunately, the Go programming language was designed with concurrency in mind. In this practical book, you’ll learn how Go was written to help introduce and master these concepts, as well as how to use basic concurrency patterns to form large systems that are reliable and remain simple and easy to understand.

That sounds rather optimistic, and the countless “memory leaks in Go and how to avoid them” articles and leak detection packages tell us otherwise. The most common cases of “leaky” Goroutines (routines that live on forever even though we assume they have exited and been cleaned up) are neatly laid out by Uber’s Georgian-Vlad Saioc in the write-up of their LeakProf Goroutine leak detection system.

We stumbled upon a leaky gut (erm, code gut?) two weeks ago when an out-of-memory (OOM) kill suddenly restarted Kubernetes pods halfway through workflow runs that, of course, are not quite idempotent. Not knowing where to begin, we fired up Go’s profiler pprof and got to work. After a day of poking around, we found our own version of a never-ending Goroutine factory. This post summarizes our findings in case they might come in handy for others or my future self.

Identifying the problem (Profiling)

Grafana’s dashboard can monitor Goroutine memory usage, and seeing it spike without coming back down is an obvious red flag, but it doesn’t give you any details yet. For that, you can stay within the Grafana stack and use Pyroscope, which charts interactive memory flame graphs based on the pprof dumps it pulls from your container (provided the whole setup shebang is done right).

The chart tells us that runtime.gopark is parking Goroutines spawned by pipeline funcs we didn’t even know existed. Lo and behold, these convert a context’s done channel into a channel of generic interface{} values as part of the pipeline by creating a Goroutine that waits for the channel to be done. Except that it never will be, since the context that’s passed in isn’t a derived one created with something like context.WithCancel(). In other words, the context is only cancelled when the whole root request ends, which can be never for a background job with a background context. Whoops. We’ll get back to that.

You can also run Pyroscope locally using Docker, by the way: docker run -it -p 4040:4040 grafana/pyroscope.

If you don’t care about Grafana, no worries: pprof attaches itself to your HTTP server once you add import _ "net/http/pprof", which registers its handlers on http.DefaultServeMux (if you don’t run an HTTP server yet, you’ll have to start one). From then on, /debug/pprof/ is an endpoint from which heap, CPU, Goroutine, and other profiles can be dumped using, for instance, curl. See the official pprof docs and the Go dev blog entry on pprof for more information.

Once you’ve managed to get your profile dump, you can analyze it with go tool pprof [profile_file]. If you’ve installed graphviz, it’ll generate a visual representation of your snapshot, as seen in the aforementioned Go dev blog entry. The most interesting view is of course a diff between a baseline and a dump taken after lots of leaky Goroutine work: use the -diff_base flag for that (see the Go dev blog entry on PGO). Profile percentages are relative to the first dump.

Let’s get back to that context that’s never truly cancelled. This piece of code is the perpetrator:

func ToDoneInterface(done <-chan struct{}) <-chan interface{} {
    interfaceStream := make(chan interface{})
    go func() {
        defer close(interfaceStream)
        select {
        case <-done:
            return
        }
    }()
    return interfaceStream
}

The defer close() looks like proper cleanup, but it’s on the wrong channel: it closes interfaceStream, and it only runs once the Goroutine returns. The select blocks until it receives something from done, which happens either when a value is sent or when done is closed (a receive on a closed channel returns the zero value immediately). So the Goroutine only exits, and only closes interfaceStream, after that first receive on the passed-in done channel. If we pass the same channel multiple times (which we do) and that channel is never closed because it is the ctx.Done() of a long-lived, never-cancelled context (which it is), every one of these Goroutines will leak.

I don’t know if that all makes sense if you’re not familiar with Go, or even if you are. I know I had to stare at the above code block and its usage context (get it, Context? Go joke!) for a good hour before realizing something wasn’t as it was supposed to be here.

Reproducing the problem (Fixing)

There’s a neat way to detect Goroutine leaks in tests using Uber’s goleak package (go.uber.org/goleak):

import (
  "testing"
  "go.uber.org/goleak"
)

func TestA(t *testing.T) {
  defer goleak.VerifyNone(t)

  // test logic here.
}

It works by inspecting the stacks of all Goroutines that are still running when the test ends and failing the test if any unexpected ones remain. We then cooked up a script that spins up a consume/produce cycle using context.Background() as the root context, first without cancellation and then with it. The Pyroscope Go API can act as a helpful shortcut to auto-feed profiles straight from your program.

The most systematic way to detect leaky Goroutines early must be Uber’s LeakProf, a separately deployed system that regularly pulls in pprof dumps, enriches them with stack data, closely monitors memory usage, and even automatically files a bug report in case the shit hits the fan. I don’t think we’re there yet!

Conclusion: the problem is often hidden in a small corner… Don’t convert channels! Stick with Go’s built-in context pattern and derive from the one passed in if needed!
