Go Post-mortem

Jan 16, 2018

Introduction

Even if you write tests and those tests are perfect, effective and everything you will face production issues. Sometimes everything will just crash and when that’ll happen you’ll need some debugging skills. Here I’m putting some flow examples, notes and resources around post-mortem debugging of Go programs, including, but not limited to, those running inside a Kubernetes cluster.

I’m a huge fan of GDB, I used it for almost a decade now and while I remember it was a bit clunky and difficult at first but it got better in the years and in the meantime I got used to it and it’s commands and perks.

A few years after my first gdb, I saw Delve and had the same feeling: it was clunky, and it had nearly anything I needed. Now, after a couple of years Delve seems pretty complete and it just works, one thing I really like about Delve is that even if the terminal interface is similar to other debuggers (like gdb) has a very well done UX, for example:

This is the current execution point of a program under Delve:

    29:         for {
    30:                 select {
    31:                 case message := <-ch:
    32:                         fmt.Println(message)
    33:                 case <-time.After(time.Second * 3):
=>  34:                         panic("I'm panicking because of timeout!")
    35:                 }
    36:         }
    37: }

This is how the same thing looks like in GDB

30                      select {
31                      case message := <-ch:
32                              fmt.Println(message)
33                      case <-time.After(time.Second * 3):
34                              panic("I'm panicking because of timeout!")
35                      }
36              }
37      }

However, one could extend GDB by customizing .gdbinit like how the GDB Dashboard does:

GDB Dashboard Screenshot

A part from the UX, Delve has several advantages over GDB when debugging Go:

It works out of the box - (Just Works™)
Integrated with major editors: Vim/Nvim, Code, GoLand
It has an impressivly well working support for Go concurrency patterns
Even post-mortem core dump analysis works just right

On the other hand, GDB has still some advantages over it:

It supports any architecture (Delve only amd64)
It supports any OS (Delve only Linux, OSX and Windows)
You can debug Cgo
You need to extend your debugging to the Go runtime itself, GDB is your tool

So, how to say, I’m not very religious about the debugger I use but generally when I need to choose Delve or GDB I just go with Delve and then if Delve breaks I try with GDB. It’s not very scientific but it’s a wise way to proceed.

I will show a few examples:

You need to debug an API server entirely written in Go, delve
You need to debug some Go program orchestrating linux namespaces (like runc), delve breaks and I use GDB because in to interact with kernel features you need Cgo!
You need to debug a program compiled for ppc64le? delve breaks, I use GDB
You have a router mounting a mips processor and you want to debug your own DHCP written for it in Go? you know the debugger, true story.

See it? Straightforward!

Proceed with the Autopsy

When you do an autopsy over a crashed Go program there are two things you need:

A core dump
A non-stripped, with DWARF and debug symbols binary of your program

To get a core dump, the first thing you do is to check the maximum file size for a core dump file

ulimit -c

If it is zero, you are not core dumping so you need to raise the limit.

ulimit -c unlimited

Also, when debugging, you have to remember that the Go runtime can be said to trigger certain behavior that can make debugging easier. In this case, we are interested to say the Go runtime to trigger a core dump by actually doing a segfault instead of just exiting in case of panic.

To do so, when we run the Go program we need to run it with GOTRACEBACK=crash like this.

GOTRACEBACK=crash ./myprogram

There are also other environment variables and behaviors for GOTRACEBACK, if you want to discover more take a look here.

Now that your system can core dump you need to restart the program and wait for it to crash.

You can obtain the core file of a running program using gcore, see man gcore for more info.

Core files in linux are written with a template defined in /proc/sys/kernel/core_pattern

cat /proc/sys/kernel/core_pattern
|/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %e

In my case it is using the systemd-coredump program to write files, and I get lz4 compressed files like this:

/var/lib/systemd/coredump/core.godebugging.1000.a0b55b870a3f443696cf7cb874d7f27b.32124.1515962062000000.lz4

Production just got hot

Generally speaking when a Go binary is built for production most of us will use something similar to:

go build -ldflags "-s -w" .

which has the effect of removing the symbol table and DWARF symbol table from the program.

Semi OT: if you are a digger, read this blog from Dave Cheney on what the Go build command does, you’ll see a few interesting things.

The good news is that we can analyze a core file with a binary built from the same source code, this time without those ldflags.

I don’t own the code

But in case you don’t have or own the source be aware that without a symbol table we can’t debug a program with Delve so with gdb and some Assembly skills you can use gdb to use the binary you have with the core file you got from the production system, then you can start analyzing what happened:

gdb ./godebugging core.1234

[New LWP 32124]
[New LWP 32125]
[New LWP 32126]
[New LWP 32127]
[New LWP 32128]
Core was generated by `./godebugging'.
Program terminated with signal SIGABRT, Aborted.
#0  0x0000000000455594 in ?? ()
[Current thread is 1 (LWP 32124)]

(gdb) bt
#0  0x0000000000455594 in ?? ()
#1  0x000000000043d10b in ?? ()
#2  0x0000000000000006 in ?? ()
#3  0x0000000000000000 in ?? ()

(gdb) x/30i $pc
0x455600:    sub    $0x18,%rsp
0x455604:    mov    %rbp,0x10(%rsp)
0x455609:    lea    0x10(%rsp),%rbp
0x45560e:    mov    0x10c96b(%rip),%rax        # 0x561f80
0x455615:    cmp    $0x0,%rax
0x455619:    je     0x45563f
0x45561b:    xor    %edi,%edi
0x45561d:    lea    (%rsp),%rsi
0x455621:    callq  *%rax
0x455623:    mov    (%rsp),%rax
0x455627:    mov    0x8(%rsp),%rdx
0x45562c:    mov    %rax,0x20(%rsp)
0x455631:    mov    %edx,0x28(%rsp)
0x455635:    mov    0x10(%rsp),%rbp
0x45563a:    add    $0x18,%rsp
0x45563e:    retq   
0x45563f:    lea    (%rsp),%rdi
0x455643:    xor    %esi,%esi
0x455645:    mov    0xdaa7c(%rip),%rax        # 0x5300c8
0x45564c:    callq  *%rax
0x45564e:    mov    (%rsp),%rax
0x455652:    mov    0x8(%rsp),%edx
0x455656:    imul   $0x3e8,%rdx,%rdx
0x45565d:    mov    %rax,0x20(%rsp)
0x455662:    mov    %edx,0x28(%rsp)
0x455666:    mov    0x10(%rsp),%rbp
0x45566b:    add    $0x18,%rsp
0x45566f:    retq   
0x455670:    sub    $0x18,%rsp
0x455674:    mov    %rbp,0x10(%rsp)

The first thing I did was to use the bt command to analyze how the backtrace looks like, and then I showed the next 30 lines of Assembly with:

x/30i $pc

All this is not very useful unfortunately but in case of problems it can be of help.

I own the code

On the other hand if you own the code your life is just easier.

In this situation, you’ll likely to have an optimized binary in production that can’t be used for reading the core dump. If this is the case the thing you can do is to compile the same source code without the optimizations and then use the debugger to read the dump file along with it.

With Delve

dlv core ./godebugging core.1234

(dlv) bt
0  0x0000000000455594 in runtime.raise
   at /usr/lib/go/src/runtime/sys_linux_amd64.s:113
1  0x0000000000451f70 in runtime.systemstack_switch
   at /usr/lib/go/src/runtime/asm_amd64.s:298
2  0x0000000000427ac8 in runtime.dopanic
   at /usr/lib/go/src/runtime/panic.go:586
3  0x000000000042765e in runtime.gopanic
   at /usr/lib/go/src/runtime/panic.go:540
4  0x00000000004952d9 in main.main
   at ./main.go:34
5  0x00000000004292a6 in runtime.main
   at /usr/lib/go/src/runtime/proc.go:195
6  0x00000000004545c1 in runtime.goexit
   at /usr/lib/go/src/runtime/asm_amd64.s:2337

Things are so clear, bt is showing that main.go broke at line 34 leading to a panic, let’s see what happened, the problem seemed to be at frame 4 (the 4 in the backtrace), so I can print the source of that frame:

(dlv) frame 4 list
Goroutine 1 frame 4 at /home/fntlnz/go/src/github.com/fntlnz/godebugging/main.go:34 (PC: 0x4952d9)
    29:         for {
    30:                 select {
    31:                 case message := <-ch:
    32:                         fmt.Println(message)
    33:                 case <-time.After(time.Second * 3):
=>  34:                         panic("I'm panicking because of timeout!")
    35:                 }
    36:         }
    37: }

Wow! That’s a panic, I wrote it so that the program would crash if a timeout occurs, well done program.

Now I’m interested to see the state of the goroutines at the crash

(dlv) goroutines
[6 goroutines]
* Goroutine 1 - User: ./main.go:34 main.main (0x4952d9) (thread 32124)
  Goroutine 2 - User: /usr/lib/go/src/runtime/proc.go:288 runtime.gopark (0x42974c)
  Goroutine 3 - User: /usr/lib/go/src/runtime/proc.go:288 runtime.gopark (0x42974c)
  Goroutine 4 - User: /usr/lib/go/src/runtime/proc.go:288 runtime.gopark (0x42974c)
  Goroutine 5 - User: ./main.go:21 main.producer (0x4950b0)
  Goroutine 6 - User: /usr/lib/go/src/runtime/proc.go:288 runtime.gopark (0x42974c)

Nice, there where six goroutines, a few ones parked too

Between all those goroutines the one t hat seems interesting is the Goroutine 5, let’s see what’s inside:

(dlv) goroutine 5
Switched from 1 to 5 (thread 32124)
(dlv) bt
0  0x000000000042974c in runtime.gopark
   at /usr/lib/go/src/runtime/proc.go:288
1  0x000000000042983e in runtime.goparkunlock
   at /usr/lib/go/src/runtime/proc.go:293
2  0x0000000000403c0b in runtime.chansend
   at /usr/lib/go/src/runtime/chan.go:222
3  0x0000000000403993 in runtime.chansend1
   at /usr/lib/go/src/runtime/chan.go:113
4  0x00000000004950b0 in main.producer
   at ./main.go:21
5  0x00000000004545c1 in runtime.goexit
   at /usr/lib/go/src/runtime/asm_amd64.s:2337
(dlv) frame 4 ls
Goroutine 5 frame 4 at /home/fntlnz/go/src/github.com/fntlnz/godebugging/main.go:21 (PC: 0x4950b0)
    16: }
    17:
    18: func producer(ch chan<- string) {
    19:         for {
    20:                 time.Sleep(time.Second * time.Duration(rand.Intn(4))) // simulate some work
=>  21:                 ch <- messages[rand.Intn(len(messages)-1)]
    22:         }
    23: }
    24:
    25: func main() {
    26:         ch := make(chan string)

When in the Goroutine 5, I can hook in the frame 4 to see the content of the messages and ch variables:

(dlv) frame 4 p ch
chan<- string {
        qcount: 0,
        dataqsiz: 0,
        buf: *[0]string [],
        elemsize: 16,
        closed: 0,
        elemtype: *runtime._type {
                size: 16,
                ptrdata: 8,
                hash: 3774831796,
                tflag: 7,
                align: 8,
                fieldalign: 8,
                kind: 24,
                alg: *(*runtime.typeAlg)(0x5420f0),
                gcdata: *1,
                str: 6602,
                ptrToThis: 55936,},
        sendx: 0,
        recvx: 0,
        recvq: waitq<string> {
                first: *sudog<string> nil,
                last: *sudog<string> nil,},
        sendq: waitq<string> {
                first: *(*sudog<string>)(0xc4200a2000),
                last: *(*sudog<string>)(0xc4200a2000),},
        lock: runtime.mutex {key: 0},}
(dlv) frame 4 p messages
[6]string [
        "There's something that doesn't make sense. Let's go and poke it ...+13 more",
        "We're all stories, in the end.",
        "Bow ties are cool.",
        "One may tolerate a world of demons for the sake of an angel.",
        "You want weapons? We’re in a library! Books! The best weapons ...+13 more",
        "Do what I do. Hold tight and pretend it’s a plan!",
]

With GDB things are similar, but not that nice:

gdb godebugging core.1234

>>> bt
#0  runtime.raise () at /usr/lib/go/src/runtime/sys_linux_amd64.s:113
#1  0x000000000043d10b in runtime.dieFromSignal (sig=6) at /usr/lib/go/src/runtime/signal_unix.go:400
#2  0x000000000043d299 in runtime.crash () at /usr/lib/go/src/runtime/signal_unix.go:482
#3  0x00000000004281b2 in runtime.dopanic_m (gp=0xc420000180, pc=4355678, sp=842350763480) at /usr/lib/go/src/runtime/panic.go:732
#4  0x0000000000450eac in runtime.dopanic.func1 () at /usr/lib/go/src/runtime/panic.go:587
#5  0x0000000000451ff9 in runtime.systemstack () at /usr/lib/go/src/runtime/asm_amd64.s:344
#6  0x000000000042c3a0 in ?? () at /usr/lib/go/src/runtime/proc.go:1070
#7  0x0000000000544100 in runtime.sched ()
#8  0x00007fff47a3d690 in ?? ()
#9  0x0000000000544160 in ?? ()
#10 0x00007fff47a3d680 in ?? ()
#11 0x000000000042c404 in runtime.mstart () at /usr/lib/go/src/runtime/proc.go:1152
#12 0x0000000000451e21 in runtime.rt0_go () at /usr/lib/go/src/runtime/asm_amd64.s:186
#13 0x0000000000000001 in ?? ()
#14 0x00007fff47a3d6c8 in ?? ()
#15 0x0000000000000001 in ?? ()
#16 0x00007fff47a3d6c8 in ?? ()
#17 0x0000000000000000 in ?? ()

A notable point here is proc.go:1070 so I’ll dig into that frame, also it seems that there has been a panic after.

>>> frame 6
#6  0x000000000042c3a0 in ?? () at /usr/lib/go/src/runtime/proc.go:1070
1070	func startTheWorldWithSema() {
>>> l
1065	func mhelpgc() {
1066		_g_ := getg()
1067		_g_.m.helpgc = -1
1068	}
1069	
1070	func startTheWorldWithSema() {
1071		_g_ := getg()
1072	
1073		_g_.m.locks++        // disable preemption because it can be holding p in a local var
1074		gp := netpoll(false) // non-blocking

What I see is Go code but not mine. Also given that this is a dump of a specific state and not an interactive debugging session I cannot control the state so I can’t even inspect goroutines in such situation.

The most useful thing I can do now is to get a full backtrace dump printed to a file

set logging on
bt full

This creates a gdb.txt file in the current directory where I see that there’s a known variable I can try to print

>>> p main.messages
$1 = {[0] = "There's something that doesn't make sense. Let's go and poke it with a stick.", [1] = "We're all stories, in the end.", [2] = "Bow ties are cool.", [3] = "One may tolerate a world of demons for the sake of an angel.", [4] = "You want weapons? We’re in a library! Books! The best weapons in the world!", [5] = "Do what I do. Hold tight and pretend it’s a plan!"}

However no luck in getting the right place in my own code where the panic occurred :/

The program runs in a Kubernetes cluster

In case your crashed program runs in a kubernetes cluster the good news is that you can just get your core dumps inside the machines and analyze them like if they weren’t containers (unexpected right?).

The bad news is that in case of a cluster with a number of nodes you will need some effort to understand on which node the process crasheed and the way to get the core dump may not be all that straightforward.

However the other good news is that there’s a pull request kubernetes/community#1311 to add a crd just to do that, seems magical!

In the past on this kind of distributed systems (prior to kubernetes), I was used to write a program to be used in /proc/sys/kernel/core/core_pattern that would write the dumps to a shared filesystem, usually NFS.