Less known Solaris features: About crashes and cores - Part 4: Crashdump analysis for beginners

Okay, now that you have all these crash and core dumps, it would be nice to do something useful with them. I will show you a few basic tricks to get some insight into the state of a system at the moment it wrote a crash dump.

Basic analysis of a crash dump with mdb

First we load the dump into mdb:

# mdb unix.0 vmcore.0
Loading modules: [ unix genunix specfs cpu.generic uppc scsi_vhci ufs ip hook neti sctp arp usba nca lofs zfs random nsctl sdbc rdc sppp ]

A useful piece of information is the backtrace. It helps you find out what triggered the crash dump. In this case it's easy: it was the uadmin syscall.

> $c
vpanic(fea6388c)
kadmin+0x10c(5, 0, 0, db39e550)
uadmin+0x8e()

But it would be nice to know more about the state of the system at the moment of the crash. For example, we can print the process table of the system, as we would with ps:

> ::ps
S    PID   PPID   PGID    SID    UID      FLAGS     ADDR NAME
R      0      0      0      0      0 0x00000001 fec1d3d0 sched
[...]
R    586      1    586    586      0 0x42000000 d55f58a8 sshd
R    545      1    545    545      0 0x42000000 d5601230 fmd
R    559      1    559    559      0 0x42000000 d55fb128 syslogd
[...]
R    533    494    494    494      0 0x4a014000 d55f19c0 ttymon

We can even look up which files or sockets were open at the moment of the crash dump. For example, we want to know the open files of the ssh daemon. To get this information, we take the address of the process from the process table (the eighth column) and append "::pfiles":

> d55f58a8::pfiles
   0  CHR d597d540 /devices/pseudo/mm@0:null
   1  CHR d597d540 /devices/pseudo/mm@0:null
   2  CHR d597d540 /devices/pseudo/mm@0:null
   3 SOCK db688300 socket: AF_INET6 :: 22
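
If you have saved the ::ps output to a text file, picking the address out of the eighth column can be scripted. The following is only a minimal Python sketch under that assumption; the function name find_proc_addrs and the sample text are made up for illustration and are not part of mdb.

```python
def find_proc_addrs(ps_output, name):
    """Return the ADDR column of every ::ps row whose NAME equals `name`."""
    addrs = []
    for line in ps_output.splitlines():
        # Columns: S PID PPID PGID SID UID FLAGS ADDR NAME
        fields = line.strip().split(None, 8)
        # Skip the header, "[...]" markers and anything that is not a process row.
        if len(fields) < 9 or not fields[1].isdigit():
            continue
        if fields[8] == name:
            addrs.append(fields[7])
    return addrs


# Sample rows copied from the ::ps listing above.
sample = """\
S    PID   PPID   PGID    SID    UID      FLAGS     ADDR NAME
R      0      0      0      0      0 0x00000001 fec1d3d0 sched
R    586      1    586    586      0 0x42000000 d55f58a8 sshd
R    559      1    559    559      0 0x42000000 d55fb128 syslogd
"""

# Print the command you would type at the mdb prompt for each match.
for addr in find_proc_addrs(sample, "sshd"):
    print(f"{addr}::pfiles")
```

The check on the PID column (rather than on the state column) is deliberate: the header line starts with "S", and a process state may be printed with the same letter.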

And here we look at the open files of syslogd:

> d55fb128::pfiles
   0  DIR d5082a80 /
   1  DIR d5082a80 /
   2  DIR d5082a80 /
   3 DOOR d699b300 /var/run/name_service_door [door to 'nscd' (proc=d5604890)]
   4  CHR db522cc0 /devices/pseudo/sysmsg@0:sysmsg
   5  REG db643840 /var/adm/messages
   6  REG db6839c0 /var/log/syslog
   7  CHR db522840 /devices/pseudo/log@0:log
   8 DOOR db6eb300 [door to 'syslogd' (proc=d55fb128)]
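
For a process with many descriptors it can be handy to filter such a listing by type. Again, this is only a sketch in Python that works on saved ::pfiles output; the function name parse_pfiles is hypothetical, and the third column is simply treated as an opaque address.

```python
def parse_pfiles(pfiles_output):
    """Split each ::pfiles row into (fd, type, address, description)."""
    rows = []
    for line in pfiles_output.splitlines():
        fields = line.strip().split(None, 3)
        # A descriptor row starts with a numeric fd; skip everything else.
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        fd, ftype, addr = fields[:3]
        desc = fields[3] if len(fields) > 3 else ""
        rows.append((int(fd), ftype, addr, desc))
    return rows


# Sample rows copied from the syslogd listing above.
sample = """\
   3 DOOR d699b300 /var/run/name_service_door [door to 'nscd' (proc=d5604890)]
   4  CHR db522cc0 /devices/pseudo/sysmsg@0:sysmsg
   5  REG db643840 /var/adm/messages
   6  REG db6839c0 /var/log/syslog
   8 DOOR db6eb300 [door to 'syslogd' (proc=d55fb128)]
"""

# Keep only the regular files, i.e. the logs syslogd was writing to.
regular = [desc for _, ftype, _, desc in parse_pfiles(sample) if ftype == "REG"]
print(regular)
```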

As the crash dump contains all the pages of the kernel (or more, if you configure it that way), you have a frozen state of your system in which you can investigate everything you want. And to get back to my security example: with the crash dump and mdb you can gather really interesting information. For example, you can see that an ssh connection was open at the time of the crash dump.

> ::netstat
TCPv4    St   Local Address        Remote Address       Stack       Zone
db35f980  0    0    0

An example from the field

You can do it like the pros and look at source code and crash dump side by side to find the root cause of an error (or like some colleagues at the Sun Mission Critical Support Center: it wouldn't surprise me if they found the error by laying a hand on the system). For everyone else, there is a simpler way to analyse a crash dump and get at least a little more information to search for in a bug database. I will use a crash I analysed a long time ago to show you the trick. First you have to start a debugger. I used mdb in this example:

bash-3.00# mdb -k unix.4 vmcore.4
Loading modules: [ unix krtld genunix specfs dtrace cpu.AuthenticAMD.15 uppc pcplusmp ufs md ip sctp usba fcp fctl nca lofs cpc fcip random crypto zfs logindmux ptm sppp nfs ipc ]

A prompt appears; just type $C to get a stack trace.

> $C
fffffe80000b9650 vpanic()
fffffe80000b9670 0xfffffffffb840459()
fffffe80000b96e0 segmap_unlock+0xe5()
fffffe80000b97a0 segmap_fault+0x2db()
fffffe80000b97c0 snf_smap_desbfree+0x76()
fffffe80000b97e0 dblk_lastfree_desb+0x17()
fffffe80000b9800 dblk_decref+0x66()
fffffe80000b9830 freeb+0x7b()
fffffe80000b99b0 tcp_rput_data+0x1986()
fffffe80000b99d0 tcp_input+0x38()
fffffe80000b9a10 squeue_enter_chain+0x16e()
fffffe80000b9ac0 ip_input+0x18c()
fffffe80000b9b50 i_dls_link_ether_rx+0x153()
fffffe80000b9b80 mac_rx+0x46()
fffffe80000b9bd0 bge_receive+0x98()
fffffe80000b9c10 bge_intr+0xaf()
fffffe80000b9c60 av_dispatch_autovect+0x78()
fffffe80000b9c70 intr_thread+0x50()
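
A trace like this is easier to work with once it is reduced to plain function names. The following Python sketch does that for a trace saved as text. It assumes the two layouts shown in this article ($c without and $C with a leading frame pointer); the name frame_names is made up for illustration.

```python
import re

# Matches "name+0xoff(" or "name(" at the start of a line. The optional
# first group skips the frame pointer that $C prints before the function.
FRAME_RE = re.compile(r"^(?:[0-9a-f]+\s+)?([A-Za-z_][\w.]*)(?:\+0x[0-9a-f]+)?\(")


def frame_names(trace):
    """Return the function names of a stack trace, top of stack first."""
    names = []
    for line in trace.splitlines():
        match = FRAME_RE.match(line.strip())
        # Frames shown only as a raw address (0x...) have no name; skip them.
        if match:
            names.append(match.group(1))
    return names


# The first frames of the trace above.
sample = """\
fffffe80000b9650 vpanic()
fffffe80000b9670 0xfffffffffb840459()
fffffe80000b96e0 segmap_unlock+0xe5()
fffffe80000b97a0 segmap_fault+0x2db()
fffffe80000b97c0 snf_smap_desbfree+0x76()
fffffe80000b97e0 dblk_lastfree_desb+0x17()
"""

for name in frame_names(sample):
    print(name)
```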

Okay, now start at the top of the trace and strip all lines that belong to the operating system's infrastructure for error handling. vpanic() generates the panic. The second line is useless for our purposes, too. The next two lines with segmap are consequences of the error, not the root cause. The interesting line is snf_smap_desbfree. With this name you can go to SunSolve or bugs.opensolaris.org. Et voilà: "System panic due to recursive mutex_enter in snf_smap_desbfree trying to re-aquire Tx mutex". When you type this error into the PatchFinder, you will find a patch fixing this bug: 124255-03. Two hints: