Thursday, 23 May 2013

Crash dump kernel - kdump

This is a fairly niche topic, but I found it interesting and so dabbled in the kernel a bit. I have covered only some of the basics of kernel dumps and their configuration parameters.
Have fun reading below.

I tested kernel dumping on CentOS-6 (32-bit) with kernel version 2.6.32-279.

Kernel Dump:

A dump means the Linux system writes out the contents of its memory when a crash occurs, so that it can be analyzed later for the root cause of the crash.

When a kernel crash occurs, for whatever reason, we must be able to recover from the problem as quickly as possible while collecting as much data as is available.
The most relevant piece of information for system administrators is the memory dump, taken at the moment of the kernel crash.

There are two ways of collecting the kernel dumps.

1. LKCD.
2. Kdump. 

-  LKCD - Linux Kernel Crash Dump 

LKCD works in two stages.

Stage 1 : Stage when kernel crashes -

LKCD copies the contents of memory to a temporary storage device called the dump device. This is usually a swap partition, but it may also be a dedicated crash dump collection partition.

Stage 2: Once the system boots back online, LKCD is initiated. 

Next, LKCD runs two commands:
1. lkcd config - which prepares system for next crash.
2. lkcd save - which copies the crash dump data from its temp storage on the dump device to the permanent storage directory called "dump directory".

Lastly, LKCD is a somewhat old utility and might not work well on the modern kernels. In general, it is fairly safe to say it has been replaced by the more flexible Kdump.

Disadvantages of LKCD:

LKCD was unable to save memory dumps to local RAID (md) devices and its network capability was restricted to sending memory cores to dedicated LKCD netdump servers only on the same subnet, provided the cores were under 4GB in size. Memory cores exceeding the 32-bit size barrier were corrupt upon transfer and thus unavailable for analysis.

Kdump overcomes the limitations above.
Kdump is a much more flexible tool with extended network-aware capabilities. It supports dumping to a range of targets: local disks, NFS areas, CIFS shares, and FTP and SSH servers.

How does Kdump work?

Kdump has two main components.

1. Kdump - the crash dump is captured from a freshly booted kernel and not from the context of the crashed kernel. Kdump uses Kexec to boot into a second kernel whenever the system crashes. This second kernel, often called the crash kernel, boots with a very small amount of memory that the first kernel reserved for it, and captures the dump image.
2. Kexec - Kexec is a fastboot mechanism that allows booting a Linux kernel from the context of an already running kernel without going through the BIOS. Going through the BIOS can be very time consuming, especially on big servers with numerous peripherals.
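
To see whether a crash kernel is actually loaded and waiting, one can read sysfs. This is only a minimal sketch: it assumes the running kernel exposes /sys/kernel/kexec_crash_loaded (which reads 1 when a crash kernel is loaded), and the function name crash_kernel_loaded is my own, not part of any package.

```shell
# Report whether a crash (capture) kernel is currently loaded via kexec.
# $1 (optional): path to the state file; defaults to the sysfs entry.
# The override exists only so the function can be tried on a test file.
crash_kernel_loaded() {
    state_file="${1:-/sys/kernel/kexec_crash_loaded}"
    if [ ! -r "$state_file" ]; then
        echo "unknown"
        return 2
    fi
    if [ "$(cat "$state_file")" = "1" ]; then
        echo "loaded"
        return 0
    fi
    echo "not loaded"
    return 1
}
```

Calling crash_kernel_loaded with no arguments checks the live system; "unknown" means the sysfs entry could not be read at all.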

Kdump installation:
- Your production kernel must be compiled with a certain set of parameters for kernel crash dumping.
- Your system must have the kdump tooling installed. [ kdump is part of the kexec-tools package ]

NOTE: The version of the kexec-tools package has to be identical to the standard kernel.

- Backup necessaries:
     # cp /boot/grub/grub.conf /boot/grub/grub.conf-$(date '+%d-%b-%Y')
     # cp /boot/grub/menu.lst /boot/grub/menu-$(date '+%d-%b-%Y')
     # cp /etc/sysconfig/kdump /etc/sysconfig/kdump-$(date '+%d-%b-%Y')
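
The same backups can be wrapped in a small helper so that missing files are skipped instead of aborting. This is just a sketch; backup_with_date is a hypothetical name, and it appends the date to the full file name (so menu.lst becomes menu.lst-<date>, slightly different from the commands above).

```shell
# Copy each existing file to <file>-<DD-Mon-YYYY>, preserving mode and timestamps.
# Files that do not exist are reported on stderr and skipped.
backup_with_date() {
    stamp=$(date '+%d-%b-%Y')
    for f in "$@"; do
        if [ -f "$f" ]; then
            cp -p "$f" "$f-$stamp" && echo "backed up $f -> $f-$stamp"
        else
            echo "skipping $f (not found)" >&2
        fi
    done
}

# Example (run as root):
# backup_with_date /boot/grub/grub.conf /boot/grub/menu.lst /etc/sysconfig/kdump
```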

Install the required packages:
NOTE : Enable the debug repository so that the debug packages can be installed.

#yum --enablerepo=debug install kexec-tools crash kernel-debug kernel-debuginfo-`uname -r`

If you are compiling your own kernel, consider the parameters below.

 - The settings that control how the dump is taken (the KDUMP_* entries further down) go in the "kdump" configuration file.

- Enable kernel crash dumps - Crash dumps need to be enabled. Without this option, Kdump will be useless.
  CONFIG_CRASH_DUMP=y 

- Enable high memory support - on a 32-bit kernel, set this parameter so the kernel can use physical memory above the low-memory limit, up to 4GB. A 64-bit kernel does not need this option.
  CONFIG_HIGHMEM4G=y 

- Enable /proc/vmcore support - this lets the crash kernel expose the memory of the crashed kernel as /proc/vmcore, from where Kdump saves the dump
  CONFIG_PROC_VMCORE=y

- Configure the kernel with debug info - While this will increase the size of the kernel image, having the symbols available is very useful for in-depth analysis of kernel crashes, as it allows you to trace the problems not only to problematic function calls causing the crashes, but also the specific lines in relevant sources.
  CONFIG_DEBUG_INFO=y

- Configure the physical start address for the crash kernel - the crash kernel runs from a piece of memory specially reserved for it, so its load address must fall inside that reserved region
  CONFIG_PHYSICAL_START=0x1000000

- Configure kdump kernel so it can be identified - Setting this suffix allows kdump to select the right kernel for boot, since there may be several kernels under /boot on your system 
  CONFIG_LOCALVERSION="-kdump"

- Configure KDUMP_RUNLEVEL - It defines the runlevel into which the crash kernel should boot. If you want Kdump to save crash dumps only to a local device, you can set the runlevel to 1. If you want Kdump to save dumps to a network storage area, like NFS, CIFS or FTP, you need the network functionality, which means the runlevel should be set to 3.

- Configure KDUMP_IMMEDIATE_REBOOT - This directive tells Kdump whether to reboot out of the crash kernel once the dump is complete 
  KDUMP_IMMEDIATE_REBOOT="yes"

- Configure KDUMP_KEEP_OLD_DUMPS - This setting defines how many dumps should be kept before rotating (default = 5).
  KDUMP_KEEP_OLD_DUMPS=5

- Configure KDUMP_DUMPFORMAT - This defines the format in which the dump is written.
  KDUMP_DUMPFORMAT="ELF"
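
If you are running a distribution kernel rather than compiling your own, you can check which of the CONFIG_ options above are already enabled. This is a rough sketch, assuming the kernel config is installed as /boot/config-$(uname -r), which is common but not guaranteed. check_kdump_config is a hypothetical name, and CONFIG_KEXEC is my own addition to the list, on the assumption that Kexec support must be built in as well.

```shell
# Print the status of kdump-related options found in a kernel config file.
# $1 (optional): config file; defaults to the running kernel's config in /boot.
# Returns 0 if all options are enabled, 1 if any is missing, 2 if unreadable.
check_kdump_config() {
    cfg="${1:-/boot/config-$(uname -r)}"
    if [ ! -r "$cfg" ]; then
        echo "cannot read $cfg" >&2
        return 2
    fi
    missing=0
    # CONFIG_KEXEC is not in the list above; it is included here as an assumption.
    for opt in CONFIG_KEXEC CONFIG_CRASH_DUMP CONFIG_PROC_VMCORE CONFIG_DEBUG_INFO; do
        if grep -q "^${opt}=y" "$cfg"; then
            echo "$opt: enabled"
        else
            echo "$opt: NOT enabled"
            missing=1
        fi
    done
    return "$missing"
}
```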

GRUB memory changes
Kdump works by booting from the context of the crashed kernel. In order for this feature to work, the crash kernel must have a section of memory available, even when the production kernel crashes.
Now, we need to declare how much RAM we want to give our crash kernel. 
If the RAM is smaller than 512M, then don't reserve anything.
If the RAM size is between 512M and 2G (exclusive), then reserve 64M.
If the RAM size is 2G or larger, then reserve 128M.
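
The sizing rule above is easy to put into a function. This is only a sketch of that rule: the function names are mine, and I am assuming that exactly 2G falls into the 128M tier. Note that MemTotal in /proc/meminfo is a little lower than the installed RAM, so a machine right at a boundary may land in the lower tier.

```shell
# Suggest a crashkernel= reservation for a given amount of RAM (in MB).
# Prints "0" when nothing should be reserved.
crashkernel_for_ram_mb() {
    ram_mb="$1"
    if [ "$ram_mb" -lt 512 ]; then
        echo "0"
    elif [ "$ram_mb" -lt 2048 ]; then
        echo "64M"
    else
        echo "128M"
    fi
}

# Total RAM in MB as reported by the kernel (MemTotal is given in kB).
# $1 (optional): meminfo file; defaults to /proc/meminfo.
total_ram_mb() {
    awk '/^MemTotal:/ { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}

# Example:
# crashkernel_for_ram_mb "$(total_ram_mb)"
```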

- Modify your grub entry: append the memory size to your kernel line.
kernel /vmlinuz ..... .... ... 
   /dev/mapper/VolGroup-rootvg rd ......   .....     quiet crashkernel=128M
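
After the next reboot you can confirm that the kernel really booted with the crashkernel parameter by looking at /proc/cmdline. A small sketch (crashkernel_param is a hypothetical helper name):

```shell
# Print the value of crashkernel= from a kernel command line.
# $1 (optional): file holding the command line; defaults to /proc/cmdline.
# Returns 1 and prints nothing if the parameter is absent.
crashkernel_param() {
    value=$(tr ' ' '\n' < "${1:-/proc/cmdline}" | sed -n 's/^crashkernel=//p' | head -n 1)
    if [ -n "$value" ]; then
        echo "$value"
        return 0
    fi
    return 1
}

# Example:
# crashkernel_param || echo "crashkernel is not set on the running kernel"
```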

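- Start kdump and set it to start at boot time.
  #service kdump start
  #chkconfig kdump on
If the service fails to start, reboot the system first: the crashkernel memory is only reserved at boot, so the new GRUB setting does not take effect until then. After the reboot, start and enable the kdump service again.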

- Check your kdump status
 #service kdump status
Kdump is operational

- Test dump your kernel. ( trigger a kernel crash dump )
#echo 1 > /proc/sys/kernel/sysrq
#echo c > /proc/sysrq-trigger

- Once your server has rebooted, the crash file is written to the path set in the config file ( /etc/kdump.conf ). In my case it is /var/crash.
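
To find where the dump landed without opening the config file by hand, something like the following can help. It is a sketch under a few assumptions: that the dump directory is set by a "path" directive in /etc/kdump.conf, that /var/crash is used when no such directive is present, and that the dump goes to the local filesystem (with a separate partition or a network target, the path is relative to that target instead). Both function names are made up for this post.

```shell
# Print the dump directory set by the "path" directive in kdump.conf.
# Falls back to /var/crash when the directive is absent or commented out.
# $1 (optional): config file; defaults to /etc/kdump.conf.
kdump_dump_path() {
    conf="${1:-/etc/kdump.conf}"
    dir=""
    if [ -r "$conf" ]; then
        dir=$(awk '$1 == "path" { print $2; exit }' "$conf")
    fi
    echo "${dir:-/var/crash}"
}

# List vmcore files found under the configured dump directory.
list_vmcores() {
    find "$(kdump_dump_path "$1")" -type f -name 'vmcore*' 2>/dev/null
}
```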

- Detailed analysis can be done using the crash utility, which will be covered in the next post on this blog.