Sunday, 15 May 2016

CentOS/RHEL 7 kernel dump & debug

Applies : CentOS / RHEL / OEL 7 

Arch : x86_64

When kdump is enabled, a second (capture) kernel is loaded into a small region of memory reserved at boot. If the running system crashes, kexec boots into this capture kernel, whose only purpose is to save the core dump image (vmcore). Analyzing the core dump helps significantly in determining the exact cause of the system failure.

Configuring kdump :

The kdump service is provided by the kexec-tools package, which must be installed:

# yum install kexec-tools

Set the amount of memory reserved for kdump by adding the crashkernel=<size> parameter to GRUB_CMDLINE_LINUX in /etc/default/grub:


# cat /etc/default/grub
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="rd.lvm.lv=centos/swap vconsole.font=latarcyrheb-sun16 rd.lvm.lv=centos/root crashkernel=128M  vconsole.keymap=us rhgb quiet"
GRUB_DISABLE_RECOVERY="true"
#
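
The edit above can also be scripted. The following is a minimal sketch, assuming GNU sed and a GRUB_CMDLINE_LINUX line wrapped in double quotes as in the listing above; set_crashkernel is a hypothetical helper name, not part of kexec-tools, and the demo works on a temporary file rather than the real /etc/default/grub.

```shell
#!/bin/sh
# Sketch: set or replace crashkernel=<size> inside GRUB_CMDLINE_LINUX.
# set_crashkernel is a hypothetical helper, not shipped with any package.
set_crashkernel() {
    file=$1   # grub defaults file, e.g. a copy of /etc/default/grub
    size=$2   # reservation size, e.g. 128M

    if grep -q '^GRUB_CMDLINE_LINUX=.*crashkernel=' "$file"; then
        # A value is already present: replace it in place.
        sed -i "/^GRUB_CMDLINE_LINUX=/s/crashkernel=[^ \"]*/crashkernel=$size/" "$file"
    else
        # No value yet: append it just before the closing quote.
        sed -i "/^GRUB_CMDLINE_LINUX=/s/\"\$/ crashkernel=$size\"/" "$file"
    fi
}

# Demo on a throwaway file so nothing on the system is modified.
tmp=$(mktemp)
printf '%s\n' 'GRUB_TIMEOUT=5' 'GRUB_CMDLINE_LINUX="rhgb quiet"' > "$tmp"
set_crashkernel "$tmp" 128M
cat "$tmp"
rm -f "$tmp"
```

On a real system you would back up /etc/default/grub first and then point the helper at it.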

Regenerate the GRUB configuration and reboot for the kernel parameter to take effect:

# grub2-mkconfig -o /boot/grub2/grub.cfg
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-3.10.0-123.el7.x86_64
Found initrd image: /boot/initramfs-3.10.0-123.el7.x86_64.img
Warning: Please don't use old title `CentOS Linux, with Linux 3.10.0-123.el7.x86_64' for GRUB_DEFAULT, use `Advanced options for CentOS Linux>CentOS Linux, with Linux 3.10.0-123.el7.x86_64' (for versions before 2.00) or `gnulinux-advanced-1a06e03f-ad9b-44bf-a972-3a821fca1254>gnulinux-3.10.0-123.el7.x86_64-advanced-1a06e03f-ad9b-44bf-a972-3a821fca1254' (for 2.00 or later)
Found linux image: /boot/vmlinuz-0-rescue-ae1ddf63f5e04857b5e89cd8fcf1f9e1
Found initrd image: /boot/initramfs-0-rescue-ae1ddf63f5e04857b5e89cd8fcf1f9e1.img
done
#

Modify kdump settings in /etc/kdump.conf

By default, the vmcore is stored in the /var/crash directory. To dump to a different partition, disk, or an NFS share instead, define the target here:

ext3 /dev/sdd1
or
nfs nfs.yourdomain.com:/export/dump

Compress the vmcore file to reduce its size:
core_collector makedumpfile -c

The default directive sets what the capture kernel does if saving the vmcore to the configured target fails. To reboot in that case:
default reboot
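
Because /etc/kdump.conf ships with many commented examples, it helps to review only the directives that are actually in effect. The following is a minimal sketch; active_directives is a hypothetical helper name, and the sample file only mirrors the settings discussed above (path /var/crash being the default location).

```shell
#!/bin/sh
# Sketch: print only the effective lines of a kdump.conf-style file.
# active_directives is a hypothetical helper name.
active_directives() {
    # Drop full-line comments and blank lines; keep everything else as written.
    grep -v -e '^[[:space:]]*#' -e '^[[:space:]]*$' "$1"
}

# Demo on a sample file; on a real system pass /etc/kdump.conf instead.
sample=$(mktemp)
cat > "$sample" <<'EOF'
# sample settings for illustration only
path /var/crash

core_collector makedumpfile -c
default reboot
EOF
active_directives "$sample"
rm -f "$sample"
```

After changing /etc/kdump.conf, restart the kdump service so the new settings are picked up.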

After the reboot, confirm that crashkernel is on the kernel command line, then enable and start kdump:

# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-123.el7.x86_64 root=UUID=1a06e03f-ad9b-44bf-a972-3a821fca1254 ro rd.lvm.lv=centos/swap vconsole.font=latarcyrheb-sun16 rd.lvm.lv=centos/root crashkernel=128M vconsole.keymap=us rhgb quiet

# grep -v  '#' /etc/sysconfig/kdump | sed '/^$/d'
KDUMP_KERNELVER=""
KDUMP_COMMANDLINE=""
KDUMP_COMMANDLINE_APPEND="irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug"
KEXEC_ARGS=""
KDUMP_BOOTDIR="/boot"
KDUMP_IMG="vmlinuz"
KDUMP_IMG_EXT=""
#

# systemctl enable kdump.service
# systemctl start kdump.service
# systemctl is-active kdump
active
#
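
The checks above can be folded into a small script. This is a minimal sketch with hypothetical helper names: crashkernel_arg extracts the crashkernel= value from a command line string, and crash_kernel_loaded reads /sys/kernel/kexec_crash_loaded, which the kernel sets to 1 once a capture kernel has been loaded.

```shell
#!/bin/sh
# Sketch: verify that memory is reserved and the capture kernel is loaded.
# Both function names are hypothetical helpers.

# Print the crashkernel= value found in a kernel command line string.
crashkernel_arg() {
    # $1: full command line, e.g. "$(cat /proc/cmdline)"
    for word in $1; do
        case $word in
            crashkernel=*)
                printf '%s\n' "${word#crashkernel=}"
                return 0
                ;;
        esac
    done
    return 1
}

# Succeed when the flag file contains 1 (capture kernel loaded).
crash_kernel_loaded() {
    flag=${1:-/sys/kernel/kexec_crash_loaded}
    [ -r "$flag" ] && [ "$(cat "$flag")" = "1" ]
}

# Demo on a sample command line taken from the output above.
sample='BOOT_IMAGE=/vmlinuz-3.10.0-123.el7.x86_64 ro crashkernel=128M rhgb quiet'
crashkernel_arg "$sample"
```

On a live system, crashkernel_arg "$(cat /proc/cmdline)" followed by crash_kernel_loaded gives a quick yes/no answer before you trigger a test crash.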

Test your configuration by forcing a kernel panic through SysRq (this crashes the system immediately):

# echo 1 > /proc/sys/kernel/sysrq
# echo c > /proc/sysrq-trigger



After the reboot, the vmcore generated by the crash is available. To analyse it, install the crash utility and the kernel debuginfo packages.

# yum install crash

I downloaded the debuginfo packages from https://oss.oracle.com/ol7/debuginfo/. Check your kernel version (uname -r) and download the debuginfo packages that match it exactly.

# rpm -ivh kernel-debuginfo-common-x86_64-3.10.0-123.el7.x86_64.rpm \
               kernel-debuginfo-3.10.0-123.el7.x86_64.rpm \
               kernel-debug-debuginfo-3.10.0-123.el7.x86_64.rpm
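
Since the debuginfo version has to match the kernel that produced the dump, the file names can be derived from the kernel release string. A minimal sketch, assuming the naming pattern seen in the rpm command above; debuginfo_packages and vmlinux_path are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: derive debuginfo file names and the vmlinux path from a kernel release.
# Both function names are hypothetical helpers.

debuginfo_packages() {
    kver=$1             # e.g. 3.10.0-123.el7.x86_64
    arch=${kver##*.}    # trailing component of the release, e.g. x86_64
    printf 'kernel-debuginfo-common-%s-%s.rpm\n' "$arch" "$kver"
    printf 'kernel-debuginfo-%s.rpm\n' "$kver"
}

vmlinux_path() {
    # Location where kernel-debuginfo installs the unstripped kernel image.
    printf '/usr/lib/debug/lib/modules/%s/vmlinux\n' "$1"
}

# Demo with the kernel release used in this article.
debuginfo_packages 3.10.0-123.el7.x86_64
vmlinux_path 3.10.0-123.el7.x86_64
```

Passing "$(uname -r)" is only correct when the machine is still running the same kernel that crashed.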

# ls -lh /var/crash/127.0.0.1-2016.05.15-04\:50\:40/vmcore
-rw-------. 1 root root 168M May 15 04:51 /var/crash/127.0.0.1-2016.05.15-04:50:40/vmcore
#
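
Each crash creates a new timestamped directory, so the path has to be looked up every time. A minimal sketch that picks the most recently modified vmcore, assuming the <dir>/<host-timestamp>/vmcore layout shown above; latest_vmcore is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: locate the newest vmcore below a crash directory.
# latest_vmcore is a hypothetical helper name.
latest_vmcore() {
    dir=${1:-/var/crash}
    # ls -t sorts by modification time, newest first; head keeps the first hit.
    # Prints nothing when no vmcore exists.
    ls -1t "$dir"/*/vmcore 2>/dev/null | head -n 1
}

# Prints the newest dump under /var/crash, or nothing if there is none.
latest_vmcore /var/crash
```

The result can be passed straight to crash in place of the hand-typed, backslash-escaped path.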

# crash /var/crash/127.0.0.1-2016.05.15-04\:50\:40/vmcore /usr/lib/debug/lib/modules/`uname -r`/vmlinux

WARNING: kernel version inconsistency between vmlinux and dumpfile

      KERNEL: /usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux
    DUMPFILE: /var/crash/127.0.0.1-2016.05.15-04:50:40/vmcore
        CPUS: 1
        DATE: Sun May 15 04:50:38 2016
      UPTIME: 00:10:24
LOAD AVERAGE: 0.02, 0.07, 0.05
       TASKS: 104
    NODENAME: slnxcen01
     RELEASE: 3.10.0-123.el7.x86_64
     VERSION: #1 SMP Mon Jun 30 12:09:22 UTC 2014
     MACHINE: x86_64  (2294 Mhz)
      MEMORY: 1.4 GB
       PANIC: "Oops: 0002 [#1] SMP " (check log for details)
         PID: 2266
     COMMAND: "bash"
        TASK: ffff880055650b60  [THREAD_INFO: ffff880053fb2000]
         CPU: 0
       STATE: TASK_RUNNING (PANIC)

crash>


crash> bt
PID: 2266   TASK: ffff880055650b60  CPU: 0   COMMAND: "bash"
 #0 [ffff880053fb3a98] machine_kexec at ffffffff81041181
 #1 [ffff880053fb3af0] crash_kexec at ffffffff810cf0e2
 #2 [ffff880053fb3bc0] oops_end at ffffffff815ea548
.
.
.
crash> files
PID: 2266   TASK: ffff880055650b60  CPU: 0   COMMAND: "bash"
ROOT: /    CWD: /root
 FD       FILE            DENTRY           INODE       TYPE PATH
  0 ffff880053c47a00 ffff8800563383c0 ffff880055bad2f0 CHR  /dev/tty1
  1 ffff8800542a9100 ffff88004dd4ff00 ffff88004dc0b750 REG  /proc/sysrq-trigger
.
.
.
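
The same commands can be run non-interactively to produce a report. A minimal sketch, assuming crash's -i option (read commands from a file) and -s option (skip the startup banner); write_crash_script is a hypothetical helper name, and the crash invocation itself is left as a comment because it needs a real vmcore and vmlinux.

```shell
#!/bin/sh
# Sketch: write a command file for a non-interactive crash session.
# write_crash_script is a hypothetical helper name.
write_crash_script() {
    out=$1
    # bt and files are the commands used in the session above; exit ends the run.
    printf '%s\n' 'bt' 'files' 'exit' > "$out"
}

cmds=$(mktemp)
write_crash_script "$cmds"
cat "$cmds"

# Example invocation (not executed here; paths are placeholders):
#   crash -s -i "$cmds" /path/to/vmlinux /path/to/vmcore > crash-report.txt
rm -f "$cmds"
```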
That concludes the article.

References :