Monday 23 May 2016

NFS common errors and troubleshooting - Linux/Unix

These are some of the most common NFS errors that Linux/Unix system admins run into every now and then, so I decided to put them in one place. Hope this helps.

Environment: Linux/Unix

Error: "Server Not Responding"

Check that both the NFS server and the client are online and functional; RPC messages must be able to travel between them.

Use ping and traceroute to check that they can reach each other. If they cannot, check the NIC link state with ethtool and verify the IP address configuration.

Sometimes heavy server or network load causes RPC responses to time out, which produces this error. Try increasing the timeout mount option (timeo).
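As a rough sketch of raising the timeout: the NFS mount options timeo (in tenths of a second) and retrans (number of retries) control how long the client waits for an RPC reply. The helper name, the export path /export/data and the values 120 s / 5 retries below are only illustrative assumptions, not recommendations.

```shell
# Build an NFS mount command with a longer RPC timeout.
# timeo is given to mount in tenths of a second, so seconds are multiplied by 10.
nfs_mount_cmd() {
    server=$1; export_path=$2; mountpoint=$3
    timeo_seconds=${4:-120}    # illustrative value, tune for your network
    retrans=${5:-5}            # illustrative value
    timeo=$((timeo_seconds * 10))
    printf 'mount -t nfs -o timeo=%s,retrans=%s %s:%s %s\n' \
        "$timeo" "$retrans" "$server" "$export_path" "$mountpoint"
}

# Prints the command instead of running it, so it can be reviewed first.
nfs_mount_cmd nfsserver /export/data /nfsmount
```

The function only prints the mount command; run the printed line as root once the values look right.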

Error: "rpc mount export: RPC: Timed out"

The NFS server or client was unable to resolve a host name. Check that both forward and reverse name resolution work.
Check your DNS servers or /etc/hosts.
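One way to test both directions at once is with getent, which consults /etc/hosts and DNS in the order set in /etc/nsswitch.conf. This is a minimal sketch; the function name check_dns is my own.

```shell
# Forward lookup (name -> address), then reverse lookup (address -> name).
check_dns() {
    name=$1
    addr=$(getent hosts "$name" | awk 'NR==1 {print $1}')
    if [ -z "$addr" ]; then
        echo "forward lookup FAILED for $name"
        return 1
    fi
    back=$(getent hosts "$addr" | awk 'NR==1 {print $2}')
    if [ -z "$back" ]; then
        echo "reverse lookup FAILED for $addr"
        return 1
    fi
    echo "$name -> $addr -> $back"
}
# usage: check_dns nfsserver   (run it on both the client and the server)
```

If the name that comes back differs from the one you started with, compare it against the host names used in /etc/exports.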

Error: "Access Denied" or "Permission Denied"

Check the export permissions for the NFS file systems.
# showmount -e nfsserver  ==> on the client
# exportfs -v  ==> on the server (exportfs -ra re-exports everything after /etc/exports is edited)

Check that /etc/exports has no syntax problems (e.g. stray spaces, wrong permissions, typos).
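The stray-space problem is easy to miss: in "/data client (rw)" the space makes exportfs read two entries, "client" with default options and "(rw)" for every host. Here is a small sketch that flags such lines; lint_exports is a made-up helper name and it only catches this one mistake.

```shell
# Flag /etc/exports lines where whitespace separates a client from its
# option list, e.g. "/data client (rw)".
lint_exports() {
    file=${1:-/etc/exports}
    awk '
        /^[ \t]*(#|$)/ { next }              # skip comments and blank lines
        /[^ \t][ \t]+\(/ {                   # something, a gap, then "("
            printf "line %d: space before option list: %s\n", NR, $0
            bad = 1
        }
        END { exit bad }                     # non-zero when a line was flagged
    ' "$file"
}
# usage: lint_exports            (checks /etc/exports)
#        lint_exports /tmp/exports.new
```

A flagged line is not always wrong (a bare "(ro)" for all hosts is legal), but it is worth a second look.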

Error: "RPC: Port mapper failure - RPC: Unable to receive"

NFS requires both the NFS service and the portmapper service to be running on the client and the server.

# rpcinfo -p
       or
# /etc/init.d/portmap status

If it is not running, start the portmap service (on newer releases the service is called rpcbind).
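To make the rpcinfo output quicker to read, it can be piped through a small filter that reports whether the services NFS depends on are registered. This is a sketch under the assumption that the service name is in the fifth column, as in the usual "program vers proto port service" layout; check_rpc_services is a hypothetical name.

```shell
# Read "rpcinfo -p" output on stdin and report which of the services
# NFS depends on are registered with the port mapper.
check_rpc_services() {
    awk '
        $5 == "portmapper" { seen["portmapper"] = 1 }
        $5 == "mountd"     { seen["mountd"] = 1 }
        $5 == "nfs"        { seen["nfs"] = 1 }
        END {
            n = split("portmapper mountd nfs", want, " ")
            for (i = 1; i <= n; i++) {
                if (want[i] in seen)
                    print want[i] ": registered"
                else {
                    print want[i] ": MISSING"
                    missing = 1
                }
            }
            exit missing
        }
    '
}
# usage: rpcinfo -p nfsserver | check_rpc_services
```

On a pure client, mountd and nfs are expected to be missing; the check is mainly useful when pointed at the server.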

Error: "NFS Stale File Handle"

An application accesses an NFS file the same way it accesses a local file: the 'open' system call returns a file descriptor, or handle, which the program then uses in its I/O calls to identify the file.

When an NFS share is unshared, or the NFS server changes the file handle, any NFS client that attempts further I/O on the share receives the 'NFS Stale File Handle' error.

On the client:

Unmount and remount the share (umount -f /nfsmount). If it cannot be unmounted,
kill the processes that are using /nfsmount and try again.

or

If the options above do not work, reboot the client to clear the stale NFS handle.
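The client-side steps can be put together roughly as below. This is only a sketch: it defaults to a dry run that prints the commands, the lazy unmount (umount -l) fallback is my own addition, and the final mount assumes /nfsmount has an /etc/fstab entry.

```shell
# Recover a stale NFS mount on the client. With DRY_RUN=1 (the default here)
# the commands are only printed so they can be reviewed first.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi
}

recover_stale_mount() {
    mnt=$1
    run fuser -km "$mnt"              # kill processes using the mount
    if ! run umount -f "$mnt"; then   # forced unmount
        run umount -l "$mnt"          # fall back to a lazy unmount
    fi
    run mount "$mnt"                  # remount (needs an /etc/fstab entry)
}

recover_stale_mount /nfsmount
```

Note that fuser -k sends SIGKILL, so anything still writing to the share loses its unsaved data.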

Error: "No route to host"

This can be reported when the client attempts to mount the NFS file system, even though the client can ping the server successfully.

This can be due to RPC messages being filtered by the host firewall, the client firewall or a network switch. Verify the firewall rules.
Temporarily stop iptables and check whether port 2049 is reachable.
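Since ping only proves that ICMP gets through, a quick TCP probe of the ports NFS uses shows whether something in between is filtering them. The sketch below assumes bash (for its /dev/tcp redirection) and the coreutils timeout command are available; the function names are made up.

```shell
# Probe the TCP ports NFS needs; ping can succeed while these are blocked.
port_open() {
    host=$1; port=$2
    timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

check_nfs_ports() {
    host=$1
    for port in 111 2049; do      # 111 = portmapper/rpcbind, 2049 = nfs
        if port_open "$host" "$port"; then
            echo "$host:$port reachable"
        else
            echo "$host:$port blocked or closed"
        fi
    done
}
# usage: check_nfs_ports nfsserver
```

If both ports show as blocked while ping works, the firewall rules on the server are the first place I would look.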

Hope this helps everyone who works with NFS regularly. These are the errors I have come across most often in my experience.

Thanks for sharing !

Sunday 15 May 2016

CentOS/RHEL 7 kernel dump & debug

Applies : CentOS / RHEL / OEL 7 

Arch : x86_64

When kdump is enabled and the system crashes, a second kernel is booted from the context of the crashed one. This second kernel runs in a small amount of memory reserved at boot, and its only purpose is to capture the core dump image. Being able to analyze the core dump helps significantly in determining the exact cause of the system failure.

Configuring kdump :

The kdump service comes with the kexec-tools package, which needs to be installed:

# yum install kexec-tools

Set the amount of memory to reserve for kdump with the crashkernel=<size> parameter in GRUB_CMDLINE_LINUX:

# cat /etc/default/grub
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="rd.lvm.lv=centos/swap vconsole.font=latarcyrheb-sun16 rd.lvm.lv=centos/root crashkernel=128M  vconsole.keymap=us rhgb quiet"
GRUB_DISABLE_RECOVERY="true"
#
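If you would rather script the change than edit the file by hand, something like the following could add or update the parameter. It is a sketch that assumes GRUB_CMDLINE_LINUX is a single double-quoted line, as in the file above; set_crashkernel is a hypothetical helper, and you should try it on a copy of the file first.

```shell
# Add or update crashkernel=<size> inside GRUB_CMDLINE_LINUX="...".
set_crashkernel() {
    size=$1; file=${2:-/etc/default/grub}
    tmp=$(mktemp) || return 1
    if grep -q 'crashkernel=' "$file"; then
        # Replace the existing value, up to the next space or quote.
        sed "s/crashkernel=[^ \"]*/crashkernel=$size/" "$file" > "$tmp"
    else
        # Append the parameter just before the closing quote.
        sed "s/^\(GRUB_CMDLINE_LINUX=\".*\)\"\$/\1 crashkernel=$size\"/" "$file" > "$tmp"
    fi
    cat "$tmp" > "$file"      # write back in place, keeping file permissions
    rm -f "$tmp"
}
# usage: cp /etc/default/grub /tmp/grub.test && set_crashkernel 128M /tmp/grub.test
```

The grub configuration still has to be regenerated afterwards for the change to reach the boot entries.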

Regenerate the grub configuration and reboot for the kernel parameter to take effect:

# grub2-mkconfig -o /boot/grub2/grub.cfg
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-3.10.0-123.el7.x86_64
Found initrd image: /boot/initramfs-3.10.0-123.el7.x86_64.img
Warning: Please don't use old title `CentOS Linux, with Linux 3.10.0-123.el7.x86_64' for GRUB_DEFAULT, use `Advanced options for CentOS Linux>CentOS Linux, with Linux 3.10.0-123.el7.x86_64' (for versions before 2.00) or `gnulinux-advanced-1a06e03f-ad9b-44bf-a972-3a821fca1254>gnulinux-3.10.0-123.el7.x86_64-advanced-1a06e03f-ad9b-44bf-a972-3a821fca1254' (for 2.00 or later)
Found linux image: /boot/vmlinuz-0-rescue-ae1ddf63f5e04857b5e89cd8fcf1f9e1
Found initrd image: /boot/initramfs-0-rescue-ae1ddf63f5e04857b5e89cd8fcf1f9e1.img
done
#

Configure kdump in /etc/kdump.conf

By default the vmcore is stored in the /var/crash directory. If you want it dumped to a different partition, disk or NFS share, the target must be defined here.

ext3 /dev/sdd1
or
nfs nfs.yourdomain.com:/export/dump

(On CentOS/RHEL 7 the keyword for an NFS target is nfs; net is the older spelling used by earlier releases.)

Compress the vmcore file to reduce its size:
core_collector makedumpfile -c

The default directive controls what happens if the dump cannot be saved to the configured target. Older releases fall back to mounting the root fs and running /sbin/init; to reboot instead, set:
default reboot
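To double-check where the vmcore will end up, the active lines of kdump.conf can be summarised with a short filter. This is a sketch only: kdump_target is a made-up name and the list of target keywords in the pattern is my assumption of the common ones, not the complete set.

```shell
# Print the dump target and path configured in a kdump.conf-style file.
# With no target line, the dump goes under "path" on the local root fs.
kdump_target() {
    conf=${1:-/etc/kdump.conf}
    awk '
        /^[ \t]*(#|$)/ { next }
        $1 ~ /^(ext[234]|xfs|btrfs|minix|nfs|ssh|raw|net)$/ { target = $1 " " $2 }
        $1 == "path" { path = $2 }
        END {
            if (target == "") target = "local root file system"
            if (path == "") path = "/var/crash"
            print "target: " target
            print "path: " path
        }
    ' "$conf"
}
# usage: kdump_target
```

Restart the kdump service after changing kdump.conf so that the new settings are picked up.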

Check the kernel command line and the kdump settings, then start kdump:

# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-123.el7.x86_64 root=UUID=1a06e03f-ad9b-44bf-a972-3a821fca1254 ro rd.lvm.lv=centos/swap vconsole.font=latarcyrheb-sun16 rd.lvm.lv=centos/root crashkernel=128M vconsole.keymap=us rhgb quiet

# grep -v  '#' /etc/sysconfig/kdump | sed '/^$/d'
KDUMP_KERNELVER=""
KDUMP_COMMANDLINE=""
KDUMP_COMMANDLINE_APPEND="irqpoll nr_cpus=1 reset_devices cgroup_disable=memory mce=off numa=off udev.children-max=2 panic=10 rootflags=nofail acpi_no_memhotplug"
KEXEC_ARGS=""
KDUMP_BOOTDIR="/boot"
KDUMP_IMG="vmlinuz"
KDUMP_IMG_EXT=""
#

# systemctl enable kdump.service
# systemctl start kdump.service
# systemctl is-active kdump
active
#
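Before forcing a crash it is worth confirming that memory is really reserved and that the capture kernel is loaded. The sketch below reads /proc/cmdline and /sys/kernel/kexec_crash_loaded (which contains 1 when a capture kernel is loaded); kdump_ready is a hypothetical name, and the file arguments exist only so it can be tried against copies.

```shell
# Pre-flight check: the boot command line must reserve memory for the
# capture kernel, and that kernel must actually be loaded.
kdump_ready() {
    cmdline_file=${1:-/proc/cmdline}
    loaded_file=${2:-/sys/kernel/kexec_crash_loaded}
    reserved=$(tr ' ' '\n' < "$cmdline_file" | sed -n 's/^crashkernel=//p')
    if [ -z "$reserved" ]; then
        echo "crashkernel= is not on the kernel command line"
        return 1
    fi
    if [ "$(cat "$loaded_file" 2>/dev/null)" != 1 ]; then
        echo "crashkernel=$reserved is set, but no capture kernel is loaded"
        return 1
    fi
    echo "kdump ready, crashkernel=$reserved"
}
# usage: kdump_ready
```

If the first check fails, the system was most likely not rebooted after the grub change.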

Test your configuration by forcing a kernel panic (the system will crash immediately, so do this only on a test machine):

# echo 1 > /proc/sys/kernel/sysrq
# echo c > /proc/sysrq-trigger



After the crash you should find a vmcore under /var/crash. To analyse it, install the crash utility and the kernel debuginfo packages.

# yum install crash

I was able to download the debuginfo packages from https://oss.oracle.com/ol7/debuginfo/. Check your kernel version (uname -r) and download the debuginfo packages that match it exactly.

#rpm -ivh kernel-debuginfo-common-x86_64-3.10.0-123.el7.x86_64.rpm \
               kernel-debuginfo-3.10.0-123.el7.x86_64.rpm \
               kernel-debug-debuginfo-3.10.0-123.el7.x86_64.rpm
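Because the debuginfo version has to match the kernel that crashed, the package names can be derived from the release string instead of being typed by hand. A small sketch, with debuginfo_pkgs as a made-up helper name; it follows the naming pattern of the packages listed above.

```shell
# Derive the debuginfo package names that match a kernel release string.
debuginfo_pkgs() {
    release=${1:-$(uname -r)}          # e.g. 3.10.0-123.el7.x86_64
    arch=${release##*.}                # text after the last dot, e.g. x86_64
    echo "kernel-debuginfo-common-$arch-$release"
    echo "kernel-debuginfo-$release"
}

debuginfo_pkgs 3.10.0-123.el7.x86_64
```

The kernel-debug-debuginfo package is only needed when the crashed system was running the debug kernel variant.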

# ls -lh /var/crash/127.0.0.1-2016.05.15-04\:50\:40/vmcore
-rw-------. 1 root root 168M May 15 04:51 /var/crash/127.0.0.1-2016.05.15-04:50:40/vmcore
#

# crash /var/crash/127.0.0.1-2016.05.15-04\:50\:40/vmcore /usr/lib/debug/lib/modules/`uname -r`/vmlinux

WARNING: kernel version inconsistency between vmlinux and dumpfile

      KERNEL: /usr/lib/debug/lib/modules/3.10.0-123.el7.x86_64/vmlinux
    DUMPFILE: /var/crash/127.0.0.1-2016.05.15-04:50:40/vmcore
        CPUS: 1
        DATE: Sun May 15 04:50:38 2016
      UPTIME: 00:10:24
LOAD AVERAGE: 0.02, 0.07, 0.05
       TASKS: 104
    NODENAME: slnxcen01
     RELEASE: 3.10.0-123.el7.x86_64
     VERSION: #1 SMP Mon Jun 30 12:09:22 UTC 2014
     MACHINE: x86_64  (2294 Mhz)
      MEMORY: 1.4 GB
       PANIC: "Oops: 0002 [#1] SMP " (check log for details)
         PID: 2266
     COMMAND: "bash"
        TASK: ffff880055650b60  [THREAD_INFO: ffff880053fb2000]
         CPU: 0
       STATE: TASK_RUNNING (PANIC)

crash>


crash> bt
PID: 2266   TASK: ffff880055650b60  CPU: 0   COMMAND: "bash"
 #0 [ffff880053fb3a98] machine_kexec at ffffffff81041181
 #1 [ffff880053fb3af0] crash_kexec at ffffffff810cf0e2
 #2 [ffff880053fb3bc0] oops_end at ffffffff815ea548
.
.
.
crash> files
PID: 2266   TASK: ffff880055650b60  CPU: 0   COMMAND: "bash"
ROOT: /    CWD: /root
 FD       FILE            DENTRY           INODE       TYPE PATH
  0 ffff880053c47a00 ffff8800563383c0 ffff880055bad2f0 CHR  /dev/tty1
  1 ffff8800542a9100 ffff88004dd4ff00 ffff88004dc0b750 REG  /proc/sysrq-trigger
.
.
.
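The same commands can also be run without the interactive prompt: crash accepts a command file with -i and -s keeps the startup banner quiet. The sketch below writes such a file and prints the command line to run; crash_batch and the default path /tmp/crash.cmds are my own choices.

```shell
# Write a crash command file and print the command line that would run it
# non-interactively against a given vmlinux and vmcore.
crash_batch() {
    vmlinux=$1; vmcore=$2; cmdfile=${3:-/tmp/crash.cmds}
    # bt = backtrace, log = kernel message buffer, ps = process list,
    # files = open files of the panic task, exit = leave crash when done
    printf '%s\n' bt log ps files exit > "$cmdfile"
    echo "crash -s -i $cmdfile $vmlinux $vmcore"
}
# usage: crash_batch /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /path/to/vmcore
```

Redirecting the output of the printed command to a file gives a report that is easy to attach to a support case.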
That concludes the article.

References :