I recently had to troubleshoot a hung instance in a two-node RAC system. Four months earlier, the system had been reinstalled in a rolling fashion because it had to be upgraded from Oracle Linux 5 to Oracle Linux 7. The upgrade was required because of missing certification for a storage migration to all-flash storage. The system had been stable on Oracle Linux 5 for several years. About four months after the reinstallation, one node hung, and the traces showed these error messages:
Mon Feb 26 08:53:37 2018
skgxpvfynet: mtype: 61 process 15801 failed because of a resource problem in the OS. The OS has most likely run out of buffers (rval: 4)
Errors in file /u01/app/oracle/diag_p/diag/rdbms/prod/PROD2/trace/PROD2_ora_15801.trc (incident=480004):
ORA-00603: ORACLE server session terminated by fatal error
ORA-27504: IPC error creating OSD context
ORA-27300: OS system dependent operation:sendmsg failed with status: 105
ORA-27301: OS failure message: No buffer space available
ORA-27302: failure occurred at: sskgxpsnd2
Incident details in: /u01/app/oracle/diag_p/diag/rdbms/prod/PROD2/incident/incdir_480004/PROD2_ora_15801_i480004.trc
This was strange, because /proc/meminfo showed plenty of free memory on this physical host with 512 GB of RAM.
[root@node17 ~]# cat /proc/meminfo
zzz ***Mon Feb 26 08:53:09 CET 2018
MemTotal: 528028424 kB
MemFree: 14593828 kB
MemAvailable: 78305772 kB
Buffers: 28009752 kB
Cached: 46896496 kB
SwapCached: 0 kB
Active: 22627436 kB
Inactive: 66945168 kB
Active(anon): 14315300 kB
Inactive(anon): 2105748 kB
Active(file): 8312136 kB
Inactive(file): 64839420 kB
Unevictable: 363996 kB
Mlocked: 364020 kB
SwapTotal: 33554428 kB
SwapFree: 33554428 kB
Dirty: 404 kB
Writeback: 0 kB
AnonPages: 15768480 kB
Mapped: 709896 kB
Shmem: 945384 kB
Slab: 2462644 kB
SReclaimable: 2092232 kB
SUnreclaim: 370412 kB
KernelStack: 28336 kB
PageTables: 395568 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 87853440 kB
Committed_AS: 22087384 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 1106260 kB
VmallocChunk: 34085810172 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
CmaTotal: 16384 kB
CmaFree: 0 kB
HugePages_Total: 204800
HugePages_Free: 20214
HugePages_Rsvd: 6266
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 36827756 kB
DirectMap2M: 78444544 kB
DirectMap1G: 423624704 kB
Oracle Support then referred to this MOS Note:
ORA-27301: OS Failure Message: No Buffer Space Available / ORA-27302: failure occurred at: sskgxpsnd2 ( Doc ID 2322410.1 )
It turned out that on Oracle Linux 7 systems with a lot of physical memory, the MTU of the loopback interface lo has to be reduced from the default of 64k (65536 bytes) to 16436 bytes to avoid memory fragmentation in RAC. The note also mentioned that the parameter vm.min_free_kbytes should be set to physmem * 0.004, that is, 0.4% of physical memory.
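To make the two settings concrete, here is a minimal Python sketch of how the values could be checked on a host. It is not taken from the MOS note. The function names are hypothetical, the 0.004 factor and the 16436-byte target are simply the figures quoted above, and the script assumes a Linux system where the loopback interface is named lo and exposes its MTU under /sys/class/net.

```python
import re
from pathlib import Path

# Figures quoted in the post; treat them as assumptions, not official constants.
TARGET_LO_MTU = 16436            # loopback MTU in bytes
MIN_FREE_PER_MILLE = 4           # physmem * 0.004, kept in integer arithmetic


def parse_mem_total_kb(meminfo_text: str) -> int:
    """Extract the MemTotal value (in kB) from /proc/meminfo-style text."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal not found in meminfo text")
    return int(match.group(1))


def recommended_min_free_kbytes(mem_total_kb: int) -> int:
    """Return 0.4% of physical memory in kB, rounded down."""
    return mem_total_kb * MIN_FREE_PER_MILLE // 1000


def read_mtu(interface: str = "lo") -> int:
    """Read the current MTU of a network interface from sysfs (Linux only)."""
    return int(Path(f"/sys/class/net/{interface}/mtu").read_text().strip())


if __name__ == "__main__":
    mem_total_kb = parse_mem_total_kb(Path("/proc/meminfo").read_text())
    print(f"suggested vm.min_free_kbytes = {recommended_min_free_kbytes(mem_total_kb)}")

    mtu = read_mtu("lo")
    verdict = "ok" if mtu <= TARGET_LO_MTU else "larger than target"
    print(f"lo MTU = {mtu} ({verdict}, target {TARGET_LO_MTU})")
```

For the host in this post, MemTotal of 528028424 kB gives a vm.min_free_kbytes of 2112113 kB, roughly 2 GB. The script only reports values; how to apply and persist them should be taken from the MOS note itself.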
I was very surprised that neither the Cluster Verification Utility (CVU) nor the most recent version of orachk caught this problem at installation time. In my opinion, if the default MTU of the loopback interface on Oracle Linux 7 has the potential to cause an outage, then it must be pre-checked by CVU at installation time or at least integrated into orachk. Unfortunately, this is not the case. We will probably know in July whether the on-premises release 18.3.0 finally catches and enforces this configuration requirement at installation time.