A Case of ora.asm Failing to Start and the 'GPnP Get Item' Wait

Copyright notice: this is an original article by Buddy Yuan and may not be reproduced without permission.
Last night I went to apply a DB PSU to an 11.2.0.3 RAC. I quickly stopped the databases and the clusterware and got the patch installed, thinking everything was going smoothly, but when I restarted the clusterware it would not come up. crsctl check crs showed that the crsd daemon could not start, yet crsd.log contained no entries at all; it stayed empty. Running ocrcheck simply hung, while crsctl query css votedisk returned quickly and looked healthy. Reading the ASM disks with kfed also worked. Further checking showed that it was the ora.asm resource that was failing to start. Logging in to the ASM instance and reading its alert log, I could see that the spfile was readable and the instance itself started normally, but the diskgroup mount never got to the final step where it would be reported as successfully mounted.
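Roughly, the checks described above boil down to the following (a sketch only; the kfed device path is taken from the alert log below and will differ in other environments):

$ crsctl check crs                     # CRS stack reported up, but crsd would not start
$ ocrcheck                             # hung in this scenario
$ crsctl query css votedisk            # voting disks listed normally
$ kfed read /dev/diskgroup/dg_ocr      # ASM disk header still readable

The relevant portion of the ASM alert log on the failing node looks like this: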

SQL> ALTER DISKGROUP ALL MOUNT /* asm agent call crs *//* {0:0:2} */ 
NOTE: Diskgroups listed in ASM_DISKGROUPS are
 	 DG_FRA
 	 DG_DATA
NOTE: Diskgroup used for Voting files is:
 	 DG_OCR
Diskgroup with spfile:DG_OCR
Diskgroup used for OCR is:DG_OCR
NOTE: cache registered group DG_DATA number=1 incarn=0x50ec3256
NOTE: cache began mount (not first) of group DG_DATA number=1 incarn=0x50ec3256
NOTE: cache registered group DG_FRA number=2 incarn=0x50fc3257
NOTE: cache began mount (not first) of group DG_FRA number=2 incarn=0x50fc3257
NOTE: cache registered group DG_OCR number=3 incarn=0x50fc3258
NOTE: cache began mount (not first) of group DG_OCR number=3 incarn=0x50fc3258
NOTE: Assigning number (1,3) to disk (/dev/diskgroup/dg_data4)
NOTE: Assigning number (1,2) to disk (/dev/diskgroup/dg_data3)
NOTE: Assigning number (1,1) to disk (/dev/diskgroup/dg_data2)
NOTE: Assigning number (1,0) to disk (/dev/diskgroup/dg_data1)
NOTE: Assigning number (2,0) to disk (/dev/diskgroup/dg_fra)
NOTE: Assigning number (3,0) to disk (/dev/diskgroup/dg_ocr)
GMON querying group 1 at 4 for pid 23, osid 128640
NOTE: cache opening disk 0 of grp 1: DG_DATA_0000 path:/dev/diskgroup/dg_data1
NOTE: F1X0 found on disk 0 au 2 fcn 0.0
NOTE: cache opening disk 1 of grp 1: DG_DATA_0001 path:/dev/diskgroup/dg_data2
NOTE: cache opening disk 2 of grp 1: DG_DATA_0002 path:/dev/diskgroup/dg_data3
NOTE: cache opening disk 3 of grp 1: DG_DATA_0003 path:/dev/diskgroup/dg_data4
NOTE: cache mounting (not first) external redundancy group 1/0x50EC3256 (DG_DATA)

As the log shows, the disks were being read normally at first; after some intermediate work, the following errors appear at the very end:

NOTE: detached from domain 3
NOTE: cache dismounted group 3/0x50FC3258 (DG_OCR) 
NOTE: cache ending mount (fail) of group DG_OCR number=3 incarn=0x50fc3258
NOTE: cache deleting context for group DG_OCR 3/0x50fc3258
GMON dismounting group 3 at 9 for pid 23, osid 128640
NOTE: Disk  in mode 0x8 marked for de-assignment
ERROR: diskgroup DG_OCR was not mounted

Next I connected to the ASM instances and checked the diskgroups. On the healthy instance the diskgroups were in MOUNTED state, while on the ASM instance on the problem node they were stuck in MOUNTING state, i.e. still in the middle of being mounted. The open question was why the mount would not complete.
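One way to see this difference, as a sketch (GV$ASM_DISKGROUP lets you compare both ASM instances from a single session):

SQL> select inst_id, name, state from gv$asm_diskgroup;
-- node 1 showed MOUNTED, while node 2 was stuck showing MOUNTING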
To clear this up, I ran a global (cluster-wide) hang analysis. It has to be cluster-wide; that is the only way to see why the diskgroup cannot be mounted.
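As a minimal sketch, this is how a cluster-wide hang analysis can be triggered from one of the ASM instances (assuming SYSASM access; level 3 matches the oradebug_node_dump_level seen in the trace header below):

$ sqlplus / as sysasm
SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug -g all hanganalyze 3
SQL> oradebug tracefile_name        -- the trace file holding the HANG ANALYSIS section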

*** 2018-10-26 04:36:34.280
===============================================================================
HANG ANALYSIS:
  instances (db_name.oracle_sid): +asm.+asm2, +asm.+asm1
  oradebug_node_dump_level: 3
  analysis initiated by oradebug
  os thread scheduling delay history: (sampling every 1.000000 secs)
    0.000000 secs at [ 04:36:34 ]
      NOTE: scheduling delay has not been sampled for 0.184529 secs
    0.000000 secs from [ 04:36:30 - 04:36:35 ], 5 sec avg
    0.000000 secs from [ 04:35:34 - 04:36:35 ], 1 min avg
    0.000000 secs from [ 04:31:34 - 04:36:35 ], 5 min avg
===============================================================================
 
Chains most likely to have caused the hang:
 [a] Chain 1 Signature: 'GPnP Get Item'<='DFS lock handle'
     Chain 1 Signature Hash: 0xe08712d6
 [b] Chain 2 Signature: 'GPnP Get Item'<='rdbms ipc reply'<='enq: DD - contention'
     Chain 2 Signature Hash: 0xec6b634f
 [c] Chain 3 Signature: 'GPnP Get Item'<='rdbms ipc reply'<='enq: DD - contention'
     Chain 3 Signature Hash: 0xec6b634f
 
===============================================================================
Non-intersecting chains:
 
-------------------------------------------------------------------------------
Chain 1:
-------------------------------------------------------------------------------
    Oracle session identified by:
    {
                instance: 2 (+asm.+asm2)
                   os id: 88049
              process id: 23, oracle@KSPX-DB2 (TNS V1-V3)
              session id: 1427
        session serial #: 3
    }
    is waiting for 'DFS lock handle' with wait info:
    {
                      p1: 'type|mode'=0x43490005
                      p2: 'id1'=0x3c
                      p3: 'id2'=0x2
            time in wait: 49 min 17 sec
           timeout after: never
                 wait id: 107
                blocking: 0 sessions
             current sql: ALTER DISKGROUP ALL MOUNT /* asm agent call crs *//* {0:0:2} */
             short stack: ksedsts()+461<-ksdxfstk()+32<-ksdxcb()+1876<-sspuser()+112<-__sighandler()<-semtimedop()+10<-skgpwwait()+160<-ksliwat()+1865<-kslwaitctx()+163<-kjusuc()+3619<-ksigeti()+3781<-ksbcic_int()+21066<-ksbcic()+12<-kfgbSendPeelWait()+53<-kfgFinalizeMount()+4736<-kfgscFinalize()+1405<-kfgForEachKfgsc()+193<-kfgsoFinalize()+135<-kfgFinalize()+396<-kfxdrvMount()+4781<-kfxdrvEntry()+2218<-opiexe()+20368<-opiosq0()+2948<-kpooprx()+274<-kpoal8()+829<-opiodr()+916<-ttcpip()+2242<-opitsk()+1673<-opiino()+966<-opiodr()+916<-opidr
            wait history:
              * time between current wait and wait #1: 0.000149 sec
              1.       event: 'rdbms ipc reply'
                 time waited: 0.000338 sec
                     wait id: 106             p1: 'from_process'=0x12
                                              p2: 'timeout'=0x7fffffff
              * time between wait #1 and #2: 0.000054 sec
              2.       event: 'kfk: async disk IO'
                 time waited: 0.000182 sec
                     wait id: 105             p1: 'count'=0x1
                                              p2: 'intr'=0x0
                                              p3: 'timeout'=0xffffffff
              * time between wait #2 and #3: 0.000157 sec
              3.       event: 'rdbms ipc reply'
                 time waited: 0.000372 sec
                     wait id: 104             p1: 'from_process'=0x12
                                              p2: 'timeout'=0x7fffffff
    }
    and is blocked by
 => Oracle session identified by:
    {
                instance: 1 (+asm.+asm1)
                   os id: 15643
              process id: 18, oracle@KSPX-DB1 (RBAL)
              session id: 1117
        session serial #: 1
    }
    which is waiting for 'GPnP Get Item' with wait info:
    {
            time in wait: 221831 min 33 sec
           timeout after: never
                 wait id: 124844244
                blocking: 44 sessions
             current sql: 
             short stack: ksedsts()+461<-ksdxfstk()+32<-ksdxcb()+1876<-sspuser()+112<-__sighandler()<-__poll()+47<-sgipcwWaitHelper()+5271<-sgipcwWait()+475<-gipcWaitOsd()+281<-gipcInternalWait()+14965<-gipcWaitF()+3234<-gipcInternalRecvSync()+8749<-gipcRecvSyncF()+3343<-clsgpnpm_gipcGets()+297<-clsgpnpm_receiveMsg()+408<-clsgpnpm_exchange()+1059<-clsgpnp_profileCallUrlInt()+3568<-clsgpnp_getProfileEx()+325<-clsgpnp_dbmsGetItem()+263<-kggpnpAttrGet()+1272<-kfdParseProfileString()+885<-kfdDiscoverShallow()+2023<-kfgbDriver()+1745<-ksbabs()+
            wait history:
              * time between current wait and wait #1: 0.000007 sec
              1.       event: 'GPnP Initialization'
                 time waited: 0.018700 sec
                     wait id: 124844243       
              * time between wait #1 and #2: 0.000207 sec
              2.       event: 'CSS initialization'
                 time waited: 0.000027 sec
                     wait id: 124844242       
              * time between wait #2 and #3: 0.000062 sec
              3.       event: 'rdbms ipc message'
                 time waited: 0.002787 sec
                     wait id: 124844241       p1: 'timeout'=0x12c
    }

The global hang analysis shows that after the ASM instance on node 2 started, it ran "ALTER DISKGROUP ALL MOUNT", and that command never finished: it had been blocked for more than 40 minutes (49 min 17 sec in the trace). The blocker was the RBAL process on node 1, which itself was stuck on a wait event called 'GPnP Get Item'. With the cause identified, it was time to search MOS. The note "Diskgroup Mount Hangs with RBAL Waiting on 'GPnP Get Item' and 'enq: DD - contention' (Doc ID 1375505.1)" matches this exactly: the problem is caused by unpublished bug 12398300. The root cause is a stuck gpnpd process, which leaves RBAL waiting. The fix is simple: kill the gpnpd process on node 1 and the wait is released. Killing gpnpd does not bounce the whole clusterware stack; the clusterware just restarts that one daemon on its own.
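On node 1 the fix amounts to something like the following (a sketch; gpnpd.bin is the GPnP daemon binary in a standard 11.2 Grid Infrastructure installation, and ohasd respawns it automatically):

$ ps -ef | grep gpnpd.bin | grep -v grep   # find the stuck GPnP daemon
$ kill -9 <pid>                            # <pid> is the gpnpd.bin process id found above
$ crsctl stat res -t -init                 # ora.gpnpd should come back ONLINE on its own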
After gpnpd was killed, the hang cleared and the clusterware on node 2 came up without further issues.
One takeaway from this case: whenever the ora.asm resource fails to start, a global hang analysis is a good tool to reach for.
