ORA-04021: timeout occurred while waiting to lock object的解决办法

版权声明:本文为Buddy Yuan原创文章,未经允许不得转载。原文地址:http://www.dboracle.com/archivers/ORA-04021: TIMEOUT OCCURRED WHILE WAITING TO LOCK OBJECT的解决办法.html
国庆都放假了,咱们还处理了一起小故障。先说说这个故障吧,Acticve DataGuard突然实例宕机。通过查看Alert的Trace发现是LGWR进程把实例宕掉了。具体alert trace如下:

Mon Oct 01 00:15:49 2018
Media Recovery Waiting for thread 2 sequence 309643 (in transit)
Recovery of Online Redo Log: Thread 2 Group 64 Seq 309643 Reading mem 0
  Mem# 0: +DG_DATA/dgskgj/onlinelog/group_64.807.942196417
  Mem# 1: +DG_DATA/dgskgj/onlinelog/group_64.717.936131839
Mon Oct 01 00:15:54 2018
Archived Log entry 186436 added for thread 2 sequence 309642 ID 0x2f689337 dest 1:
Mon Oct 01 00:16:17 2018
Errors in file /oracle/app/product/diag/rdbms/dgskgj/skgj1/trace/skgj_lgwr_14418510.trc:
ORA-04021: timeout occurred while waiting to lock object 
LGWR (ospid: 14418510): terminating the instance due to error 4021
Mon Oct 01 00:16:18 2018
System state dump requested by (instance=1, osid=14418510 (LGWR)), summary=[abnormal instance termination].

可以看到这里首先出现了ORA-04021: timeout occurred while waiting to lock object,紧接着LGWR就terminating实例。所以我们在这里要先看一下LGWR的Trace情况。

Trace file /oracle/app/product/diag/rdbms/dgskgj/skgj1/trace/skgj1_lgwr_14418510.trc
Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit Production
With the Partitioning, Real Application Clusters, Automatic Storage Management, OLAP,
Data Mining and Real Application Testing options
ORACLE_HOME = /oracle/app/product/11.2/db
System name:    AIX
Node name:      drskgj1
Release:        1
Version:        6
Machine:        00F96EC64C00
Instance name: skgj1
Redo thread mounted by this instance: 1
Oracle process number: 23
Unix process pid: 14418510, image: oracle@drskgj1 (LGWR)


*** 2018-10-01 00:16:17.890
*** SESSION ID:(4348.1) 2018-10-01 00:16:17.890
*** CLIENT ID:() 2018-10-01 00:16:17.890
*** SERVICE NAME:(SYS$BACKGROUND) 2018-10-01 00:16:17.890
*** MODULE NAME:() 2018-10-01 00:16:17.890
*** ACTION NAME:() 2018-10-01 00:16:17.890

*** TRACE FILE RECREATED AFTER BEING REMOVED ***

error 4021 detected in background process
ORA-04021: timeout occurred while waiting to lock object 
kjzduptcctx: Notifying DIAG for crash event
----- Abridged Call Stack Trace -----
ksedsts()+240<-kjzdssdmp()+240<-kjzduptcctx()+228<-kjzdicrshnfy()+120<-ksuitm()+5136<-ksbrdp()+4696<-opirip()+1620<-opidrv()+608<-sou2o()+136<-opimai_real()+188<-ssthrdmain()+276<-main()+204<-__start()+112 
----- End of Abridged Call Stack Trace -----

*** 2018-10-01 00:16:17.896
LGWR (ospid: 14418510): terminating the instance due to error 4021
ksuitm: waiting up to [5] seconds before killing DIAG(15991690)

这里可以看到LGWR给的信息也很有限,就是给了一串堆栈的信息,也没有什么特别的其他信息。所以这种情况下,我们就只能先借助MOS查一下ORA-04021: timeout occurred while waiting to lock object是什么问题。通过搜索我们发现文档ORA-04021: timeout occurred while waiting to lock object : DR Instance terminated by LGWR (文档 ID 2183882.1)和我们遇到的问题是一致的。首先怎么判断问题是一致的呢?第一,alter日志报错的方式相同,第二,lgwr的trace堆栈是一样的。那么这篇告诉我们,我们命中了Bug 16717701 – ADG SHOULD GET THE INSTANCE PARSE LOCK WITH A TIMEOUT或者是Bug 11712267 – ACTIVE DATA GUARD DATABASE HUNG ON ‘LIBRARY CACHE: MUTEX X’ WAIT EVENT。

该问题的原因是当通过ADG中的恢复,LGWR将DB INSTANCE状态对象锁定为独占模式。这样的结果是LGWR可以阻止SQL的解析,而SQL的解析也能阻止LGWR。这是非常糟糕的行为。我们可以通过查看下列视图查询这个行为。

SQL> select a.*,b.name from v$sesstat a , v$statname b
  2  where a.statistic#=b.statistic# 
  3  and a.sid=(select distinct sid from v$mystat)
  4  and b.name like '%parse%';
       SID STATISTIC#	   VALUE NAME
---------- ---------- ---------- ------------------------------
       117	  264	       0 ADG parselock X get attempts
       117	  265	       0 ADG parselock X get successes

所以防止这个问题的办法就是
1.先尝试使用选项步骤1,将cursor_sharing更改为force。减少SQL解析的时间。
2.如果再次发生该问题,将隐含参数”_adg_parselock_timeout”设置成500。这个隐含参数是可以动态修改的。这个参数是防止超时的。

关注dbDao.com的新浪微博

扫码加入微信Oracle小密圈,了解Oracle最新技术下载分享资源

TEL/電話+86 13764045638
Email service@parnassusdata.com
QQ 47079569