TIP #104: Node Eviction in RAC 11gR2 due to temporary network hiccups on heartbeat communication

In the next couple of months, I will examine different eviction scenarios in RAC 11gR2, and whenever I see some strange behavior, I will do my best to report it on this blog. So, RAC lovers, please stay tuned. Here is scenario number 1:

As you know, in 11gR2 Oracle uses the UDP protocol for heartbeats between the nodes.
In this post, I present a node eviction scenario in which UDP communication is blocked between the nodes; you will see that, depending on where and how UDP is blocked, different situations can occur.
The test is done on a two-node RAC running 11.2.0.3 PSU3 on Linux.

Scenario 1: When UDP communication is blocked on the second node


In this scenario, outgoing UDP traffic for the ocssd process on node2 is blocked.
To do so, we find the UDP port on which ocssd is listening and disable any outgoing traffic on it:

netstat -a --inet |grep -i udp | grep -i racnode2

udp        0      0 racnode2-priv:14081         *:*                                     
udp        0      0 racnode2-priv:52358         *:*                                     
udp        0      0 racnode2-priv:52242         *:*                                     
udp        0      0 racnode2-priv:42517         *:*  --> ocssd
udp        0      0 racnode2-priv:31126         *:*                                     
udp        0      0 racnode2-priv:60741         *:*  

[root@racnode2 ~]# lsof -i :42517
COMMAND    PID USER   FD   TYPE DEVICE SIZE NODE NAME
ocssd.bin 3005 grid   55u  IPv4  22340       UDP racnode2-priv:42517 


42517 is the port from which ocssd on racnode2 sends its heartbeats.
To break heartbeat communication between node2 and node1, any outgoing traffic from the ocssd process on racnode2 (port 42517) is blocked with this command:

iptables -A OUTPUT -s 192.168.2.152 -p udp --sport 42517 -j DROP
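The port lookup above can be scripted so the iptables rule always targets ocssd's current port. A minimal sketch; the sample line is the lsof output captured above, and in a live test you would pipe real `lsof -i` output instead:

```shell
# Extract the UDP port from an lsof line like the one captured above.
# In a live test, replace the sample with output from:
#   lsof -nP -i UDP -a -c ocssd.bin
sample='ocssd.bin 3005 grid   55u  IPv4  22340       UDP racnode2-priv:42517'
port=$(printf '%s\n' "$sample" | awk '{n=split($NF,a,":"); print a[n]}')
echo "$port"
# The blocking rule then becomes (run as root):
#   iptables -A OUTPUT -s 192.168.2.152 -p udp --sport "$port" -j DROP
```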

The alert log on node1 shows that node2 is evicted (rebootless restart); it is then reconfigured and rejoins the cluster.

[cssd(3015)]CRS-1612:Network communication with node racnode2 (2) missing for 50% of timeout interval.  Removal of this node from cluster in 14.840 seconds
2012-12-31 10:50:54.213
[cssd(3015)]CRS-1611:Network communication with node racnode2 (2) missing for 75% of timeout interval.  Removal of this node from cluster in 6.820 seconds
2012-12-31 10:50:58.232
[cssd(3015)]CRS-1610:Network communication with node racnode2 (2) missing for 90% of timeout interval.  Removal of this node from cluster in 2.810 seconds
2012-12-31 10:51:01.056
[cssd(3015)]CRS-1607:Node racnode2 is being evicted in cluster incarnation 249572820; details at (:CSSNM00007:) in /u01/app/11.2.0/grid/log/racnode1/cssd/ocssd.log.
2012-12-31 10:51:02.584
[cssd(3015)]CRS-1625:Node racnode2, number 2, was manually shut down
2012-12-31 10:51:02.590
[cssd(3015)]CRS-1601:CSSD Reconfiguration complete. Active nodes are racnode1 .
2012-12-31 10:51:02.630
[crsd(3393)]CRS-5504:Node down event reported for node 'racnode2'.
2012-12-31 10:51:05.827
[crsd(3393)]CRS-2773:Server 'racnode2' has been removed from pool 'Generic'.
2012-12-31 10:51:05.829
[crsd(3393)]CRS-2773:Server 'racnode2' has been removed from pool 'ora.orcl'.
2012-12-31 10:51:37.987
[cssd(3015)]CRS-1601:CSSD Reconfiguration complete. Active nodes are racnode1 racnode2 .
2012-12-31 10:52:13.720
[crsd(3393)]CRS-2772:Server 'racnode2' has been assigned to pool 'Generic'.
2012-12-31 10:52:13.720
[crsd(3393)]CRS-2772:Server 'racnode2' has been assigned to pool 'Generic'.
2012-12-31 10:52:13.720
[crsd(3393)]CRS-2772:Server 'racnode2' has been assigned to pool 'ora.orcl'.
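As an aside, the countdown in those CRS-161x messages tracks the CSS misscount (30 seconds by default in 11.2): 50% missing leaves roughly 15 seconds, 75% roughly 7.5, and so on. The remaining seconds can be pulled out of an alert-log excerpt with a short awk sketch, shown here against two of the lines above:

```shell
# Print the "Removal ... in N seconds" countdown from CRS-1610/11/12 lines.
out=$(awk '/CRS-16(10|11|12)/ {for (i = 1; i <= NF; i++) if ($i == "in") {print $(i+1); break}}' <<'EOF'
[cssd(3015)]CRS-1612:Network communication with node racnode2 (2) missing for 50% of timeout interval.  Removal of this node from cluster in 14.840 seconds
[cssd(3015)]CRS-1611:Network communication with node racnode2 (2) missing for 75% of timeout interval.  Removal of this node from cluster in 6.820 seconds
EOF
)
echo "$out"
```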

ocssd.log on racnode2 has more details; I highlighted a couple of key lines:

2012-12-31 10:51:01.129: [    CSSD][3019058064]###################################
2012-12-31 10:51:01.129: [    CSSD][3019058064]clssscExit: CSSD aborting from thread clssnmvKillBlockThread
2012-12-31 10:51:01.129: [    CSSD][3019058064]###################################
.
.
2012-12-31 10:51:02.559: [    CSSD][3029027728]clssgmClientShutdown: total iocapables 0
2012-12-31 10:51:02.559: [    CSSD][3029027728]clssgmClientShutdown: graceful shutdown completed.
2012-12-31 10:51:02.559: [    CSSD][3029027728]clssnmSendManualShut: Notifying all nodes that this node has been manually shut down
.
.
2012-12-31 10:51:25.352: [    CSSD][3040868032]clssscmain: Starting CSS daemon, version 11.2.0.3.0, in (clustered) mode with uniqueness value 1356979885
2012-12-31 10:51:25.353: [    CSSD][3040868032]clssscmain: Environment is production
.
.
2012-12-31 10:51:26.167: [GIPCHTHR][3024477072] gipchaWorkerCreateInterface: created local interface for node 'racnode2', haName 'CSS_racnode-cluster', inf 'udp://192.168.2.152:29788'

A key point here is that UDP is reconfigured to run on a different port; after that, node2 is able to join the cluster and starts up all its resources.
The following netstat output also confirms that ocssd.bin listens on the new port:

[root@racnode2 ~]# netstat -a --inet |grep -i udp | grep -i racnode2
udp        0      0 racnode2-priv:31126         *:*                                     
udp        0      0 racnode2-priv:35489         *:*                                     
udp        0      0 racnode2-priv:38321         *:*                                     
udp        0      0 racnode2-priv:60741         *:*                                     
udp        0      0 racnode2-priv:10321         *:*                                     
udp        0      0 racnode2-priv:29788         *:*   --> new port is created.....
   
[root@racnode2 working]# lsof -i :29788
COMMAND    PID USER   FD   TYPE DEVICE SIZE NODE NAME
ocssd.bin 5919 grid   52u  IPv4 764918       UDP racnode2-priv:29788 

To sum up, the following sequence of events occurs:

  •  UDP communication for the heartbeat is blocked (outgoing UDP on the ocssd port)
  •  Node1 evicts Node2
  •  Node2 is able to stop all I/O-capable resources and, as a result, there is no need to reboot the node (the 11gR2 rebootless-restart feature)
  •  Node2 restarts CSSD and reconfigures the UDP port
  •  Node2 is able to rejoin the cluster

This sounds perfect, as node2 is able to recover by itself; the recovery is transparent and straightforward.
Let's see how this failure is recovered if the UDP hiccup occurs on node1 (the master node in a two-node RAC).


Scenario 2: When UDP communication is blocked on the first node

Following the same steps, the UDP port for the heartbeat is found and blocked, as shown below:

bash-3.2$ netstat -a --inet |grep -i udp | grep -i racnode1
udp        0      0 racnode1-priv:36613         *:*   
udp        0      0 racnode1-priv:36892         *:*   
udp        0      0 racnode1-priv:26055         *:*   
udp        0      0 racnode1-priv:13167         *:*   
udp        0      0 racnode1-priv:17914         *:*   
udp        0      0 racnode1-priv:51067         *:*   

[root@racnode1 ~]# lsof -i :36613
COMMAND    PID USER   FD   TYPE DEVICE SIZE NODE NAME
ocssd.bin 3010 grid   55u  IPv4  19676       UDP racnode1-priv:36613 

To block the heartbeat, all outgoing traffic on port 36613 is dropped:
iptables -A OUTPUT -s 192.168.2.151 -p udp --sport 36613 -j DROP

Based on scenario 1, I expected to see the same sequence of events; in other words, I expected node2 to be evicted, reconfigured, and rejoined to the cluster.
However, in this case, node2 is evicted and then, as shown below, CSSD hangs while starting up and joining the cluster.

[root@racnode1 ~]# crsctl check cluster -all
**************************************************************
racnode1:
CRS-4537: Cluster Ready Services is online
CRS-4529: Cluster Synchronization Services is online
CRS-4533: Event Manager is online
**************************************************************
racnode2:
CRS-4535: Cannot communicate with Cluster Ready Services
CRS-4530: Communications failure contacting Cluster Synchronization Services daemon
CRS-4534: Cannot communicate with Event Manager
**************************************************************
[root@racnode2 ~]# crsctl stat res -init -t
--------------------------------------------------------------------------------
NAME           TARGET  STATE        SERVER                   STATE_DETAILS       
--------------------------------------------------------------------------------
Cluster Resources
--------------------------------------------------------------------------------
ora.asm
      1        ONLINE  OFFLINE                               Abnormal Termination
ora.cluster_interconnect.haip
      1        ONLINE  OFFLINE                                                   
ora.crf
      1        ONLINE  ONLINE       racnode2                                     
ora.crsd
      1        ONLINE  OFFLINE                                                   
ora.cssd
      1        ONLINE  OFFLINE                               STARTING      
ora.cssdmonitor
      1        ONLINE  ONLINE       racnode2                                     
ora.ctssd
      1        ONLINE  OFFLINE                                                   
ora.diskmon
      1        OFFLINE OFFLINE                                                   
ora.drivers.acfs
      1        ONLINE  ONLINE       racnode2                                     
ora.evmd
      1        ONLINE  OFFLINE                                                   
ora.gipcd
      1        ONLINE  ONLINE       racnode2                                     
ora.gpnpd
      1        ONLINE  ONLINE       racnode2                                     
ora.mdnsd
      1        ONLINE  ONLINE       racnode2          

Even unblocking the port by dropping the rule from iptables does not help; CSS on node2 is still not able to join the cluster.

iptables -L      
      
iptables -D OUTPUT -s 192.168.2.151 -p udp --sport 36613 -j DROP


[root@racnode1 ~]# iptables -L      
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination  


After reviewing all the logs (quite lengthy, so I avoid copying them here!), it can be seen that ocssd on node2 complains about the network heartbeat and no reconfiguration is attempted; on node1, after UDP was blocked, the interface was disabled and no attempt is made to set up communication differently.
As I mentioned earlier, although the UDP port is unblocked, the following errors are still reported repeatedly on node2 and node1.

Node 2
==========
 [    CSSD][3013077904]clssnmvDHBValidateNcopy: node 1, racnode1, has a disk HB, but no network HB, DHB has rcfg 249572810, wrtcnt, 181183, LATS 1471304, lastSeqNo 181182, uniqueness 1356788464, timestamp 1356790643/1482244
 
Node1 
=========
[GIPCHALO][3023862672] gipchaLowerProcessNode: no valid interfaces found to node for 25790 ms, node 0xa062a88 { host 'racnode2', haName 'CSS_racnode-cluster', srcLuid be28e076-9f3aafb1, dstLuid 61a1f895-ba260945 numInf 0, contigSeq 2639, lastAck 2626, lastValidAck 2638, sendSeq [2627 : 2683], createTime 4294328280, sentRegister 1, localMonitor 1, flags 0x2408 }

It turned out that the issue is reported as Bug 14281269: "NODE CAN'T REJOIN THE CLUSTER AFTER A TEMPORARY INTERCONNECT FAILURE - PROBLEM: after an interconnect failure on the first node the second node restarts the clusterware (rebootless restart) as expected, but can't join the cluster again till the interconnect interface of node1 is not shutdown/startup manually".
At the time of posting this, there is no patch available, and the suggested workaround is to bounce the interconnect interface.
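For reference, the workaround amounts to something like the following on node1. This is my own sketch, not from the bug note: `eth1` is an assumed interface name (find the real interconnect NIC with `oifcfg getif`), and the DRY_RUN guard just prints the commands; drop it and run as root to actually bounce the NIC.

```shell
# Sketch of the suggested workaround: bounce the private interconnect NIC
# on node1 so its interfaces get re-registered. "eth1" is an assumption;
# check the real interconnect NIC with: oifcfg getif
bounce_nic() {
  ${DRY_RUN:+echo} ifconfig "$1" down
  ${DRY_RUN:+echo} sleep 5
  ${DRY_RUN:+echo} ifconfig "$1" up
}
# Dry run: prints the commands instead of executing them.
DRY_RUN=1 bounce_nic eth1
```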

In my test, even bouncing node2 (the evicted node) did not help, and I ended up killing the gipc daemon on node1 (the master/surviving node). That did help: the whole cluster recovered and node2 was able to rejoin the cluster.

[root@racnode1 working]# ps -ef |grep -i gipc
grid      2961     1  0 05:40 ?        00:00:16 /u01/app/11.2.0/grid/bin/gipcd.bin
root      7709  4792  0 06:44 pts/1    00:00:00 grep -i gipc
[root@racnode1 working]# kill -9 2961
[root@racnode1 working]# ps -ef |grep -i gipc
grid      7717     1 15 06:44 ?        00:00:00 /u01/app/11.2.0/grid/bin/gipcd.bin
root      7755  4792  0 06:44 pts/1    00:00:00 grep -i gipc


[/u01/app/11.2.0/grid/bin/oraagent.bin(3528)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/oraagent_grid' disconnected from server. Details at (:CRSAGF00117:) {0:1:5} in /u01/app/11.2.0/grid/log/racnode1/agent/crsd/oraagent_grid/oraagent_grid.log.
2012-12-29 06:44:56.972
[/u01/app/11.2.0/grid/bin/orarootagent.bin(3535)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/orarootagent_root' disconnected from server. Details at (:CRSAGF00117:) {0:2:23} in /u01/app/11.2.0/grid/log/racnode1/agent/crsd/orarootagent_root/orarootagent_root.log.
2012-12-29 06:44:56.974
[/u01/app/11.2.0/grid/bin/oraagent.bin(3741)]CRS-5822:Agent '/u01/app/11.2.0/grid/bin/oraagent_oracle' disconnected from server. Details at (:CRSAGF00117:) {0:5:63} in /u01/app/11.2.0/grid/log/racnode1/agent/crsd/oraagent_oracle/oraagent_oracle.log.
2012-12-29 06:44:57.098
[ohasd(2414)]CRS-2765:Resource 'ora.ctssd' has failed on server 'racnode1'.
2012-12-29 06:44:59.141
[ctssd(7732)]CRS-2401:The Cluster Time Synchronization Service started on host racnode1.
2012-12-29 06:44:59.141
[ctssd(7732)]CRS-2407:The new Cluster Time Synchronization Service reference node is host racnode1.
2012-12-29 06:45:01.164
[cssd(3010)]CRS-1601:CSSD Reconfiguration complete. Active nodes are racnode1 racnode2 .
2012-12-29 06:45:02.363
[crsd(7759)]CRS-1012:The OCR service started on node racnode1.
2012-12-29 06:45:03.155
[evmd(7762)]CRS-1401:EVMD started on node racnode1.
2012-12-29 06:45:05.147
[crsd(7759)]CRS-1201:CRSD started on node racnode1.
2012-12-29 06:45:38.798
[crsd(7759)]CRS-2772:Server 'racnode2' has been assigned to pool 'Generic'.
2012-12-29 06:45:38.800
[crsd(7759)]CRS-2772:Server 'racnode2' has been assigned to pool 'ora.orcl'.


===== alert for node2 =========

2012-12-29 06:39:38.132
[cssd(7700)]CRS-1605:CSSD voting file is online: /dev/sda1; details in /u01/app/11.2.0/grid/log/racnode2/cssd/ocssd.log.
2012-12-29 06:45:01.165
[cssd(7700)]CRS-1601:CSSD Reconfiguration complete. Active nodes are racnode1 racnode2 .
2012-12-29 06:45:03.641
[ctssd(8061)]CRS-2401:The Cluster Time Synchronization Service started on host racnode2.
2012-12-29 06:45:03.641
[ctssd(8061)]CRS-2407:The new Cluster Time Synchronization Service reference node is host racnode1.
2012-12-29 06:45:05.257
[ohasd(2405)]CRS-2767:Resource state recovery not attempted for 'ora.diskmon' as its target state is OFFLINE
2012-12-29 06:45:16.836
[ctssd(8061)]CRS-2408:The clock on host racnode2 has been updated by the Cluster Time Synchronization Service to be synchronous with the mean cluster time.
2012-12-29 06:45:25.475
[crsd(8199)]CRS-1012:The OCR service started on node racnode2.
2012-12-29 06:45:25.541
[evmd(8079)]CRS-1401:EVMD started on node racnode2.
2012-12-29 06:45:27.331
[crsd(8199)]CRS-1201:CRSD started on node racnode2.
2012-12-29 06:45:35.642
[/u01/app/11.2.0/grid/bin/oraagent.bin(8321)]CRS-5016:Process "/u01/app/11.2.0/grid/opmn/bin/onsctli" spawned by agent "/u01/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.0/grid/log/racnode2/agent/crsd/oraagent_grid/oraagent_grid.log"
2012-12-29 06:45:36.181
[/u01/app/11.2.0/grid/bin/oraagent.bin(8347)]CRS-5011:Check of resource "orcl" failed: details at "(:CLSN00007:)" in "/u01/app/11.2.0/grid/log/racnode2/agent/crsd/oraagent_oracle/oraagent_oracle.log"
2012-12-29 06:45:37.301
[/u01/app/11.2.0/grid/bin/oraagent.bin(8321)]CRS-5016:Process "/u01/app/11.2.0/grid/bin/lsnrctl" spawned by agent "/u01/app/11.2.0/grid/bin/oraagent.bin" for action "check" failed: details at "(:CLSN00010:)" in "/u01/app/11.2.0/grid/log/racnode2/agent/crsd/oraagent_grid/oraagent_grid.log"

To conclude, in a two-node RAC:

  1. Network hiccups on the heartbeat port on node2 are recovered automatically.
  2. Network hiccups on the heartbeat port on node1 require manual intervention due to bug 14281269.
  3. Due to several reported bugs, it is recommended to be on at least 11.2.0.3 PSU3. Check out the MOS note for other bugs: "List of gipc defects that prevent GI from starting/joining after network hiccups" (ID 1488378.1).
