As DBAs, we have all heard the old axiom that 80-90% of database performance issues are query related. I have a similar axiom about Oracle RAC: 90% of all cluster startup issues are either disk (voting disk/OCR) or interconnect related.
Today I forgot the second part of that axiom when I could not get two nodes of a three-node cluster started. I ignored the interconnect for two reasons. First, I had already checked it with ifconfig, and the NIC was up on all three nodes. Second, the ocssd.log file went on and on with this error:
clssnmReadDskHeartbeat: node(1) is down. rcfg(2) wrtcnt(4715) LATS(1135926) Disk lastSeqNo(4715)
The counters kept incrementing and the log grew rapidly. After changing several settings in the multipathing configuration and in /etc/udev/rules.d with no improvement, I tried the old test:
ping -b 1.1.1.255
That is, send a broadcast ping to the broadcast address of the interconnect subnet. Nodes 2 and 3 could see each other, but node 1 could see only itself. Once I fixed the private VLAN issue, all was well and the cluster came straight up.
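If broadcast pings are not an option (Linux hosts are often configured to ignore broadcast echo requests), the same check can be done one address at a time. What follows is only a minimal sketch, not part of the original troubleshooting: the function name check_interconnect and the PING_CMD override are made up for illustration, the 1.1.1.x addresses are placeholders in the spirit of the example above, and the -c and -W flags assume the Linux iputils version of ping.

```shell
#!/usr/bin/env bash
# Sketch: ping each private interconnect address from the local node and
# report which ones answer. Run it on every node and compare the results.

# PING_CMD is a hypothetical hook so the check can be dry-run with a stub.
PING_CMD="${PING_CMD:-ping}"

check_interconnect() {
    local failed=0 ip
    for ip in "$@"; do
        # -c 1: send a single probe; -W 2: wait at most 2 seconds for a reply
        if "$PING_CMD" -c 1 -W 2 "$ip" >/dev/null 2>&1; then
            echo "OK   $ip"
        else
            echo "FAIL $ip"
            failed=1
        fi
    done
    # Non-zero exit status if any address did not answer
    return "$failed"
}

# Example call (placeholder addresses; substitute your own private IPs):
#   check_interconnect 1.1.1.1 1.1.1.2 1.1.1.3
```

In a situation like the one above, I would expect node 1 to report FAIL for the other two private addresses while nodes 2 and 3 report OK for each other, which points at the network rather than the disks.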
The moral of this story is that just because it walks like a disk problem, talks like a disk problem, and acts like a disk problem, in Clusterware it might just be a network issue.