DRBD2, OCFS2, Unexplained crashes <p>I was trying to set up a dual-primary DRBD environment with a shared disk running either OCFS2 or GFS. The environment is CentOS 5.3 with DRBD82 (I also tried DRBD83 from testing).</p> <p>Setting up a single-primary disk and running bonnie++ on it worked. Setting up a dual-primary disk, mounting it on only one node (ext3) and running bonnie++ also worked.</p> <p>When I set up OCFS2 on the /dev/drbd0 disk and mounted it on both nodes, basic functionality seemed to be in place, but usually less than 5-10 minutes after I started bonnie++ as a test on one of the nodes, both nodes power cycled with no errors in the log files, just a crash.</p> <p>Watching the console at the time of the crash, it looks like disk I/O blocks (you can type, but nothing happens), followed by a reboot: no panics, no oops, nothing (sysctl panic values set to timeouts etc.).<br /> Setting up a dual-primary disk with OCFS2, mounting it on only one node and starting bonnie++ causes only that node to crash.</p> <p>At the DRBD level I got the following error when that node disappears:<br /> <div class="geshifilter"><pre class="text geshifilter-text" style="font-family:monospace;"><ol><li style="font-family: monospace; font-weight: normal;"><div style="font-family: monospace; font-weight: normal; font-style: normal">drbd0: PingAck did not arrive in time.</div></li><li style="font-family: monospace; font-weight: normal;"><div style="font-family: monospace; font-weight: normal; font-style: normal">drbd0: peer( Primary -&gt; Unknown ) conn( Connected -&gt; NetworkFailure )</div></li><li style="font-family: monospace; font-weight: normal;"><div style="font-family: monospace; font-weight: normal; font-style: normal">pdsk(UpToDate -&gt; DUnknown )</div></li><li style="font-family: monospace; font-weight: normal;"><div 
style="font-family: monospace; font-weight: normal; font-style: normal">drbd0: asender terminated</div></li><li style="font-family: monospace; font-weight: normal;"><div style="font-family: monospace; font-weight: normal; font-style: normal">drbd0: Terminating asender thread</div></li></ol></pre></div><br /> That, however, is an expected error because of the reboot.</p> <p>At first I assumed OCFS2 to be the root of the problem, so I moved forward and set up an iSCSI target on a third node and used that device with the same OCFS2 setup. There no crashes occurred and bonnie++ completed its test run flawlessly.</p> <p>So my attention went back to the combination of DRBD and OCFS2.<br /> I tried both DRBD 8.2 (drbd82-8.2.6-1.el5.centos, kmod-drbd82-8.2.6-2) and the 83 variant from CentOS Testing.</p> <p>At first I was trying with the ocfs2 1.4.1-1.el5.i386.rpm version, but upgrading to 1.4.2-1.el5.i386.rpm didn't change the behaviour.</p> <p>Both the DRBD and the OCFS2 mailing lists were fairly supportive, pointing out that it was probably OCFS2 fencing both hosts after missing the heartbeat, and suggested increasing the deadtime timeout values.</p> <p>I wanted to confirm that, however. As I got no entries in syslog, I attached a Cyclades, err, Avocent terminal server to the device in the hope that I'd capture the last kernel messages there ... 
no such luck either.</p> <p>On the OCFS2 mailing list people suggested that I use netconsole to catch the logs on a remote node.<br /> I set up netconsole using</p> <p><div class="geshifilter"><pre class="text geshifilter-text" style="font-family:monospace;"><ol><li style="font-family: monospace; font-weight: normal;"><div style="font-family: monospace; font-weight: normal; font-style: normal">modprobe netconsole netconsole=&quot;@/,@172.16.32.1/&quot;</div></li><li style="font-family: monospace; font-weight: normal;"><div style="font-family: monospace; font-weight: normal; font-style: normal">sysctl -w kernel.printk=&quot;7 4 1 7&quot;</div></li></ol></pre></div></p> <p>After which I indeed caught the error on my remote host:<br /> <div class="geshifilter"><pre class="text geshifilter-text" style="font-family:monospace;"><ol><li style="font-family: monospace; font-weight: normal;"><div style="font-family: monospace; font-weight: normal; font-style: normal">[base-root@CCMT-A ~]# nc -l -u -p 6666</div></li><li style="font-family: monospace; font-weight: normal;"><div style="font-family: monospace; font-weight: normal; font-style: normal">(8,0):o2hb_write_timeout:166 ERROR: Heartbeat write timeout to device</div></li><li style="font-family: monospace; font-weight: normal;"><div style="font-family: monospace; font-weight: normal; font-style: normal">drbd0 after 478000 milliseconds</div></li><li style="font-family: monospace; font-weight: normal;"><div style="font-family: monospace; font-weight: normal; font-style: normal">(8,0):o2hb_stop_all_regions:1873 ERROR: stopping heartbeat on all active</div></li><li style="font-family: monospace; font-weight: normal;"><div style="font-family: monospace; font-weight: normal; font-style: normal">regions.</div></li><li style="font-family: monospace; font-weight: normal;"><div style="font-family: monospace; font-weight: normal; font-style: normal">ocfs2 is very sorry to be fencing this system by 
restarting</div></li></ol></pre></div></p> <p>You'd think it would output over the serial console before it logs over the network :) It doesn't.</p> <p>The next step is that I'll start fiddling some more with the timeout values :) (note the ":)")</p> Kris Buytaert, Wed, 01 Jul 2009 14:01:57 +0000
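<p>For anyone fiddling with those timeout values, here is a small sketch of the arithmetic as I understand it from the OCFS2 FAQ: the fencing timeout in seconds is assumed to be (O2CB_HEARTBEAT_THRESHOLD - 1) * 2, because o2hb writes a heartbeat every 2 seconds. The function names below are made up for illustration. Under that assumption the 478000 milliseconds in the log above would correspond to a threshold of 240, but the post does not state the configured value, so treat that as an inference.</p>

```python
# Sketch only: relates the O2CB heartbeat threshold to the fencing timeout.
# Assumption (from the OCFS2 FAQ, not from this post):
#   timeout_seconds = (O2CB_HEARTBEAT_THRESHOLD - 1) * 2

HEARTBEAT_INTERVAL_S = 2  # assumed o2hb heartbeat write interval in seconds


def timeout_ms(threshold: int) -> int:
    """Milliseconds without a heartbeat write before the node fences itself."""
    if threshold < 2:
        raise ValueError("threshold must be at least 2")
    return (threshold - 1) * HEARTBEAT_INTERVAL_S * 1000


def threshold_for(timeout_s: int) -> int:
    """Smallest threshold that tolerates at least timeout_s seconds."""
    # Ceiling division by the heartbeat interval, plus one.
    return -(-timeout_s // HEARTBEAT_INTERVAL_S) + 1


if __name__ == "__main__":
    print(timeout_ms(240))     # timeout implied by a threshold of 240
    print(threshold_for(120))  # threshold needed to tolerate two minutes
```

<p>The resulting number would go into O2CB_HEARTBEAT_THRESHOLD in /etc/sysconfig/o2cb, with the same value on every node in the cluster.</p>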