Everything is a Freaking DNS problem - gfs http://127.0.0.1:8080/blog/taxonomy/term/434/0 en DRBD2, OCFS2, Unexplained crashes http://127.0.0.1:8080/blog/drbd2-ocfs2-unexplained-crashes <p>I was trying to set up a dual-primary DRBD environment with a shared disk running either OCFS2 or GFS. The environment is CentOS 5.3 with DRBD82 (I also tried DRBD83 from testing).</p> <p>Setting up a single-primary disk and running bonnie++ on it worked. Setting up a dual-primary disk, mounting it on only one node (ext3) and running bonnie++ also worked.</p> <p>When setting up OCFS2 on the /dev/drbd0 disk and mounting it on both nodes, basic functionality seemed to be in place, but usually within 5-10 minutes after I started bonnie++ as a test on one of the nodes, both nodes power cycled with no errors in the logfiles, just a crash.</p> <p>Watching the console at the time of the crash, it looks like disk I/O blocks (you can type, but no actions happen) and then a reboot follows: no panics, no oops, nothing (sysctl panic values set to timeouts etc.).<br /> Setting up a dual-primary disk with OCFS2, mounting it on only one node and starting bonnie++ causes only that node to crash.</p> <p>At the DRBD level I got the following error when that node disappeared:<br /> <div class="geshifilter"><pre class="text geshifilter-text" style="font-family:monospace;"><ol><li style="font-family: monospace; font-weight: normal;"><div style="font-family: monospace; font-weight: normal; font-style: normal">drbd0: PingAck did not arrive in time.</div></li><li style="font-family: monospace; font-weight: normal;"><div style="font-family: monospace; font-weight: normal; font-style: normal">drbd0: peer( Primary -&gt; Unknown ) conn( Connected -&gt; NetworkFailure )</div></li><li style="font-family: monospace; font-weight: normal;"><div style="font-family: monospace; font-weight: normal; font-style: normal">pdsk(UpToDate -&gt; DUnknown )</div></li><li style="font-family: monospace; font-weight: normal;"><div 
style="font-family: monospace; font-weight: normal; font-style: normal">drbd0: asender terminated</div></li><li style="font-family: monospace; font-weight: normal;"><div style="font-family: monospace; font-weight: normal; font-style: normal">drbd0: Terminating asender thread</div></li></ol></pre></div><br /> That, however, is an expected error because of the reboot.</p> <p>At first I assumed OCFS2 to be the root of this problem, so I moved forward and set up an iSCSI target on a third node and used that device with the same OCFS2 setup. There, no crashes occurred and bonnie++ completed its test run flawlessly.</p> <p>So my attention went back to the combination of DRBD and OCFS2.<br /> I tried both DRBD 8.2 (drbd82-8.2.6-1.el5.centos, kmod-drbd82-8.2.6-2) and the 83 variant from CentOS Testing.</p> <p>At first I was trying with the ocfs2 1.4.1-1.el5.i386.rpm version, but upgrading to 1.4.2-1.el5.i386.rpm didn't change the behaviour.</p> <p>Both the DRBD and the OCFS2 mailing lists were fairly supportive, pointing out that it was probably OCFS2 fencing both hosts after missing the heartbeat, and suggested increasing the deathtimetimeout values.</p> <p>I wanted to confirm that, however. As I got no entries in syslog, I attached a Cyclades, err, Avocent terminal server to the device in the hope that I'd capture the last kernel messages there ... 
no such luck either.</p> <p>On the OCFS2 mailing list people suggested that I use netconsole to catch the logs on a remote node.<br /> I set up netconsole using:</p> <p><div class="geshifilter"><pre class="text geshifilter-text" style="font-family:monospace;"><ol><li style="font-family: monospace; font-weight: normal;"><div style="font-family: monospace; font-weight: normal; font-style: normal">modprobe netconsole netconsole=&quot;@/,@172.16.32.1/&quot;</div></li><li style="font-family: monospace; font-weight: normal;"><div style="font-family: monospace; font-weight: normal; font-style: normal">sysctl -w kernel.printk=&quot;7 4 1 7&quot;</div></li></ol></pre></div></p> <p>After which I indeed caught the error on my remote host:<br /> <div class="geshifilter"><pre class="text geshifilter-text" style="font-family:monospace;"><ol><li style="font-family: monospace; font-weight: normal;"><div style="font-family: monospace; font-weight: normal; font-style: normal">[base-root@CCMT-A ~]# nc -l -u -p 6666</div></li><li style="font-family: monospace; font-weight: normal;"><div style="font-family: monospace; font-weight: normal; font-style: normal">(8,0):o2hb_write_timeout:166 ERROR: Heartbeat write timeout to device</div></li><li style="font-family: monospace; font-weight: normal;"><div style="font-family: monospace; font-weight: normal; font-style: normal">drbd0 after 478000 milliseconds</div></li><li style="font-family: monospace; font-weight: normal;"><div style="font-family: monospace; font-weight: normal; font-style: normal">(8,0):o2hb_stop_all_regions:1873 ERROR: stopping heartbeat on all active</div></li><li style="font-family: monospace; font-weight: normal;"><div style="font-family: monospace; font-weight: normal; font-style: normal">regions.</div></li><li style="font-family: monospace; font-weight: normal;"><div style="font-family: monospace; font-weight: normal; font-style: normal">ocfs2 is very sorry to be fencing this system by 
restarting</div></li></ol></pre></div></p> <p>You'd think it would output to the serial console before it logs over the network :) It doesn't.</p> <p>Next step is that I'll start fiddling some more with the timeout values :) (note the ":)")</p> http://127.0.0.1:8080/blog/drbd2-ocfs2-unexplained-crashes#comments drbd gfs ha iscsi netconsole ocfs2 Wed, 01 Jul 2009 14:01:57 +0000 Kris Buytaert 922 at http://127.0.0.1:8080/blog On the Future of Lustre http://127.0.0.1:8080/blog/future-lustre <p><a href="//www.prnewswire.com/cgi-bin/stories.pl?ACCT=104&amp;STORY=/www/story/09-12-2007/0004661104&amp;EDATE" rel="nofollow"><br /> So Sun bought ClusterFS</a>. I'm wondering what their focus will be now. What will be the prime platform on which Lustre will be developed, Solaris or Linux? Will other efforts in the open source cluster filesystem area react to this? Will Lustre development speed up? Will management become less complex?<br /> Time will tell... I'm keeping an eye on it.</p> http://127.0.0.1:8080/blog/future-lustre#comments cluster gfs ha linux lustre sun Fri, 14 Sep 2007 09:53:03 +0000 Kris Buytaert 448 at http://127.0.0.1:8080/blog LinuxConference Europe 2007 2/X http://127.0.0.1:8080/blog/node/439 <p>Sunday evening was the conference dinner. Someone thought it was really funny to have us all walk about 3 km more than we needed to. The instructions on the back of our entrance tickets gave us a full tour of the Cambridge suburbs. I should have followed my gut, not the people trying to read the instructions; that would have saved us half an hour at least.<br /> Luckily we took the short way back. Dinner was typically English... nuff said :) </p> <p>So Monday started out with a whole bunch of sessions related to filesystems and storage.<br /> Bryn M Reeves gave a really good intro to LVM, then Jan Blunck took over and started talking about how to scale the Device Mapper snapshot solution. 
I tried to see Dag's talk on dstat, but I'll have to try again at T-Dose as I missed the largest part of the talk due to some phone calls :(<br /> Next up was Olaf "thank god I'm not doing NFS anymore" Kirch (who also listens when you just shout Lars in the streets of Cambridge), who introduced us to iSNS. </p> <p>After lunch the filesystem track continued with Steven Whitehouse talking about VFS and cluster filesystems, Jorn talking to us about the future of flash disks and their appropriate filesystems, and Chris Mason from Oracle finishing off with a talk on Btrfs, pronounced "ButterFS".</p> <p>There were two different buses to the Duxford air museum, which was a bit of a pity since the two groups didn't really meet each other, so it wasn't really a social event where you could chat and meet with everybody at the conference.</p> http://127.0.0.1:8080/blog/node/439#comments btrfs butterfs cambridge cluster gfs linuxconf.eu.2007 linuxconference ukuug Tue, 04 Sep 2007 10:54:53 +0000 Kris Buytaert 439 at http://127.0.0.1:8080/blog