We quickly chose the CentOS distribution as our weapon of choice because you can't beat the price (it's free), it is built on very solid code (Red Hat Enterprise Linux), and it has a long support lifetime. If I am not mistaken CentOS 5 will be supported for something like 8 years.
Our next decision was to invest in a SAN of some kind so we could speed up data access while adding redundancy and room to grow. We eventually purchased the SR1521 device from Coraid (http://www.coraid.com), and let me tell you, we will be buying more devices from them. This machine is unbelievable and uses the ATA over Ethernet protocol to move data quickly over a gigabit network. I will make another post about this device in the near future.
All in all we had 12 servers and a new storage network to use, so we immediately began researching clustering file systems. Since CentOS is a Red Hat derivative we ultimately decided to use GFS, as it is natively supported (that doesn't mean that it's easy to set up) and is used in some very large clusters worldwide (which tells us it is production ready). We use GFS to share things like web server directories, various configuration directories and so on. This makes it incredibly easy for us to add a new server into the fold and have it up and running quickly.
I noticed that there isn't a ton of info on this subject on the net and the Red Hat documentation was a little confusing, so I will share how we got it working for us.
First things first, when you are installing CentOS 5 be sure to install the Cluster FS option. You can include whatever else you would like, but this package is absolutely necessary. After install I immediately do the following:
yum install ntp
chkconfig ntpd on
service ntpd start
It is vital that the machines in your cluster are in sync as far as time is concerned. If they are out of sync it can cause problems later when more than one machine is accessing the same file at the same time.
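If you want to confirm that the clocks really do agree, something like the following could work. This is only a sketch and not part of our setup: the function name check_clock_skew and the SSH override variable are made up for illustration, and it assumes you can ssh between nodes without a password prompt.

```shell
#!/bin/sh
# check_clock_skew MAX_SECONDS NODE...
# Hypothetical helper: prints one line for every node whose clock differs
# from the local clock by more than MAX_SECONDS, or that cannot be reached.
# Set SSH to override the remote command runner (defaults to ssh).
check_clock_skew() {
    max="$1"
    shift
    for node in "$@"; do
        # Read the local clock right before the remote one.
        local_now=$(date +%s)
        if ! remote_now=$(${SSH:-ssh} "$node" date +%s); then
            echo "$node: unreachable"
            continue
        fi
        skew=$((remote_now - local_now))
        if [ "$skew" -lt 0 ]; then
            skew=$((0 - skew))
        fi
        if [ "$skew" -gt "$max" ]; then
            echo "$node: clock differs by ${skew}s"
        fi
    done
}
```

Running "check_clock_skew 2 db1 db2" would then print nothing when both nodes are within two seconds of the local machine. The ssh round trip adds a little noise, so a threshold below a second or two is probably not meaningful with this approach.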
The next thing I do is add my GFS mount point folder to the /etc/updatedb.conf file. Basically, this file has a line of all folders to NOT include when updatedb runs. Updatedb is a very nice indexing service that allows you to use the "locate" command to search for files and directories on your machine. A very handy tool, but when you have 11 machines banging on every byte of a multi-terabyte SAN at the exact same time it causes massive problems, and in fact our cluster was crashing EVERY morning between the hours of 4 and 7 am. You can take a look at my frustration here:
http://www.centos.org/modules/newbb/viewtopic.php?topic_id=11432&forum=41
The mount point that we use is /san so I simply added this to the /etc/updatedb.conf file like so:
PRUNEPATHS = "/afs /media /net /sfs /tmp /udev /var/spool/cups /var/spool/squid /var/tmp /san"
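Since this change has to be made on every node, it can be handy to script it. Here is a minimal sketch, assuming the PRUNEPATHS line is a single double-quoted, space-separated list like the one above. The function name add_prunepath is hypothetical, and the directory is assumed to contain no characters that are special to sed or grep (such as "|" or "&").

```shell
#!/bin/sh
# add_prunepath FILE DIR
# Hypothetical helper: append DIR to the PRUNEPATHS list in FILE unless it
# is already listed. Running it twice leaves the file unchanged.
add_prunepath() {
    conf="$1"
    dir="$2"
    # Already present as a whole entry inside the quoted list? Then stop.
    if grep -Eq "^PRUNEPATHS[[:space:]]*=[[:space:]]*\"([^\"]* )?${dir}( [^\"]*)?\"" "$conf"; then
        return 0
    fi
    # Insert the directory just before the closing quote of the list.
    tmp="${conf}.tmp.$$"
    sed "s|^\(PRUNEPATHS[[:space:]]*=[[:space:]]*\"[^\"]*\)\"|\1 ${dir}\"|" "$conf" > "$tmp" &&
        cat "$tmp" > "$conf"
    rm -f "$tmp"
}
```

On a real node the call would be "add_prunepath /etc/updatedb.conf /san", run as root. Writing the result back with cat rather than mv keeps the original file's owner and permissions.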
When you are starting your cluster with your first machines there are a few files to set up. The first is /etc/lvm/lvm.conf. You don't need to use LVM for a GFS filesystem, but we do.
The only thing I do to this file is change the scan directory. Since I am using AoE with the Coraid device I simply changed my scan line to look like this:
scan = [ "/dev/etherd" ]
In our cluster the only logical volume we are using is on the Coraid device, so I didn't feel like scanning all of /dev every time. You could presumably keep this file at the defaults, or change it like I did to be more specific and hopefully make boot time a little quicker.
The next file we are going to set up is a biggie, and that is /etc/cluster/cluster.conf. Basically, this file is the granddaddy of them all in terms of GFS: it tells the cluster who is a member, how the members should work together, and so on. Here is a stripped down version of the file we use:
<?xml version="1.0"?>
<cluster config_version="25" name="san1">
  <fence_daemon post_fail_delay="0" post_join_delay="120"/>
  <clusternodes>
    <clusternode name="db1" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="apc1" port="db1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="db2" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="apc1" port="db2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman/>
  <fencedevices>
    <fencedevice agent="fence_aoemask" name="fence-e1.0" shelf="1" slot="0" interface="eth1"/>
    <fencedevice name="apc1" agent="fence_apc" ipaddr="192.168.2.247" login="somelogin" passwd="xxxxxx"/>
  </fencedevices>
  <rm>
    <failoverdomains/>
  </rm>
</cluster>
There are some important things to note here. You will see that I am using hostnames (db1 and db2 here). You use hostnames to identify the nodes in your cluster, and therefore every server needs to have the same entries for these hostnames in its /etc/hosts file. You may be thinking to yourself, "well I could just use DNS for that." It is recommended to use the hosts file because it is quicker (no network latency to look up host names) and inherently more reliable: you don't rely on a few machines to translate names; rather, every machine in the cluster knows exactly where the others are. Whenever I modify my hosts file I simply scp the /etc/hosts file to every machine in the cluster.
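Keeping /etc/hosts in step with cluster.conf is easy to get wrong by hand, so here is a rough sketch of how the check and the copy could be scripted. None of this is from our actual setup: the function names and the SCP override variable are hypothetical, and the parsing assumes each clusternode tag sits on one line, as in the file above.

```shell
#!/bin/sh
# All three helpers are hypothetical and assume cluster.conf keeps each
# <clusternode ...> tag on a single line.

# cluster_nodes CONF: print the name of every clusternode, one per line.
cluster_nodes() {
    sed -n 's/.*<clusternode[[:space:]][^>]*name="\([^"]*\)".*/\1/p' "$1"
}

# missing_hosts CONF HOSTS: print node names that have no entry in HOSTS.
missing_hosts() {
    for node in $(cluster_nodes "$1"); do
        awk -v n="$node" '
            /^[[:space:]]*#/ { next }
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }
        ' "$2" || echo "$node"
    done
}

# push_hosts CONF HOSTS: copy HOSTS to /etc/hosts on every node in CONF.
# Set SCP to override the copy command (defaults to scp).
push_hosts() {
    for node in $(cluster_nodes "$1"); do
        ${SCP:-scp} "$2" "${node}:/etc/hosts"
    done
}
```

A typical run might be "missing_hosts /etc/cluster/cluster.conf /etc/hosts" to see which names still need an entry, followed by "push_hosts /etc/cluster/cluster.conf /etc/hosts" once the list comes back empty. Note that push_hosts also copies to the machine you run it on if that machine is a cluster member.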
The second item you will notice in the cluster.conf file is the fencing section. Fencing is ultra ultra important to a GFS cluster. Basically, the cluster needs a way to cut off any node that it deems unsafe to the cluster as a whole from the data stored on GFS. This ensures the reliability of your data. The recommended way to do fencing is through a power switch that can be controlled over the network, but you can also do it at the SAN level. The Coraid device has a way to filter by MAC address and we actually used that for a while, but then switched to the power option because it was easier to work with. I will talk about these specific fencing options in a later post.
I am embarrassed to admit it, but we actually ran our cluster with manual fencing for a while because I didn't know about the Coraid MAC option yet and we hadn't purchased APC power strips yet. Let me just say that you can do it, but you will undoubtedly run into a problem like this. With manual fencing, if a node dies unexpectedly or is deemed unsafe, GFS doesn't know how to turn the node off, so it does the next best thing and locks EVERYONE out of the filesystem. NONE of your other machines will be able to read data from the GFS cluster until they ALL are rebooted. Oh yeah, and CentOS 5 specifically won't respond to the reboot command. It will try to reboot, but will hang forever. You have to physically cut the power to the server or press the power button to bring the machines and the cluster back. Needless to say, this is not a good option if your datacenter is miles away and it is 3 in the morning.
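Given how painful a node without working fencing can be, it may be worth checking that every node in cluster.conf actually has a fence device listed. The following is just a sketch under the same one-tag-per-line assumption as the example file. The function name nodes_without_fence is hypothetical, and it only checks that a device entry exists, not that the device works.

```shell
#!/bin/sh
# nodes_without_fence CONF
# Hypothetical helper: print the name of every <clusternode> in CONF that
# has no <device .../> entry between its opening and closing tags.
nodes_without_fence() {
    awk '
        # Opening tag of a node: remember its name and reset the flag.
        /<clusternode[[:space:]]/ {
            name = $0
            sub(/.*name="/, "", name)
            sub(/".*/, "", name)
            has = 0
        }
        # A fence device listed inside the current node.
        /<device[[:space:]]/ && name != "" { has = 1 }
        # Closing tag: report the node if no device was seen.
        /<\/clusternode>/ {
            if (!has) print name
            name = ""
        }
    ' "$1"
}
```

Running "nodes_without_fence /etc/cluster/cluster.conf" against the file shown earlier should print nothing, since both db1 and db2 point at the apc1 device.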
Once you have your /etc/cluster/cluster.conf, /etc/lvm/lvm.conf and /etc/hosts files ready to go you can start your cluster with the following commands:
service cman start
service clvmd start
service gfs start
Then you can mount your logical volume like this:
mount -t gfs /dev/san1/lvol0 /san -o noatime
Another performance booster of note is the "-o noatime" option. atime is a timestamp for the time the file was last accessed. You may in fact need it for your apps, but ours and many others couldn't care less when the last access was. With atime on you are forcing a small write for every read. If you don't need this timestamp then using noatime will boost the performance of your GFS volumes.
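To double check that a volume really was mounted with noatime, you can look at the options column of the mount table. Here is a small sketch. The function name mount_has_option is hypothetical, and it assumes the usual /proc/mounts layout of device, mount point, type, then a comma-separated option list.

```shell
#!/bin/sh
# mount_has_option MOUNTPOINT OPTION [TABLE]
# Hypothetical helper: succeeds if MOUNTPOINT is listed in TABLE (default
# /proc/mounts) with OPTION among its comma-separated mount options.
mount_has_option() {
    awk -v mp="$1" -v opt="$2" '
        $2 == mp {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) if (opts[i] == opt) found = 1
        }
        END { exit !found }
    ' "${3:-/proc/mounts}"
}
```

For example, "mount_has_option /san noatime || echo 'atime is still on'" could go in a monitoring script on each node.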
I said before that we are using the Coraid device so the way we handle starting our GFS cluster at boot is by using the /etc/init.d/aoe-init script. Here is a sample version of this script:
#! /bin/sh
# aoe-init - example init script for ATA over Ethernet storage
#
# Edit this script for your purposes. (Changing "eth1" to the
# appropriate interface name, adding commands, etc.) You might
# need to tune the sleep times.
#
# Install this script in /etc/init.d with the other init scripts.
#
# Make it executable:
#   chmod 755 /etc/init.d/aoe-init
#
# Install symlinks for boot time:
#   cd /etc/rc3.d && ln -s ../init.d/aoe-init S99aoe-init
#   cd /etc/rc5.d && ln -s ../init.d/aoe-init S99aoe-init
#
# Install symlinks for shutdown time:
#   cd /etc/rc0.d && ln -s ../init.d/aoe-init K01aoe-init
#   cd /etc/rc1.d && ln -s ../init.d/aoe-init K01aoe-init
#   cd /etc/rc2.d && ln -s ../init.d/aoe-init K01aoe-init
#   cd /etc/rc6.d && ln -s ../init.d/aoe-init K01aoe-init
#
case "$1" in
    "start")
        # load any needed network drivers here
        # replace "eth1" with your aoe network interface
        ifconfig eth1 up
        # time for network interface to come up
        sleep 4
        modprobe aoe
        # time for AoE discovery and udev
        sleep 7
        # add your raid assemble commands here
        # add any LVM commands if needed (e.g. vgchange)
        # add your filesystem mount commands here
        service cman start
        sleep 3
        service clvmd start
        sleep 3
        service gfs start
        sleep 3
        mount -t gfs /dev/san1/lvol0 /san -o noatime
        test -d /var/lock/subsys && touch /var/lock/subsys/aoe-init
        # Bring up http after the filesystem is mounted
        service httpd start
        /usr/bin/memcached -d -m 512 -l 192.168.2.100 -p 11211 -u nobody
        ;;
    "stop")
        # Stop http before the filesystem is unmounted
        service httpd stop
        # add your filesystem umount commands here
        umount /san
        sleep 3
        service gfs stop
        sleep 3
        service clvmd stop
        sleep 3
        service cman stop
        # deactivate LVM volume groups if needed
        # add your raid stop commands here
        rmmod aoe
        rm -f /var/lock/subsys/aoe-init
        ;;
    *)
        echo "usage: `basename $0` {start|stop}" 1>&2
        ;;
esac
I then do the following to make sure that these services are only started through the aoe-init script:
chkconfig gfs off
chkconfig cman off
chkconfig clvmd off
I hope this info was helpful to someone out there. I will edit and add to this post to make it more thorough, I am sure there are small elements I left out.