Sunday, December 30, 2007

CentOS 5 + GFS

In our company, we recently decided that we needed to consolidate all our servers and add capacity. In the past we had 3 distinct "clusters" that all satisfied different needs. It was a bit of a mess, with machines running Fedora Core 5, 6, and 7 and numerous points of failure among them. We were throwing all the resources of our company behind one major project and needed all our server horsepower to work together harmoniously, be easy to administer, and have lots of room to grow.

We quickly chose the CentOS distribution as our weapon of choice because you can't beat the price (it's free), it is built on very solid code (Red Hat Enterprise Linux), and it has a long support lifetime. If I am not mistaken, CentOS 5 will be supported for something like 8 years.

Our next decision was to invest in a SAN of some kind so we could speed up data access while adding redundancy, with room to grow. We eventually purchased the SR1521 device from Coraid (http://www.coraid.com), and let me tell you, we will buy more devices from them again and again. This machine is unbelievable and uses the ATA over Ethernet protocol to move data quickly over a gigabit network. I will make another post about this device in the near future.

All in all we had 12 servers and a new storage network to use, so we immediately began researching cluster file systems. Since CentOS is a Red Hat derivative, we ultimately decided to use GFS, as it is natively supported (that doesn't mean it's easy to set up) and is used in some very large clusters worldwide (which tells us it is production ready). We use GFS to share things like web server directories, various configuration directories, and so on. This makes it incredibly easy for us to add a new server into the fold and have it up and running quickly.

I noticed that there isn't a ton of info on this subject on the net, and the Red Hat documentation was a little confusing, so I will share how we got it working for us.

First things first: when you are installing CentOS 5, be sure to install the Cluster FS option. You can include whatever else you would like, but this package is absolutely necessary. After the install, I immediately do the following:

yum install ntp
chkconfig ntpd on
service ntpd start


It is vital that the machines in your cluster are in sync as far as time is concerned. If they are out of sync, it can cause problems later when more than one machine is accessing the same file at the same time.
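
If you want to confirm that a node is actually syncing, ntpq -p (part of the ntp package) will list the servers the daemon is talking to along with their offsets:
ntpq -p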

The next thing I do is add my GFS mount point folder to the /etc/updatedb.conf file. Basically, this file has a line listing all the folders NOT to include when updatedb runs. Updatedb is a very nice indexing service that allows you to use the "locate" command to search for files and directories on your machine. A very handy tool, but when you have 11 machines banging on every byte of a multi-terabyte SAN at the exact same time, it causes massive problems; in fact our cluster was crashing EVERY morning between the hours of 4 and 7 am. You can take a look at my frustration here:
http://www.centos.org/modules/newbb/viewtopic.php?topic_id=11432&forum=41

The mount point that we use is /san, so I simply added it to the /etc/updatedb.conf file like so:
PRUNEPATHS = "/afs /media /net /sfs /tmp /udev /var/spool/cups /var/spool/squid /var/tmp /san"

When you are starting your cluster with your first machines there are a few files to set up. The first is /etc/lvm/lvm.conf. You don't need to use LVM for a GFS filesystem, but we do.
The only thing I do to this file is change the scan directory. Since I am using AoE with the Coraid device, I simply changed my scan line to look like this:

scan = [ "/dev/etherd" ]


In our cluster the only logical volume we use is on the Coraid device, so I didn't feel like scanning all of /dev every time. You could keep this file at the defaults, or change it like I did to be more specific and hopefully make boot time a little quicker.
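
If you would rather leave the scan line alone, the devices section of lvm.conf also accepts a filter line that accomplishes much the same thing. We don't use this ourselves, so treat the pattern below as a sketch to adapt to your own device names:

filter = [ "a|^/dev/etherd/|", "r|.*|" ]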

The next file we are going to set up is a biggie, and that is /etc/cluster/cluster.conf. Basically, this file is the granddaddy of them all in terms of GFS and tells the cluster who is a member, how the nodes should work together, and so on. Here is a stripped down version of the file we use:


<?xml version="1.0"?>
<cluster config_version="25" name="san1">
    <fence_daemon post_fail_delay="0" post_join_delay="120"/>
    <clusternodes>
        <clusternode name="db1" nodeid="1" votes="1">
            <fence>
                <method name="1">
                    <device name="apc1" port="db1"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="db2" nodeid="2" votes="1">
            <fence>
                <method name="1">
                    <device name="apc1" port="db2"/>
                </method>
            </fence>
        </clusternode>
    </clusternodes>
    <cman/>
    <fencedevices>
        <fencedevice agent="fence_aoemask" name="fence-e1.0" shelf="1" slot="0" interface="eth1"/>
        <fencedevice name="apc1" agent="fence_apc" ipaddr="192.168.2.247" login="somelogin" passwd="xxxxxx"/>
    </fencedevices>
    <rm>
        <failoverdomains/>
    </rm>
</cluster>

There are some important things to note here. You will see that I am using hostnames (db1 and db2 here). You use hostnames to identify the nodes in your cluster, and therefore every server needs to have the same entries for these hostnames in its /etc/hosts file. You may be thinking to yourself, "well, I could just use DNS for that." It is recommended to use the hosts file because it is quicker (no network latency to look up host names) and inherently more reliable, because you don't rely on a few machines to translate names; rather, every machine in the cluster knows exactly where the others are. Whenever I modify my hosts file I simply scp the /etc/hosts file to every machine in the cluster.
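
For example, the /etc/hosts entries for the two nodes above would look something like this (the IP addresses here are made up for the illustration):

192.168.2.10   db1
192.168.2.11   db2

And pushing the file out is just a quick loop (the node list is only an example):

for host in db1 db2; do scp /etc/hosts $host:/etc/hosts; done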

The second item you will notice in the cluster.conf file is the fencing section. Fencing is ultra, ultra important to a GFS cluster. Basically, the cluster needs a way to cut off any node it deems unsafe to the cluster as a whole from the data stored on GFS. This ensures the reliability of your data. The recommended way to do fencing is through a power switch that can be controlled over the network, but you can also do it at the SAN level. The Coraid device has a way to filter by MAC address and we actually used that for a while, but then switched to the power option because it was easier to work with. I will talk about these specific fencing options in a later post.

I am embarrassed to admit it, but we actually ran our cluster with manual fencing for a while because I didn't know about the Coraid MAC option yet and we hadn't purchased APC power strips yet. Let me just say that you can do it, but you will undoubtedly run into a problem like this: with manual fencing, if a node dies unexpectedly or is deemed unsafe, GFS doesn't know how to turn the node off, so it does the next best thing and locks EVERYONE out of the filesystem. NONE of your other machines will be able to read data from the GFS cluster until they ALL are rebooted. Oh, and CentOS 5 specifically won't respond to the reboot command. It will try to reboot, but will hang forever. You have to physically cut the power to the server or press the power button to bring the machines and the cluster back. Needless to say, this is not a good option if your datacenter is miles away and it is 3 in the morning.

Once you have your /etc/cluster/cluster.conf, /etc/lvm/lvm.conf and /etc/hosts files ready to go, you can start your cluster with the following commands:
service cman start
service clvmd start
service gfs start
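At this point it doesn't hurt to confirm that the nodes have actually joined the cluster; cman_tool nodes will show membership, and clustat (part of rgmanager, if you installed it) gives a quick status summary:
cman_tool nodes
clustat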
Then you can mount your logical volume like this:
mount -t gfs /dev/san1/lvol0 /san -o noatime
Another performance booster of note is the "-o noatime" option. atime is a timestamp recording when a file was last accessed. You may in fact need it for your apps, but ours and many others couldn't care less when the last access was. With atime on, you are forcing a small write for every read. If you don't need it, then using noatime will boost the performance of your GFS volumes.
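
One thing I haven't shown above is the one-time creation of the volume group and the GFS filesystem itself, which is done on a single node before anything gets mounted. It would look roughly like this, assuming the Coraid LUN shows up as /dev/etherd/e1.0 and using the cluster name (san1) from cluster.conf; adjust the device, the names, and the journal count (one journal per node that will mount the filesystem) for your own setup:

pvcreate /dev/etherd/e1.0
vgcreate san1 /dev/etherd/e1.0
lvcreate -l 100%FREE -n lvol0 san1
gfs_mkfs -p lock_dlm -t san1:gfs1 -j 12 /dev/san1/lvol0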

I said before that we are using the Coraid device, so the way we handle starting our GFS cluster at boot is with the /etc/init.d/aoe-init script. Here is a sample version of this script:
#! /bin/sh
# aoe-init - example init script for ATA over Ethernet storage
#
# Edit this script for your purposes. (Changing "eth1" to the
# appropriate interface name, adding commands, etc.) You might
# need to tune the sleep times.
#
# Install this script in /etc/init.d with the other init scripts.
#
# Make it executable:
# chmod 755 /etc/init.d/aoe-init
#
# Install symlinks for boot time:
# cd /etc/rc3.d && ln -s ../init.d/aoe-init S99aoe-init
# cd /etc/rc5.d && ln -s ../init.d/aoe-init S99aoe-init
#
# Install symlinks for shutdown time:
# cd /etc/rc0.d && ln -s ../init.d/aoe-init K01aoe-init
# cd /etc/rc1.d && ln -s ../init.d/aoe-init K01aoe-init
# cd /etc/rc2.d && ln -s ../init.d/aoe-init K01aoe-init
# cd /etc/rc6.d && ln -s ../init.d/aoe-init K01aoe-init
#

case "$1" in
    "start")
        # load any needed network drivers here

        # replace "eth1" with your aoe network interface
        ifconfig eth1 up

        # time for network interface to come up
        sleep 4

        modprobe aoe

        # time for AoE discovery and udev
        sleep 7

        # add your raid assemble commands here
        # add any LVM commands if needed (e.g. vgchange)
        # add your filesystem mount commands here
        service cman start
        sleep 3
        service clvmd start
        sleep 3
        service gfs start
        sleep 3
        mount -t gfs /dev/san1/lvol0 /san -o noatime

        test -d /var/lock/subsys && touch /var/lock/subsys/aoe-init

        # Bring up http after the filesystem is mounted
        service httpd start
        /usr/bin/memcached -d -m 512 -l 192.168.2.100 -p 11211 -u nobody
        ;;
    "stop")
        # Stop http before the filesystem is unmounted
        service httpd stop

        # add your filesystem umount commands here
        umount /san

        sleep 3
        service gfs stop
        sleep 3
        service clvmd stop
        sleep 3
        service cman stop

        # deactivate LVM volume groups if needed
        # add your raid stop commands here
        rmmod aoe
        rm -f /var/lock/subsys/aoe-init
        ;;
    *)
        echo "usage: `basename $0` {start|stop}" 1>&2
        ;;
esac

I then do the following to make sure that these services are only started through the aoe-init script:
chkconfig gfs off
chkconfig cman off
chkconfig clvmd off
I hope this info was helpful to someone out there. I will edit and add to this post to make it more thorough; I am sure there are small elements I left out.

14 comments:

Anonymous said...

Great post!!

We're currently using a Coraid device and experiencing issues with NFS across multiple app servers. I may begin the move to a GFS-type setup such as this.

Anonymous said...

God bless you, something clear and to the point. Will be trying this tomorrow on Xen Enterprise 4.1 beta (CentOS 5.1) connected to a 1520 .... nothing half this clear anywhere out there, yet some of the pieces are quite familiar.

Thanks for the war story.

Anonymous said...

Michael - Love the post! Would you mind contacting me via email? I had a few questions for you regarding coraid and GFS.

chrisf (at) lecomputer.com

Thanks!
- Chris

CIDR said...

Excellent post!!

I'm a student from Mexico and I tried to configure GFS on some servers at UNAM. I read your post and you have given me a clue to succeed in the project. Thanks!!

Anonymous said...

Thanks for the post, very useful stuff. You are right about the Red Hat documentation - it's incomplete in places ... so I've had to piece together the rest from detailed blogs like yours!

DreadStar said...

Very handy!!

I had a hard time figuring out which packages I had to install for GFS though, as that's missing in most documentation, which probably assumes you perform a full install.

It seems you need:
yum -y install ntp
yum -y groupinstall Clustering
yum -y install gfs-kmod
yum -y install gfs-utils

DreadStar said...

Does anyone know if the GFS partition has to be the same partition on all servers, or if one GFS cluster node can run directly on our NAS and the rest of the nodes use a partition on their local FS?

Matt said...

I've got to echo the sentiments of my fellow commenters...excellent post.

You helped me get my FS cluster up. Now I just need to decipher how to set up resources and I'll be on my way :-)

Matt said...

@dreadstar

I think, by the nature of GFS, it has to be the same device.

Typically the way it's set up is from a SAN, either Fibre Channel or iSCSI.

If you've got multiple storage media, it sounds like you want some other sort of replication rather than GFS.

What are you trying to do?

KRISHNA KUMAR said...

hi guys,

Quite cool blog.... I am also trying to set up AoE + RHCS... Since I am new to RHCS, I have some queries:

1) Is it necessary to have the ccs and cman packages installed on your system? I have cman installed on my CentOS 5, but I am not able to install ccs.

2) If I take the ccs package from CentOS 5, then I am not able to install cman.

KRISHNA KUMAR said...

The CCS package is not necessary if CentOS 5 is used; it comes with cman. Thanks.

HarusHarris said...

nice posts Michael!

Anonymous said...

Are you still using GFS? Would you recommend it?

Anonymous said...

Thank you a lot! GREAT POST!!!