What about a “Cluster Filesystem”?
May 28th, 2008
I have been striving around the open source scene for a cluster filesystem that fits to my needs for some time now. I found out about Redhat’s GFS, Apache’s Hadoop that seems to be very similar to Google’s Filesystem and I read a lot about GlusterFS that I did not attach much value to till today because there are no ready-made packages available and I suspected the term “GlusterFS” and the strange logo designs on it’s website a bit. I also struggled with NFS for some time and had a look at MogileFS.
However, it turns out that GluterFS is exactly what I need. Some thoughts about the different filesystems I read about:
Hadoop, which is written in Java, is made for very large files inside heavy computing tasks. It is not made for small files and development of a FUSE driver does not seem to be high priority. It also comes with a grid computing engine and is, as said, created for processing large amounts of data (e.g. big search engines, large datastores that do not change for long times i.e. videos) instead of small shared datastores. Therefore it is not an option.
I used NFS not long ago (for about a year) to distribute a single data store to all of our servers but it stands out in bad performance as far as I can tell. There are so many absurdities inside NFS that make it imho practically unusable in production environments. Our servers are located inside different datacenters that are hosted by the provider of our choice, but everytime when it came to minimal data loss in between the datacenters, NFS ran amok even when configured with TCP/IP. Another point was, that I was not able to load balance the NFS server on different machines because I did not want to use drdb on the partitions supplied by the hoster’s default debian installation to keep the possibility to be supported on heavy failures. NFS works to some extend but used with more than 3 servers it starts to mutate into a performance bottleneck and general headaces.
I am currently just using rsync to distribute the webserver root folder to all nodes. This may be the simplest way but it’s not meant to be a solution for the long term because update rate is bad and it gets even worse with every node added to the cluster. There is also no possibility to modify file contents or to add/remove files from somewhere else than the server - the nodes must operate read-only.
GFS turns out to be more a package of utilities to handle requests of different servers to a commonly shared storage. Because I need to build the infrastructure for our servers on top of default-configured cheap linux boxes, this is not an option, yet.
MogileFS is a filesystem living in userspace that is able to balance files between different storage nodes (e.g. 3 copies per file). Unfortunately it is written in Perl (does this perform?) and FUSE support seems to be an initially unintended byproduct in a very early stage. MogileFS is divided into differnt types of modules that hold the namespace inside a relational database (trackers) and the storage nodes. I am not convinced that MogileFS will scale and will be easy to use when used as a general filesystem with FUSE, but the idea behind MogileFS as an automatically balancing storage network is absolutely great.
GlusterFS (GNU Cluster Filesystem) is completely distributed with no single point of failure and as easy to maintain like NFS but has some major benefits when it comes to webserver clusters. It is possible to setup some boxes as the underlying datastore servers with automatic file replication done on client or server side. There is no special filesystem required to run it because it just sits on the already existing (ext3 in my case) partitions and exports directories in a similar manner as NFS. It also allows adding clients to the cluster without the need to copy around configuration files because the servers are able to submit the required configuration to connecting clients. However there are no official debian packages available yet to automate the installation tasks and you will need a patched (and newer) FUSE kernel module (at least on debian etch) to get it to work and to support distributed flock() calls (it will not configure the fuse client module against the default 2.5.bla FUSE in etch) [Update: Found some debian packages released by the GlusterFS guys on their homepage - may be worth a try; Update: These packages seem to be too old, but the maintainer has been informed - use the source instead: rpmbuild it (rpmbuild -bb glusterfs.spec) and convert it to deb (alien glusterfs-1.3.9.rpm, dpkg -i glusterfs_1.3.9.deb) but note that you will also need a recent FUSE]. There are also possibilites to automatically replicate files of specific types to big clusters (e.g. all *.jpg files replicated to at least 3 datastores with the unify translator). However, I’m quite new to it and ran it inside virtual boxes only yet but as far as I can tell it will scale easily by nature and supply a good infrastructure against data loss. I really wonder why GlusterFS has not got more attention, yet.
There are other clustered filesystems available for such clusters, too, but most of them seem to be outdated or are still in an early stage. If you know about another filesystem for my needs, feel free to comment and I’ll give it a look. So far I will try out GlusterFS in a real environment very soon if I do not find something better.
(P.S: Yes, technically versed people will say that a cluster filesystem like I’m describing does not ship with a functional data store, but I will use this term because this is what most people expect from it, including me.)
May 28th, 2008 at 19:42
> “I really wonder why GlusterFS has not got more attention, yet.”
GlusterFS is very new project compared to any clustering filesystem. Only recently it got a stable release. Hence its not yet very popular. Feel free to write to its mailing list if you have any issues.
May 28th, 2008 at 20:57
Good article.. We constantly look for feedback from the community to improve. Like Amar pointed out, we are relatively new. We got frustrated with other cluster file systems and eventually ended up writing a new one.
Particularly for web requirements, next release (currently in QA) will have web embeddable feature (Apache / Lighttpd GlusterFS client to bypass kernel), Distributed BerkeleyDB backend (efficient for very small files storage), Binary Protocol (significant improvement for small files I/O) and Non-blocking I/O communication. You will appreciate when 1.4 gets ready (2 months from now).
Happy Hacking,
Anand Babu
May 28th, 2008 at 21:44
Thanks, I’m really excited of the planned lighttpd integration.
However I am still wondering if the self-healing feature, which is imho a key feature of glusterfs but does not work well currently as far as I can tell, will be improved. A script to run on returning nodes to check for consistency of a given underlying AFR datastore, that also replicates files created in the meantime, will suffice – but without modifying file mtimes. I know this is not an easy task and may not be possible at all, but to solve this, maybe the next paragraph may be worth some thoughts:
I also got an idea on how to combine AFR and UNIFY into a single easier to use translator (maybe I will also post it to the list someday): Both AFR and UNIFY features could be easily joined with some sort of “BALANCE” translater that specifies within a switch statement similar to the one in unify, how many times a file will be saved on the underlying datastores. Let’s assume following immaginary configuration:
Each server exports a datastore and a namespace is present somewhere. Clients then are configured to use them as following:
volume balance
type cluster/balance
subvolumes www1, www2, www3, www4, www5
option switch *:3
option scheduler rr
option namespace www-ns
end-volume
this meaning, that files of every type will be saved in round-robin to three of the five datastores for fault-tolerance. Let’s assume one server node crashes. Then is is recovered by deleting all files on the crashed node and restarting it. Afterwards a script on the namespace is executed that will check all datastores for consistency (all files have at least 3 copies). If a file stands out, it will be automatically replicated to the remaining amount of datastores again to assure consistency. Adding and removing servers would be very easy. It also could work like AFR when used with “option switch *:N”, like striping with “option switch *:1″ or like all sorts of raid. This approach would be similar to MogileFS functionality and would be much easier than mixing AFR and Unify including script-based healing that could also be automated.
regards
Daniel
August 13th, 2008 at 12:08
I found your site on technorati and read a few of your other posts. Keep up the good work. I just added your RSS feed to my Google News Reader. Looking forward to reading more from you down the road!