Chapter 7. Highly Available File Server

Table of Contents

Introduction
NFS v4
Architecture
Installation
Configuration
Disaster Recovery Setup
Architectures
Simple replication
DRBD and Heartbeat
Resources

Introduction

For some applications or services, you will need a highly available file server. Although clustering file systems exist (like OCFS and GFS), in this chapter we focus on providing a highly available, secure file server based on NFS v4.

Users interested in providing network file systems across a heterogeneous network of Unix, Linux and Windows systems might want to read about Samba instead.

NFS v4

NFS stands for Network File System and is a well-known distributed file system in the Unix/Linux world. NFS and its versions (NFSv2, NFSv3 and NFSv4) are open standards, defined in various RFCs. NFS provides an easy, integrated way of sharing files between Unix/Linux systems: systems that have access to NFS mounts see them as part of the (local) file system tree. In other words, /home can very well be an NFS-provided, remote mount rather than a locally mounted file system.
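As a quick example, mounting such a remote /home over NFSv4 is a single command (fileserver being a placeholder for the NFS server's host name):

# mount -t nfs4 fileserver:/home /home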

Architecture

NFS at first might seem a bit difficult to master. In the past (NFSv3 and earlier), this might have been true, but with NFSv4, its architecture has become greatly simplified, especially when you use TCP rather than UDP.

Figure 7.1. NFSv3 versus NFSv4

When you use NFSv3, the following services need to run next to the NFS daemon (nfsd):

  • rpcbind, which is responsible for dynamically assigning ports to RPC services. The NFS server tells the rpcbind daemon on which address it is listening and which RPC program numbers it is prepared to serve. A client connects to the rpcbind daemon and informs it about the RPC program number it wants to connect to. The rpcbind daemon then replies with the address that NFS listens on.

    The rpcbind daemon is also responsible for handling access control for the service "rpcbind", through the TCP wrappers interface (using the /etc/hosts.allow and /etc/hosts.deny files - see the example after this list). Note though that this access control is then applied to all RPC-enabled services, not only NFS.

  • rpc.mountd implements the NFS mount protocol and exposes itself as an RPC service as well. A client issues a MOUNT request to the rpc.mountd daemon, which checks its list of exported file systems and access controls to see if the client can mount the requested directory. Access controls can be defined both within the list of exported file systems and through rpc.mountd's TCP wrappers support (for the service "mountd").

  • rpc.lockd supports file locking on NFS, so that concurrent access to the same resource can be properly coordinated.

  • rpc.statd interacts with the rpc.lockd daemon and provides crash detection and recovery for the locking services.
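As an example of the TCP wrappers support mentioned above, the following entries allow the RPC services only from a single subnet (192.168.100.0/24 is of course just an example) while denying everything else:

# vim /etc/hosts.allow
rpcbind: 192.168.100.0/255.255.255.0
mountd: 192.168.100.0/255.255.255.0

# vim /etc/hosts.deny
rpcbind: ALL
mountd: ALL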

With NFSv4, the mount and locking services have been integrated in the NFS daemon itself. The rpc.mountd daemon is still needed to handle the exports, but is not involved in network communication anymore (in other words, the client connects directly to the NFS daemon).

Although the high-level architecture is greatly simplified (especially for the NFS client, since all access now goes through the NFS daemon), other daemons are introduced to further enhance the functionality offered by NFSv4.

Important

The Linux-NFS software we use in this architecture still requires the NFSv3 daemons to run on the server even when running NFSv4 only.

ID Mapping

The first daemon is rpc.idmapd, which is responsible for translating user and group IDs to names and vice-versa. It runs on both the server and client so that communication between the two can use user names (rather than possibly mismatching user IDs).

Kerberos support

The second daemon (well, actually a set of daemons) is rpc.gssd (client) and rpc.svcgssd (server). These daemons are responsible for handling the Kerberos authentication mechanism for NFS.

Installation

On both the server and the clients, you will need to install the necessary supporting tools (and perhaps adjust kernel settings).

Server-side installation

On the server, first make sure that the following kernel configuration settings are at least met:

CONFIG_NFS_FS=y
CONFIG_NFSD_V4=y
CONFIG_NFSD_V4_ACL=y

Next, install the NFS server (provided through the nfs-utils package), making sure that at least USE="ipv6 nfsv4 nfsv41" is set:

# emerge nfs-utils
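If these USE flags are not set globally yet, you can declare them for nfs-utils alone. A minimal sketch, assuming /etc/portage/package.use is a file (it can also be a directory holding such files):

# echo "net-fs/nfs-utils ipv6 nfsv4 nfsv41" >> /etc/portage/package.use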

Edit /etc/conf.d/nfs and set the OPTS_RPC_MOUNTD and OPTS_RPC_NFSD parameters to enable NFSv4 and disable NFSv3 and NFSv2 support:

OPTS_RPC_MOUNTD="-V 4 -N 3 -N 2"
OPTS_RPC_NFSD="-N 2 -N 3"

I had to create the following directories as well, since SELinux wouldn't allow the init script to create them for me:

# mkdir /var/lib/nfs/{v4root,v4recovery,rpc_pipefs}

Edit /etc/idmapd.conf and set the domain name (this setting should be the same on client and server).

# vim /etc/idmapd.conf
[General]
Domain = internal.genfic.com

You can then start the NFS service.

# run_init rc-service nfs start
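Also add the service to the default runlevel so that it is started upon boot:

# rc-update add nfs default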

Client-side installation

On the clients, make sure that the following kernel configuration settings are at least met:

CONFIG_NFS_FS=y
CONFIG_NFS_V4=y
CONFIG_NFS_V4_ACL=y

Next, install the NFS utilities, making sure that at least USE="ipv6 nfsv4 nfsv41" is set:

# emerge nfs-utils

Edit /etc/idmapd.conf and set the domain name (this setting should be the same on client and server).

# vim /etc/idmapd.conf
[General]
Domain = internal.genfic.com

Finally, start the rpc.statd and rpc.idmapd daemons:

# run_init rc-service rpc.statd start
# run_init rc-service rpc.idmapd start

Configuration

Most NFS configuration is handled through the /etc/exports file.

Shared file systems in /etc/exports

To share file systems (or file structures) through NFS, you need to edit the /etc/exports file. Below is a first example for an exported packages directory:

/srv/nfs                        2001:db8:81:e2:0:26b5:365b:5072/48(rw,fsid=0,no_subtree_check,no_root_squash,sync)
/srv/nfs/gentoo/packages        2001:db8:81:e2:0:26b5:365b:5072/48(rw,sync,no_subtree_check,no_root_squash)

The first line contains the fsid=0 parameter. This parameter is important, as it tells NFSv4 what the root is of the file systems it exports. In our case, that is /srv/nfs. A client then mounts file systems relative to this root location (so it mounts gentoo/packages, not srv/nfs/gentoo/packages).
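For example, a client mounts the packages export relative to that root (the server name and mount point below are placeholders):

# mount -t nfs4 nfs.internal.genfic.com:/gentoo/packages /usr/portage/packages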

If you run into problems with NFS mounts, make sure to double/triple/quadruple-check the exports file. NFS is not good at logging things, but I have found that I always mess up my exports rather than misconfigure the daemons.
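Also, after each change to /etc/exports, reload the export table and verify what is effectively being exported:

# exportfs -ra
# exportfs -v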

Disaster Recovery Setup

For our NFS server, I will first discuss a high availability setup based on the virtualization setup, but with an additional layer in case of a disaster: replication towards a remote location. In some organizations, disaster recovery requirements might be so strict that the organization doesn't want any data loss. For the NFS service, I assume that a (small) amount of data loss is acceptable, so that we can use asynchronous replication techniques. This provides us with a simple, manageable architecture that is less sensitive to distance (synchronous replication has an impact on performance) and to the number of participating nodes (synchronous replication is often only possible between two nodes).

Next, an alternative architecture is suggested in which high availability is handled differently (and which can be used for disaster recovery as well), giving you enough ideas for the setup of your own highly available NFS architecture.

Architectures

Simple setup

In the simple setup, the highly available NFS within one location is regularly synchronised with a highly available NFS at a different location. The term "regularly" can be interpreted as frequently as you want, be it time-based (every hour, run rsync to synchronize changes from one system to another) or change-based - the idea is that, after a while, the secondary system is synchronized with the primary one.

The setup of a time-based schedule is simply a matter of running the following command regularly (assuming the /srv/data location is where the NFS-exported files reside; the trailing slashes make rsync synchronize the directory contents rather than create a second data directory inside the target):

rsync -auq /srv/data/ remote-system:/srv/data/
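A minimal sketch of such a schedule, as a root crontab entry that synchronizes every hour:

# crontab -e
0 * * * *   rsync -auq /srv/data/ remote-system:/srv/data/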

Using a more change-based approach requires a bit more scripting, but nothing unmanageable. It basically uses the inotifywait tool, which notifies the script when changes have been made to the files, triggering rsync only when needed. For low-volume systems this works pretty well, but larger deployments will find the overhead of the many rsync invocations too high.

Dedicated NFS with DRBD and Heartbeat

Another setup for a highly available NFS is the popular NFS/DRBD/Heartbeat combination. This is an interesting architecture to look at if the virtualization platform used has too much impact on the performance of the NFS system (you have the overhead of the virtualization layer, the logical volume layer, DRBD, and again a logical volume layer). In this case, DRBD is used to synchronize the storage, while Heartbeat is used to ensure that the proper "node" in the setup is up and running (and if one node is unreachable, the other one can take over).

Figure 7.2. Alternative HA setup using DRBD and Heartbeat

Simple replication

Using rsync, we can introduce simple replication between hosts. However, there is a bit more to look at than just running rsync...

Tuning file system performance - dir_index

First of all, you definitely want to tune your file system performance. Running rsync against a location with several thousands of files means that rsync will need to stat() each and every file. As a result, you will see peak I/O loads while rsync is preparing its replication, so it makes sense to optimize the file system for this. One of the optimizations you might want to consider is to use dir_index enabled file systems.

When the file system (ext3 or ext4) uses dir_index (ext4 uses this by default), lookup operations on the file system use hashed B-trees to speed up the operation. This algorithm has a huge influence on the time needed to list directory contents. Let's take a quick (overly simplified, and thus also incorrect) look at the approach.

Figure 7.3. HTree in a simple example

In the above drawing, the left side shows a regular, linear structure: the top block represents the directory, which holds the references (numbers) to the files. The files themselves are stored in the blocks with the capital letters (which include metadata of the file, such as the modification time - which is used by rsync to find the files it needs to replicate). The small letters contain some information about the file (for instance, the file name). On the right side, you see a possible example of an HTree (hashed B-tree).

So how does this work? Let's assume you need to traverse all files (which is what rsync does) and check their modification times. How many "steps" would this take?

  • In the left example (ext2), the code first gets the overview of file identifications. It starts at the beginning of the top block and finds one reference to a file (1). Then it looks up where the information about the second file is stored. Because the name of a file ("a" in this case) is not of a fixed width, the reference to the file (1) also contains the offset where the next file entry is located. So the code jumps to the second file (2), and so on. At the end, the code returns "1, 2, 5, 7, 8, 12, 19 and 21" and has done 7 jumps to do so (from 1 to 2, from 2 to 5, ...).

    Next, the code wants to get the modification time of each of those files. For the first file (1) it looks at its address and goes to the file. Then it wants to do the same for the second file, but since it only has the number (2) and not the reference to the file, it repeats the process: read (1), jump, read (2) and look at the file. To get the information about the last file, it does a sequence similar to "read 1, jump, read 2, jump, read 5, jump, ..., read 21, look at file". In total, this traversal took 36 jumps, bringing the overall count to 43 jumps.

  • In the right example (using HTrees), the code also gets the references first. The list of references (similar to the numbers in the left example) uses fixed-width structures, so within a node no jumps are needed. Starting from the top node [c;f], it takes 3 jumps (one to [a;b], one to [d;e] and one to [g;h]) to get an overview of all references.

    Next, the code again wants to get the modification time of each of those files. For the first file (A) it reads the top node, finds that "a" is less than "c" and goes one node lower, where it finds the reference to A. So for file A it took 2 jumps. The same holds for file B. File C is referenced in the top node, so it only takes one jump, and so on. In total, this traversal took 14 jumps, bringing the overall count to 17 jumps.

So even for this very simple example (with just 8 files) the difference is 43 jumps (random-ish reads on the disk) versus 17 jumps. In algorithmic terms, this is O(n^2) versus O(n*log(n)): the "order" of the first example grows with n^2, whereas the second one grows with n*log(n). "Order" here means that the impact (the number of jumps in our case) goes up in the same way as the given formula - it doesn't give the exact number of jumps. For very large sets of files, this makes a huge difference (100^2 = 10'000 whereas 100*log(100) = 200, and 1000^2 = 1'000'000 whereas 1000*log(1000) = 3'000).

Hence the speed difference.

If you haven't set dir_index while creating the file systems, you can use tune2fs to set it afterwards (the final fsck -Df run rebuilds the indexes of the existing directories):

# tune2fs -O dir_index /dev/vda1
# umount /dev/vda1
# fsck -Df /dev/vda1
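To verify that the feature is now enabled, list the file system features and look for dir_index in the output:

# tune2fs -l /dev/vda1 | grep features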

Tuning rsync

Next, take a look at how you want to use rsync. If the replication runs over a high-speed network, you probably want to disable the rsync delta algorithm (where rsync looks for the changes within a file rather than replicating the entire modified file), especially if the files are relatively small:

# rsync -auq --whole-file <source> <targethost>:<targetlocation>

If the data is easily compressible, you might want to enable the compression in rsync:

# rsync -auq --compress <source> <targethost>:<targetlocation>

If you trust the network, you might want to use the standard rsync protocol rather than SSH. This skips the SSH handshake and does not encrypt the data transfer, so it has a better transfer rate (but it does require the rsync daemon to run on the target if you push, or on the source if you pull):

# rsync -auq <source> rsync://<targethost>/<targetlocation>
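Note that in this mode <targetlocation> is an rsync module rather than a plain path, so the rsync daemon on the target needs a module definition. A minimal sketch of /etc/rsyncd.conf, with data as a hypothetical module name exposing /srv/data:

uid = root
gid = root

[data]
  path = /srv/data
  read only = no

With this in place, the push from the earlier examples becomes rsync -auq /srv/data/ rsync://<targethost>/data.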

Rsync for just the changed files

If the load on the system is low (but the number of files is very high), you can use inotifywait to get a live feed of the changes as they occur. For each change, rsync is then invoked to copy the file (or its deltas) to the remote system (the snippet below assumes that the source and target paths are identical):

inotifywait -e close_write -r -m --format '%w%f' <source> | while read -r file
do
  rsync -auq "$file" <targethost>:"$file"
done

This is, however, a bit hackish and most likely not sufficiently scalable for larger workloads. I mention it here, though, to show the flexibility of Linux.

DRBD and Heartbeat

If you require the second approach (DRBD and Heartbeat) however, a bit more work (and, most importantly, knowledge) is needed.

Installation

First, install drbd, drbd-kernel and heartbeat.

# emerge drbd drbd-kernel heartbeat

So far for the easy stuff ;-)

Configuring DRBD

Now let's first configure the DRBD replication. In /etc/drbd.d/nfs-disk.res (on both systems - you can pick your own file name, but keep the .res extension), configure the replication as follows:

global {
  usage-count yes;
}

common {
  syncer { rate 100M; }
}

resource nfs-disk {
  protocol A;
  handlers {
    pri-on-incon-degr "echo o > /proc/sysrq-trigger ; halt -f";
    pri-lost-after-sb "echo o > /proc/sysrq-trigger ; halt -f";
    local-io-error "echo o > /proc/sysrq-trigger ; halt -f";
  }

  startup {
    degr-wfc-timeout 120;
  }

  disk {
    on-io-error   detach;
  }

  net {
  }

  syncer {
    rate 100M;
    al-extents 257;
  }

  on nfs_p {
    device /dev/drbd0;
    disk /dev/sdb1;
    address nfs_p:7788;
    meta-disk /dev/sdc1[0];
  }

  on nfs_s {
    device /dev/drbd0;
    disk /dev/sdb1;
    address nfs_s:7788;
    meta-disk /dev/sdc1[0];
  }
}

What we do here is define the replication resource (nfs-disk) and use protocol A (which, in DRBD terms, means asynchronous replication - use C if you want synchronous replication). On the hosts (with hostnames nfs_p for the primary and nfs_s for the secondary), the /dev/sdb1 partition is managed by DRBD, and /dev/sdc1 is used to store the DRBD meta-data.

Next, create the managed device on both systems:

# drbdadm create-md nfs-disk
# drbdadm up all

You can see the state of the DRBD setup in /proc/drbd. At this point, both systems should show their resource in Secondary mode, so enable it on one side:

# drbdsetup /dev/drbd0 primary -o

Now the drbd0 device is ready to be used (so you can now create a file system on it and mount it to your system).
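For example, to prepare the device in line with the Heartbeat resource definition used later in this chapter (an ext3 file system mounted on /data):

# mkfs.ext3 /dev/drbd0
# mkdir /data
# mount /dev/drbd0 /data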

Configuring Heartbeat

Next, configure Heartbeat. Three files need some attention: the first one is /etc/ha.d/ha.cf, the second one is an authorization key used to authenticate and authorize requests (so that a third party cannot influence the heartbeat), and the last one identifies the resource(s) themselves.

So first the ha.cf file:

logfacility local0
keepalive 1
deadtime 10
bcast eth1
auto_failback on
node nfs_p nfs_s

In /etc/heartbeat/authkeys, put:

auth 3
3 sha1 TopS3cr1tPassword
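Heartbeat refuses to use the authkeys file when it is readable by other users, so restrict its permissions:

# chmod 600 /etc/heartbeat/authkeys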

In /etc/ha.d/haresources put:

nfs_p IPaddr::nfs_p
nfs_p drbddisk::nfs-disk Filesystem::/dev/drbd0::/data::ext3 nfs

On the secondary server, use the same content, but with nfs_s. This ensures that, once the secondary server has become the primary (after a fail-over occurred), it remains so, even when the primary system comes back up. If you would rather have a switch-over when the primary comes back, use the same file content on both systems.

Start the services

Start the installed services to get the heartbeat up and running.

# rc-update add drbd default
# rc-service drbd start
# rc-update add heartbeat default
# rc-service heartbeat start

Managing DRBD

When you have DRBD running, the first thing you need to do is to monitor its status. I will cover monitoring solutions later, so let's first look at this the "old-school" way.

Run drbd-overview to get a view on the replication state of the system:

# drbd-overview
0:nfs-disk             Connected Primary/Secondary
  UpToDate/UpToDate A r--- /data ext3 150G  97G  52G  65%

Another file you probably want to keep an eye on is /proc/drbd:

# cat /proc/drbd
 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate A r-----
    ns:0 nr:0 dw:0 dr:656 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:b oos:0

This particular example can be read as "Resource 0 has its connection state (cs) set as Connected. The role (ro) of the disk is Primary, its partner disk has the role Secondary. The disk states (ds) are both UpToDate and the synchronisation protocol is A (asynchronous)."

The "r-----" shows the I/O flags of the resource (more details in the DRBD user guide) whereas the next line gives a general overview of the various performance indicators.

If you need to manually bring down (or up) one or more resources, use the drbdadm command:

# drbdadm down nfs-disk
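Bringing it back up again is the mirror operation:

# drbdadm up nfs-disk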

Split brain

A split brain is something you definitely want to avoid, but need to train for nonetheless. A split brain occurs when both sides of the synchronisation believe they are the primary disk (i.e. the other side has failed and, while the primary disk remained active, the secondary disk invoked a fail-over and became active as well). DRBD detects a split brain itself the moment both resources can "see" each other again. When this occurs, the replication is immediately stopped; one disk will be in the StandAlone connection state, while the other will be either in StandAlone or in WFConnection (WF = Waiting For).

Although you can configure DRBD to automatically resolve a split brain, it is important to understand that this always involves sacrificing one of the nodes: changes made on the node that gets sacrificed will be lost. In our NFS case, you will need to check where Heartbeat had the NFS service running. If the NFS service remained running on the primary node (nfs_p), then sacrifice the secondary one.

$ ssh nfs_p cat /proc/drbd
0: cs:StandAlone st:Primary/Secondary ds:UpToDate/Unknown A r---
$ ssh nfs_s cat /proc/drbd
0: cs:WFConnection st:Primary/Secondary ds:UpToDate/Unknown A r---

Assuming that Heartbeat left the NFS service running on the primary node, we can now sacrifice the secondary one and reset the synchronisation.

$ ssh nfs_s sudo rc-service heartbeat stop
* Stopping heartbeat...     [ ok ]
$ ssh nfs_p sudo rc-service heartbeat stop
* Stopping heartbeat...     [ ok ]

Now log on to the secondary node (nfs_s), explicitly mark it as secondary, reset its state and reconnect:

# drbdadm secondary nfs-disk
# drbdadm disconnect nfs-disk
# drbdadm -- --discard-my-data connect nfs-disk

Now log on to the primary node (nfs_p) and connect the resource again.

# drbdadm connect nfs-disk

Restart the Heartbeat service on both nodes and you should be all set again.

Resources

NFS - the Linux NFS project (http://nfs.sourceforge.net)

DRBD - the DRBD project (http://www.drbd.org)