Tiny ceph

Submitted by ajlill on Fri, 07/29/2016 - 14:58


Ceph was designed to provide reliable high-performance storage at the petabyte scale, and a lot of the tools and documentation reflect that bias. I don't have petabytes of data, but I do have terabytes spread across a couple of dozen filesystems on three physical hosts, along with a dozen VMs that I'd like to be able to migrate without manually copying disk images. I decided to give Ceph a try.

For the rest of this article, I'll assume that you've already read up on Ceph architecture and have at least three boxes to set up your cluster. I'll also be using the default Ceph configuration unless noted. Some things are not clearly spelled out in the docs, so here they are. ceph-deploy can be installed on any system; it's a python script that sshes to the boxes in your cluster and runs the actual ceph commands using sudo. To use it you need to pick a host, known as the admin host in the docs, and make sure that you can ssh from it to your cluster nodes as a non-root user without a password and run sudo, also without a password.
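Concretely, that prerequisite boils down to two small files; "cephuser" and the node names here are placeholders for your own:

```
# /etc/sudoers.d/cephuser on every cluster node (chmod 0440):
cephuser ALL = (root) NOPASSWD:ALL

# ~/.ssh/config on the admin host, after ssh-copy-id'ing your key to each node:
Host node1 node2 node3
    User cephuser
```

With those in place, ceph-deploy can reach every node non-interactively.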

This setup goes against all best practices: Ceph is designed for massive deployments, with monitors, OSDs, MDSs, and clients all on separate hardware. Apparently using the rbd kernel driver can interfere with OSDs on the same box. Also, data is not distributed evenly, so your cluster can become "full" at 60%-70% utilization.

Installing Ceph

First off, install the software from the repositories on download.ceph.com, if possible. They have debs and rpms for every version of Ceph on every version of Debian and RedHat/CentOS that they can build on. Compiling it yourself is a pain; I tried. For me, the latest version that will build on wheezy is hammer (0.94). Later versions use C++11 features, so they need gcc version 4.9 or higher.

I'm building a 3-node cluster, so I installed the ceph, ceph-mds, and radosgw packages on each of my physical boxes. I originally tried to use ceph-deploy install, but it defaults to installing jewel, which doesn't compile on wheezy. I later stumbled across the command-line option to make ceph-deploy install the version of your choice, --release name, but for 3 nodes, why bother.

Next step was bootstrapping the monitors: ceph-deploy new node1 node2 node3. If it doesn't work, you'll have to remove the contents of /etc/ceph and /var/lib/ceph/mon on each of the nodes before trying again. If you are not using jewel or later, this will fail, because the details of how permissions need to be created changed. To fix this, edit the file gatherkeys.py (it will be under /usr/lib/python* somewhere) and on line 78 change 'allow *' to 'allow'. If you didn't catch it before running ceph-deploy new, you don't have to start over; just run ceph-deploy gatherkeys. You should now have a working set of monitors.
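That edit can be done as a one-liner; the path below is my guess for a Debian-style install, so point PYFILE at wherever your distro actually put ceph_deploy (and grep for 'allow *' first, since this sed touches every occurrence in the file):

```shell
# Apply the gatherkeys.py permission fix described above.
PYFILE=/usr/lib/python2.7/dist-packages/ceph_deploy/gatherkeys.py
if [ -f "$PYFILE" ]; then
    # Replace 'allow *' with 'allow' so pre-jewel mons accept the caps.
    sed -i "s/allow \*/allow/" "$PYFILE"
fi
```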

ceph-mon should now be running on all your nodes, and ceph status should show you a running cluster with 3 monitors. It will say HEALTH_ERR until you create some OSDs.

The next step is to add OSDs to your cluster. Unless you are giving Ceph an entire disk, e.g. /dev/sdb, or just using a directory for the OSD, do not use ceph-deploy to set up the disk. If you give ceph-deploy a block device like a logical volume or an existing partition, it will try to partition it; that will work, but ceph-deploy won't be able to find the resulting partitions, and it will fail.

Before you create the OSD, you need to consider the journal. It is used to boost performance for small writes; it needs to be big enough to hold 10 seconds of I/O, and at least 128M. The default size is 512M, which was nearly 10x more than I needed for my I/O. If you have some SSD space, you obviously want to put the journals there. Also, since the journal is just a single file that doesn't change size, you don't need a filesystem getting in the way, so use a logical volume or partition.
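The sizing rule is simple enough to sanity-check in shell; the 100 MB/s figure below is just an example throughput, not a measurement:

```shell
# Journal must absorb ~10 seconds of I/O: size_mb = throughput * seconds.
throughput_mb=100   # example: peak MB/s through the OSD (reads + writes)
seconds=10
journal_mb=$(( throughput_mb * seconds ))
echo "journal: ${journal_mb}M"   # -> journal: 1000M, i.e. about 1G
```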

The second thing to consider is the number of replicas. The default pool 'rbd' is set to size 3 and min_size 2, meaning 3 copies of each bit of data, and it will refuse to do I/O if fewer than 2 copies are available. Since I don't want to take a 50% space hit moving from RAID1, I'll turn those down to 2 and 1 using the command ceph osd pool set {poolname} size {num-replicas}. Of course, this is not recommended. You can also set the default values in /etc/ceph/ceph.conf using

osd pool default size = 2  # Write an object 2 times.
osd pool default min size = 1 # Allow writing one copy in a degraded state.

in either the global or osd section. Now we can create OSDs.

  1. Create a filesystem on whatever device you want to use. XFS is the currently recommended filesystem type.
  2. Run ceph osd create. This will print a number, which is the next available OSD number.
  3. Mount your filesystem on /var/lib/ceph/osd/ceph-{OSDNUM}.
  4. If you want your journal on another (faster) device, create it and symlink /var/lib/ceph/osd/ceph-{OSDNUM}/journal to it.
  5. Run ceph-osd -i {OSDNUM} --mkfs --mkkey
  6. From your admin host, run ceph-deploy osd activate hostname:/var/lib/ceph/osd/ceph-{OSDNUM}
  7. If you are deploying on jessie, the systemd support is shit; you will have to run ceph-disk -v activate --mark-init sysvinit --mount /var/lib/ceph/osd/ceph-{OSDNUM} manually on the node to make it finish and start the OSD.
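Put together, steps 1-6 look roughly like this sketch. It only echoes the commands so you can review them before running them on the right hosts; the device path is a placeholder, and the OSD number is whatever ceph osd create printed:

```shell
# make_osd NUM DEVICE -- print the OSD-creation commands for review.
# NUM comes from `ceph osd create`; DEVICE is the data device.
make_osd() {
    num=$1; dev=$2
    dir=/var/lib/ceph/osd/ceph-$num
    echo "mkfs.xfs $dev"
    echo "mkdir -p $dir"
    echo "mount $dev $dir"
    echo "ceph-osd -i $num --mkfs --mkkey"
    echo "ceph-deploy osd activate $(hostname):$dir"   # run from the admin host
}
make_osd 0 /dev/vg0/osd0
```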

Using Ceph, part 1

First thing I tried was to fire up one of my VMs on a Ceph block device. I'm using Xen without the benefit of libvirt, so I'll be using the kernel module method. If you are using libvirt or qemu, they may have support for using Ceph directly. First, create the device: rbd create --size 1024 foo. Then make it visible to the OS: rbd map rbd/foo --id admin. I will probably have to script that so it happens whenever I start the VM, or on reboot. In either case, /dev/rbd/rbd/foo should now exist and can be used like any other device.

Since I wanted to see what kind of a hit I would take moving from direct local access via dm-cache, I first ran the phoronix test suite on my VM, then shut it down, copied the filesystem to the Ceph device, fired it up, and repeated the test. The first thing I noticed during the copy and the first test run was that my journals were constantly filling up. Whenever this happens the OSD pauses to flush the journal, so this obviously hurts the results. Maximum network I/O topped out at 200 Mbit/s. CPU usage on my newer boxes was about 20%. Baseline was an 11-hour run time, and my first Ceph test came in at 15 hours.

Obviously, I want to try this again without the journals filling up. On my baseline run I topped out at 50 MB/s in and out, so 100 MB/s over 10 s is about 1G, so I'll resize my journals to 1G and re-run. The procedure for this is

  1. ceph osd set noout
  2. /etc/init.d/ceph stop osd.N
  3. ceph-osd -i N --flush-journal
  4. Resize the journal. This may mean changing partition or LV sizes. If it's a file, just remove it and let the next command re-create it. If you are putting journals on an SSD partition, you may want to add something like osd journal = /srv/ceph/osd$id/journal in ceph.conf, rather than messing with symlinks.
  5. ceph-osd -i N --mkjournal
  6. /etc/init.d/ceph start osd.N
  7. ceph osd unset noout

My third run lasted 12.5 hours. Now, these wallclock times are somewhat meaningless, since the phoronix test suite runs each test a variable number of times until it gets consistent results. My baseline run ran each test three times, whereas the second usually needed 4 or 5 runs, so I needed to delve into the individual test results. These were mixed. Dbench, postmark, and AIO-stress showed a major performance hit under Ceph. The rest showed anywhere from a 20% hit to a 20% improvement. Looking deeper, it seems write performance takes a hit of around 10%, but read performance can actually be better under Ceph, especially with multiple readers. This makes sense, since simultaneous reads can go to multiple OSDs in parallel, whereas local reads are funneled through a single device.

In production, I want the mapping of the rbd volume to happen automatically when I start a Xen VM. I found a hotplug script that handles mapping and unmapping. It only needed a slight tweak: I added --id admin to the map command. Use the syntax in the readme file instead of what's documented in the script.

Making a filesystem

Now on to playing with filesystems. The first thing is to set up the metadata servers. I'm going to run one on each server, one active and two standby. ceph-deploy mds create failed because we've edited /etc/ceph/ceph.conf, so I had to run it again with the --overwrite-conf option. To avoid losing my changes, I copied the ceph.conf file from one of my servers to the home directory on my admin box and let ceph-deploy copy it back. To create a filesystem, you need two pools, one for the data and one for the metadata. The documentation claims that these are created by default, but they weren't on my system.

Pool creation

When you create these pools, you have to specify the number of placement groups. The usual formula is number of OSDs * 100 / replicas, rounded up to the next power of two. Placement groups consume CPU and memory, and their number can only be increased. On the one hand, you want to keep it as small as possible; on the other hand, if you change it, the cluster will have to re-balance, which could trigger a lot of traffic. (Re-balancing the 20G I was using at this point took less than a minute.) What the docs don't explicitly mention is that the placement groups for each pool are additive, and you don't want to go above 300 placement groups per OSD. Exceeding this can cause performance problems, and ceph status will give a health warning. So really, the formula should be more like ( OSDs * 100 ) / ( replicas * pools ).
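That adjusted formula is easy to sanity-check; here it is as a tiny helper, fed the numbers from this cluster (3 OSDs, 2 replicas, 2 pools):

```shell
# pg_count OSDS REPLICAS POOLS -- (osds * 100) / (replicas * pools),
# rounded up to the next power of two.
pg_count() {
    raw=$(( $1 * 100 / ($2 * $3) ))
    pg=1
    while [ "$pg" -lt "$raw" ]; do pg=$(( pg * 2 )); done
    echo "$pg"
}
pg_count 3 2 2   # -> 128
```

Note that 2 pools of 128 PGs at 2 replicas over 3 OSDs works out to roughly 170 PGs per OSD, comfortably under the 300 limit.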

If you wind up with too many placement groups per OSD, you have a few options: first, add more OSDs; second, drop a pool and re-create it; third, tell ceph to STFU. The parameter you want is mon_pg_warn_max_per_osd, and you can either set it to a higher value or to 0 to disable the warning. You can add it to the ceph.conf file and it will be picked up when the mons restart. You can also inject the new parameter into the running mons with the command ceph tell mon.* injectargs '--mon_pg_warn_max_per_osd=350'

Making filesystems, part 2

Now that I have the pools, I can create a filesystem: ceph fs new <name> <metadata> <data>. To mount it, you need an ID and key, so run ceph auth list and use admin and its key: mount -t ceph <mon1 IP>,<mon2 IP>,<mon3 IP>:/ /u/ceph -o name=admin,secret=hfhfjkdfhudishgbds== You need to use the actual IP addresses, not hostnames. You don't even need any ceph software installed on the client if you have proper kernel support.
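To make the mount permanent, the equivalent fstab line would look something like this; the monitor IPs are placeholders, and the secret is the same dummy key as above:

```
# /etc/fstab -- kernel cephfs mount; no fsck, wait for the network
192.168.1.1,192.168.1.2,192.168.1.3:/  /u/ceph  ceph  name=admin,secret=hfhfjkdfhudishgbds==,noatime,_netdev  0 0
```

Be aware that a secret= option in fstab is world-readable; mount.ceph's secretfile= option avoids that, but requires the ceph client tools on the box.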

I set mythtv to use it for some low-priority recordings, and resource usage was unnoticeable from background activity.

Tuning and growing

Set mon_osd_down_out_subtree_limit = host in the [global] section. This will allow one of the hosts to go down without all of the OSDs being marked out and causing a massive rebuild.

Set osd max backfills = 1 to limit the amount of traffic caused by adding or deleting an osd. You can change this on the fly with the command ceph tell osd.* injectargs '--osd-max-backfills 1'
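Both of these settings can live in ceph.conf so they survive restarts; something like:

```
[global]
mon osd down out subtree limit = host   # don't mark a whole host's OSDs out

[osd]
osd max backfills = 1   # throttle rebuild traffic per OSD
```

The injectargs commands above change the running daemons; the config file makes it stick.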


Since my cluster was working OK, I picked up a few 3TB disks on sale and decided to add them to the cluster. This caused a few wobblies, because my existing OSDs were on the order of 100M. When you add an OSD to a cluster it is assigned a weight, which is roughly its size in TB. I had spare slots on two of my three hosts, so I could add these without shifting data around on my existing filesystems, but on the third host I was planning to move some data onto the cluster so I could decommission a small (500G) disk and replace it with the final 3TB drive. Anyway, the addition of these two disks caused a number of placement groups to become unclean. The workaround was to slowly increase the weight of the OSD on the box without a 3TB drive until they fixed themselves, using the command ceph osd crush reweight name weight. I'll probably have to use it again, as I will likely be disassembling my existing volume groups on raids piecewise and reassigning the disks to the cluster.

Finally, I'll remove the tiny test OSDs from the cluster. For small clusters, it's recommended that you re-weight the OSD you want to remove to 0 and wait for the data to migrate before setting it out.

  1. ceph osd crush reweight osd.0 0
  2. ceph -w and wait for the data migration to stop
  3. ceph osd out 0
  4. /etc/init.d/ceph stop osd.0
  5. ceph osd crush remove osd.0
  6. ceph auth del osd.0
  7. ceph osd rm 0

Cache pressure

After using this setup for a few months, I started getting messages like "client not responding to cache pressure" during busy periods. It doesn't seem to cause any noticeable problems, but to diagnose and fix it, use the following steps:

Run ceph daemon mds.hostname perf dump mds and look for inodes and inodes_max. These are the current cache size and the maximum. If the cache is indeed full, mount the ceph filesystem with the -o dirstat option and cat the mountpoint; rentries_number will give the total number of inodes. (You could always use find | wc, but the mount trick is faster.) Use the mds cache size option in ceph.conf to set a more reasonable value and restart the mds. In my case, inodes_max was 100k and rentries_number was a bit over 200k.
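The fix in my case was a ceph.conf fragment along these lines; the 250k figure is sized to my ~200k inodes with some headroom, so tune it to your own rentries_number:

```
[mds]
mds cache size = 250000  # inodes to cache; the default here was 100000
```

A bigger cache trades MDS memory for fewer cache-pressure recalls, so keep an eye on the mds process's RSS after raising it.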