Ceph Object Storage as fast as it gets or Benchmarking Ceph



CenterDevice is distributed document management and sharing software without any single centralized component. In our next evolution we are going to use the distributed object store Ceph for storing our encrypted documents. In this article, my colleague Daniel Schneller and I present benchmarks for CenterDevice’s current Ceph installation. As it turns out, Ceph is as fast as it gets. In fact, Ceph gives you a reliable, cheap, and performant alternative to commercial SANs and distributed file storage.

Setup

We use four Dell PowerEdge R510 servers, each with 128 GB RAM and 14x4 TB disks — two mirrored disks for the OS and 12 disks for Ceph storage. The servers are connected via 4x1 Gbit network interfaces to two redundant switches. Two network interfaces each are bonded to form a high-availability link according to IEEE 802.3ad. This way, a link or a switch can fail while connectivity continues over the remaining link. Furthermore, using layer 3+4 link aggregation via LACP, different TCP streams may use both links, increasing network throughput.

Before measuring Ceph’s Object Store performance, we establish a baseline for the expected maximum performance by measuring the performance of the disks and the network.

Baseline - Disk

For the disk performance baseline, we proceed in two steps. First, we measure the performance of a single disk. Second, we determine the performance of the whole disk subsystem of one server by stressing all disks at once.

To determine the write speed, we use dd to write data as fast as possible. It’s important to circumvent the OS’ disk cache to get realistic results, hence oflag=direct. We write a 10 GB file by reading from /dev/zero:
> dd if=/dev/zero of=/var/lib/ceph/osd/ceph-10/deleteme bs=10G count=1 oflag=direct
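
One caveat with this invocation: bs=10G makes dd allocate a single 10 GiB buffer, and depending on the platform a single read may return fewer bytes than requested, so count=1 may transfer less than the full 10 GB. A more conservative equivalent writes the same amount in 1 MiB blocks; the same consideration applies to the read commands further below:
> dd if=/dev/zero of=/var/lib/ceph/osd/ceph-10/deleteme bs=1M count=10240 oflag=direct # same 10 GB, written in 1 MiB blocks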

For the second step, we start the same dd process for all 12 data disks simultaneously — every dd is put in the background to run in parallel.
> for i in `mount | grep osd | awk '{print $3}'`; do dd if=/dev/zero of=$i/deleteme bs=10G count=1 oflag=direct & done

To determine the read speed, we read the files we just created, again bypassing the cache (iflag=direct), first from one and then from all disks:
> time dd if=/var/lib/ceph/osd/ceph-10/deleteme of=/dev/null bs=10G count=1 iflag=direct
> for i in `mount | grep osd | awk '{print $3}'`; do dd if=$i/deleteme of=/dev/null bs=10G count=1 iflag=direct & done
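
While the parallel dd jobs are running, it can be useful to watch the per-disk throughput live. Assuming the sysstat package is installed, iostat shows extended per-device statistics in MB/s, refreshed every five seconds:
> iostat -mx 5 # -m reports MB/s, -x shows extended per-device statistics, refreshed every 5 seconds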

The results are shown in figure 1. As you can see, a single disk can write up to 154 MB/s and read up to 222 MB/s. If all disks are used at the same time, the write speed drops to an average of 93 MB/s per disk and the read speed to 142 MB/s, which are still pretty good results.

![Figure 1 — Write performance for 1 and 12 disks in parallel.](https://blog.codecentric.de/wp-content/blogs.dir/2/files/2014/03/1Perf1vs12Disks.png)
Figure 1 — Write performance for 1 and 12 disks in parallel.

Baseline - Network

The Ceph documentation suggests two separate, dedicated networks: a cluster network and a public network. Ceph uses the cluster network for internal synchronization and replication, while the public network handles client requests. We use four physical links, bonded together in pairs to form one link for each of the two networks. See the configuration of a bonded interface on Ubuntu below — we use a little trick to give the VLAN-specific interface a human-readable name instead of something like bond0.105. This makes the configuration less error prone.

> cat /etc/network/interfaces
...
auto bond2
iface bond2 inet manual
bond-slaves p2p3 p2p4 # interface to bond
bond-mode 802.3ad # activate LACP
bond-miimon 100 # monitor link health
bond-xmit_hash_policy layer3+4 # use Layer 3+4 for link selection
pre-up ip link set dev bond2 mtu 9000 # set Jumbo Frames

auto vlan-ceph-clust
iface vlan-ceph-clust inet static
pre-up ip link add link bond2 name vlan-ceph-clust type vlan id 105 # Little trick
  # to set human readable interface name
pre-up ip link set dev vlan-ceph-clust mtu 9000 # Jumbo Frames
post-down ip link delete vlan-ceph-clust # unset human readable interface name
address 10.102.5.12 # IP config
netmask 255.255.255.0
network 10.102.5.0
broadcast 10.102.5.255
...
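
Before running any benchmarks, it is worth verifying that the bond actually negotiated 802.3ad mode and that both slaves are active. The kernel exposes the bonding state under /proc; a quick check, assuming the bond is named bond2 as in the configuration above:
> cat /proc/net/bonding/bond2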

The theoretical maximum throughput of the bonded network interfaces is 2 Gbit/s = 250 MB/s. iperf is an excellent tool to measure network throughput, so we use it for this measurement. We start an iperf server process on node01 and one iperf client each on node02 and node03, each sending via two TCP streams:

[node02] > iperf -c node01.ceph-cluster -P 2
[node03] > iperf -c node01.ceph-cluster -P 2
[node01] > iperf -s -B node01.ceph-cluster
------------------------------------------------------------
Server listening on TCP port 5001
Binding to local address node01.ceph-cluster
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
[ 4] local 10.102.5.11 port 5001 connected with 10.102.5.12 port 49412
[ 5] local 10.102.5.11 port 5001 connected with 10.102.5.12 port 49413
[ 6] local 10.102.5.11 port 5001 connected with 10.102.5.13 port 59947
[ 7] local 10.102.5.11 port 5001 connected with 10.102.5.13 port 59946
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.0 sec 342 MBytes 286 Mbits/sec
[ 5] 0.0-10.0 sec 271 MBytes 227 Mbits/sec
[SUM] 0.0-10.0 sec 613 MBytes 513 Mbits/sec
[ 6] 0.0-10.0 sec 293 MBytes 246 Mbits/sec
[ 7] 0.0-10.0 sec 338 MBytes 283 Mbits/sec
[SUM] 0.0-10.0 sec 631 MBytes 529 Mbits/sec

The results are rather disappointing: we achieve only roughly 513 Mbit/s + 529 Mbit/s, or about 130 MB/s in total. What’s wrong here?

IEEE 802.3ad using layer 3 and 4

IEEE 802.3ad uses the TCP connection 4-tuple (source IP, source port, destination IP, destination port) to determine which physical link of a bond to send data over.[1] Yes, send: according to the standard, the sender decides which physical link to use. This means that our two senders (node02 and node03) each select a sending link and start to transmit data as fast as possible. But the switch connecting the senders to the receiver (node01) does not know about that. All it sees are layer 2 frames destined for a specific MAC address, and it forwards them accordingly. So while the two senders each fully utilize a 1 Gbit/s link, the switch forwards to the receiver over one single physical link, which results in the observed throughput.[2]
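
To make the link selection more tangible, here is a minimal sketch of a layer 3+4 transmit hash. The port numbers and address octets are taken from the iperf output above; the formula is a deliberate simplification, not the bonding driver's exact implementation:

src_port=49412; dst_port=5001   # one of the TCP streams from the iperf run
src_ip=12; dst_ip=11            # last octets of 10.102.5.12 and 10.102.5.11
hash=$(( (src_port ^ dst_port) ^ (src_ip ^ dst_ip) ))
echo "this stream is pinned to physical link $(( hash % 2 ))"   # the bond has 2 slave links

Because the tuple is constant for the lifetime of a connection, every frame of that stream hashes to the same slave, which is exactly why a single TCP stream can never exceed one physical link.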

Baseline - Network: done right

If we switch the experiment’s parameters to one sender and two receivers, we get the expected results. This time, node01 sends to node02 and node03 simultaneously via two separate processes. We simply use netcat to send and pv to display the throughput.

[node02] > netcat -l -p 5001 > /dev/null
[node03] > netcat -l -p 5001 > /dev/null
[node01] > dd if=/dev/zero | pv | netcat node02.ceph-cluster 5001 &
173GB 0:25:36 [ 115MB/s]
[node01] > dd if=/dev/zero | pv | netcat node03.ceph-cluster 5001 &
173GB 0:25:36 [ 115MB/s]

Here we go. Now we see interface bonding in action — cf. figure 2.

![Figure 2 — Network performance for 1 and 2 streams over a bonded network interface.](https://blog.codecentric.de/wp-content/blogs.dir/2/files/2014/03/21vs2streams.png)
Figure 2 — Network performance for 1 and 2 streams over a bonded network interface.

This means that the maximum transmission speed over a bonded link is around 230 MB/s, depending on the generated load and the number of processes producing it. The maximum write speed for a single disk is around 150 MB/s, and around 90 MB/s per disk when all disks are used at the same time. Read speeds are 220 MB/s and 140 MB/s respectively.

Ceph Object Store

Ceph is a Reliable Autonomic Distributed Object Store (RADOS) that has no single point of failure because there is no central component, making it a perfect fit for CenterDevice’s architecture. In contrast to other distributed stores, Ceph locates and stores objects purely algorithmically. Every client only needs to apply the CRUSH (Controlled Replication Under Scalable Hashing) algorithm to compute which object storage daemon (OSD) is responsible for storing a particular object. Since different objects are stored by different OSDs, a client sending multiple objects talks to multiple nodes, which allows it to make use of bonded links as described above.
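
The following sketch illustrates the idea of purely algorithmic placement: every client derives the responsible OSD from the object name alone, without asking a central directory. It is a deliberate simplification of the real CRUSH algorithm, which additionally maps objects to placement groups and takes weights, replica counts, and failure domains into account; the object name below is made up:

object="customer-document-4711"               # hypothetical object name
num_osds=48                                   # 4 nodes x 12 data disks
hash=$(printf '%s' "$object" | cksum | awk '{print $1}')
echo "$object is stored on osd.$(( hash % num_osds ))"   # every client computes the same result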

Ceph has an integrated benchmark program which we use for measuring object store performance: rados bench. In general, this benchmark writes objects as fast as possible to a Ceph cluster and reads them back sequentially afterwards — see rados bench --help for details.[3] Again, we proceed in two steps. First, we run only one load generator; second, we run four load generators, one on each Ceph node. It is important to note that every object is replicated once for data safety, leading to load on both the public and the cluster network.

[node01] > rados bench -p data 300 write --no-cleanup && rados bench -p data 300 seq # --no-cleanup leaves written data in ceph for read benchmark
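
For the second run we start the same command on all four nodes at the same time. A small wrapper along the following lines does the trick (a sketch only; it assumes the nodes are reachable as node01 through node04 and that passwordless ssh is set up):

for n in node01 node02 node03 node04; do
  ssh "$n" "rados bench -p data 300 write --no-cleanup" &   # start the write benchmark on every node
done
wait                                                        # block until all four runs have finished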

Figure 3 presents the results for both runs.

![Figure 3 — Ceph read and write performance for 1 and 4 load generators.](https://blog.codecentric.de/wp-content/blogs.dir/2/files/2014/03/3radosbench.png)
Figure 3 — Ceph read and write performance for 1 and 4 load generators.

The write speed for a single load generator is quite surprising: at almost 245 MB/s it nearly reaches the theoretical network throughput. There are two explanations for this observation. Since all nodes have 128 GB RAM, there is plenty of room to buffer data before actually writing it to the disks; in this way, network and disk IO are decoupled. In addition, the load generators run directly on the Ceph worker nodes. Since we have 4 nodes, 1/4 of the data to write is delivered locally. Accordingly, write speed drops significantly when load is generated by all 4 nodes. Interestingly, the average write performance is still greater than that of an individual disk, even though every object is replicated once, which doubles the load on the cluster network. The big surprise is the read speed, though, which remains constant and independent of the workload.

When we compare these results with our baseline, we see that even in the high-load situation (4 load generators on 4 worker nodes) the write and read speeds are higher than those of a single local disk, which is quite impressive considering that all writes were duplicated for data safety.

Ceph Performance Conclusion

Our internal tests cover many more scenarios than presented in this blog post. During all our tests, Ceph did its job without any quirks. The performance remains very stable, even in high-load situations and independent of the amount of data stored. In our opinion, Ceph is an excellent choice for storing large amounts of data, and it outperforms our former solution. Ceph gives you a reliable, cheap, and performant alternative to commercial SANs and distributed file storage.

We also have to say that the observed performance is the result of a carefully tweaked hardware and software configuration. So while Ceph is an excellent choice for distributed object and file storage, installation and configuration require a substantial amount of Unix, networking, and Ceph internals knowledge.

There’s more

Ceph offers way more than the described object storage. It allows you to introduce multiple levels of redundancy — we are going to use threefold replication in production — and snapshotting for versioning objects. On top of the object store, Ceph provides an Amazon S3 and OpenStack Swift compatible REST gateway (RADOSGW) and even a file system layer called CephFS. We benchmarked these layers, too, but omitted them here for brevity.

Feel free to contact us if you have questions and comments.

Footnotes

1. If frames of a single TCP stream were sent over both links, out-of-order reception could occur. In such a scenario, TCP would assume packet loss due to duplicate ACKs and eventually reduce the sending speed significantly. Therefore, packets of one TCP stream always travel over the same wire.
2. A few frames are actually forwarded over the second link, as the average throughput is slightly higher than the maximum single-link speed of 125 MB/s.
3. Even though listed in the help output, random reads are currently not implemented.


First published at codecentric Blog.