Capacity planning for the network

Lesson learned: on our HP blades, with the standard old crappy version of memcached that comes with red hat, when we use them as the backend for PHP’s object cache, we can saturate a 1gbit ethernet connection with CPU usage of about 20-70%:

Zenoss/RRD graph of memcache I/O plateau at 1gbit

No, we did not learn this lesson in a controlled load test, and no, we didn’t know this was going to be our bottleneck. Fortunately, it seems we degraded pretty gracefully, and so as far as we know most of the world didn’t really notice 🙂

Immediate response:

  1. take some frontend boxes out of load balancer to reduce pressure on memcached
  2. repurpose some servers for memcached, add frontend boxes back into pool
  3. tune object cache a little

Some of the follow-up:

  • redo some capacity planning paying more attention to network
  • see if we can get more and/or faster interfaces into the memcached blades
  • test if we can/should make the object caches local to the frontend boxes
  • test if dynamically turning on/off object caches in some places is sensible

I have to say it’s all a bit embarrassing – forgetting about network capacity is a bit of a rookie mistake. In our defense, most people doing scalability probably don’t deal with applications that access over 30 megs of object cache memory to service one request. The shape of our load spikes (when we advertise websites on primetime on BBC One) is probably a little unique, too.

update: I was mistakenly using “APC” in the above to mean “cache” but APC is just the “opcode cache” and is completely disjoint from “object cache”. D’oh!