Making Large-scale CSRF protection performant and efficient

Defining the challenge

Imagine

>> GET /popularPage HTTP/1.1
>> Host: bbc.co.uk

<< HTTP/1.1 200 OK
<< Set-Cookie: BBC-UID=$BBCUIDCOOKIE
<< <form action="poll" method="POST">
<<   <input type="text" name="comment"/>
<<   <input type="submit"/>
<< </form>

>> POST /popularPage/poll HTTP/1.1
>> Host: bbc.co.uk
>> Cookie: BBC-UID=$BBCUIDCOOKIE
>> comment=me%20too

<< HTTP/1.1 200 OK

As presented this form is vulnerable to CSRF. Basic requirements:

We need to provide reasonable protection against CSRF
We need to do this in a way that is not very intrusive (i.e. this is not a banking app, it is a comment form)
We need to support users that use privacy software that spoofs the Referer header
We need to support users that have disabled javascript or are using a web browser that does not support javascript
We need to support users that have disabled cookies
We need to support users with old known-broken web browsers
We need to support both logged-in and anonymous users

So far, this problem is easy: we can use well-documented techniques based on unique tokens that are generated and checked server-side. We can also choose to improve the situation a little bit more when users support referer headers, cookies, javascript, etc.

But, here is the additional requirement that makes the problem a tad more difficult:

We need to do this CSRF protection in an efficient and scalable way (popular BBC pages receive a lot of web traffic).

In particular, we should optimize for the fact that most users requesting /popularPage will not submit this form. We also really want to avoid the use of server-side sessions, or any other similar constructs that become hideously expensive when you have multiple data centers each containing many web servers.

Being smart

Imagine we have a cheap way to generate lots of tokens that cannot be spoofed, are strongly associated with the user and/or their web browser, and are easy to verify just by looking at the token. Next, imagine that rather than store the token when it is generated, we only store the token once we receive it as a form submission. If instead we find that the token is already stored, we’re probably victim of a CSRF attack, so we then have to reject the submission.

It looks roughly like this on the wire:

>> GET /popularPage HTTP/1.1
>> Host: bbc.co.uk

<< HTTP/1.1 200 OK
<< Set-Cookie: BBC-UID=$BBCUIDCOOKIE
<< <form action="poll" method="POST">
<<   <input type="text" name="comment"/>
<<   <input type="hidden" name="bbcuid" value="$BBCUIDCOOKIE"/>
<<   <input type="hidden" name="nonce" value="$NONCE"/>
<<   <input type="submit"/>
<< </form>

>> POST /popularPage/poll HTTP/1.1
>> Host: bbc.co.uk
>> comment=me%20too&bbcuid=$BBCUIDCOOKIE&nonce=$NONCE

<< HTTP/1.1 200 OK

The nonce is calculated using something like this:

  BBCUID := COOKIE["BBC-UID"] or PARAM["bbcuid"] or RANDOM_STRING()
  SECRET := "someconstant"
  X := tstamp()
  NONCE := MD5( X, SECRET, BBCUID ) + X

To verify the nonce:

extract the timestamp X out of the submitted nonce
if X has expired, reject
re-calculate the nonce using the X and BBCUID from the submitted nonce
if submitted nonce and calculated nonce do not match, reject
if the nonce exists in the seen-nonce-store, reject
store the nonce (with an expiry timestamp) in the seen-nonce-store

It’s pretty important to verify the nonce after verifying all the other form data (but before actually processing the form data). That way, if the form fails to validate, you can present the user with the old nonce and they can try again.

Some of my esteemed colleagues, in particular Tom Yandell, came up with the above scheme and are coding it up at the moment. I’ve also thought about it for a while, and I suspect this’ll work well.

Using the right kind of storage

You may have wondered, “what is this seen-nonce-store you speak of?”. I am wondering too, ever since I was asked to suggest a solution 🙂

We’re not processing credit card transactions here so we obviously don’t need ACID. All we’re doing is protecting against a relatively obscure website attack, which even when actually exploited will not cause that many problems for us (we don’t even sell books). Our store does have to be fast and available, but we’ll sacrifice many things to achieve that!

Getting more specific, let’s say the conceptual API for our imaginary seen-nonce-store is something like this:

public interface NonceStore {
  /**
   * Check if a nonce already exists in the store.
   *
   * @param nonce the nonce to check for
   * @param timeout maximum time in milliseconds to spend looking
   *     for the nonce
   * @throws NonceNotFoundOntimeException if the timeout is reached
   *     without finishing the existence test
   * @throws TransientException if there was some error that prevents
   *     the existence test from completing
   */
  public boolean exists(String nonce, int timeout)
      throws NonceNotFoundOnTimeException, TransientException;

  /**
   * Store a nonce in the store. This is an asynchronous operation
   * that may fail.
   *
   * @param nonce the calculated nonce
   * @param tstamp the timestamp that was used for calculating the nonce
   * @param bbcUid the bbcUid that was used for calculating the nonce
   * @param nonce the nonce to store
   * @param expireAfter time in minutes to attempt to hold on to
   *     the nonce
   */
  public boolean store(String nonce, long tstamp, String bbcUid,
          int expireAfter);
}

Let’s further assume this seen-nonce-store is some kind of distributed horizontally scaled-out database that is usually accessed from apache/php over a tcp socket. There are some pretty specific characteristics to optimize for:

exists() has to be fast
exists() should fail rather than be too slow
exists() failures should be treated the same as non-existence
store() has to be fast
store() should ignore duplicate nonces
storage does not have to be consistent
storage does not quite have to be eventually consistent
it is acceptable to lose some data some of the time
it is acceptable to fail to store some data some of the time

What we’re looking for here is some kind of distributed tree with built-in expiry features. Some observations:

Not having to support a get() means that we should be able to do better than a generic key/value store.
Being able to lose some data should help when making store() fast. We can probably store() to local memory immediately and persist somewhere else later, asynchronously, in batches.
Needing only very weak consistency needs should help more. We can fsync() to disk rather infrequently if at all, and node-to-node replication doesn’t have to worry too much about ordering, retries, duplicate prevention, or any number of other problems.
exists() tests are likely to be heavily biased towards younger nonces. We could perhaps look into keeping old nonces around but in some kind of cheaper storage, etc.
Due to how we load balance, there is likely to be some per-user data center affinity that can be exploited. Replication across data centers seems somewhat sacrifice-able.
If we do not store the bbcUid (nothing says we have to), our storage and network traffic does not have any particular privacy or security concerns beyond making it sufficiently unlikely that replication traffic gets blocked. Things like multicast seem ok, use of SSL seems not needed at all.

Theory vs practice vs future practice

Due to the difficulty of building and provisioning such optimized pieces of infrastructure, our initial nonce storage system will use a combination of memcache and our in-house distributed key/value store which is built on top of CouchDB. They’re probably not the best hammers for this problem, but they’re good enough to deal with this particular nail for us, for now.

That said, I’d think its an interesting challenge to come up with something really solid, and define the de facto standard for session-less CSRF. It’d be nice if there was something as obvious and maintenance-free and easy-to-use and fast as memcached, with a really robust open source C/PHP library on top. Do you have any suggestions of where to start? Have you solved any similar problems? How?