Let’s say you’re building a central-database system that you may want to make into a distributed system later. Then you don’t want to tie yourself to serial numeric identifiers (i.e. the auto-increment ids that Ruby on Rails is full of).
What do distributed platforms do?
They leave the id-generation problem to the user (though they will provide defaults based on some sufficiently unique number). IDs are strings (UTF-8 or ASCII-safe) and can be quite long:
- Amazon SimpleDB allows an identifier of up to 1024 bytes.
- Google AppEngine allows up to 500 bytes according to rumor.
- memcached allows up to 250 ASCII-safe characters, according to the protocol spec.
250 characters seems like a pretty large upper limit.
128 random bits should be unique enough for anybody
UUIDs are 128 bits and are encoded as 36 characters (32 base16 digits plus 4 dashes). The probability of an identifier collision is really, really tiny (random version 4 UUIDs have 122 random bits).
Unfortunately, UUIDs are ugly:
http://example.com/68ff9b72-7b6a-4ea4-b35f-77ff50f938fb
It is just not a nice URL. It would be nice if we could take 128-bit numbers and encode them as base64, or maybe base62 or url-safe base64, or maybe even as base36 for increased compatibility. A 128-bit number is 22 characters in base64 and 25 characters in base36. You end up with:
http://example.com/f5lxx1zz5pnok6cyejdnd7ri9
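As a quick sanity check on those numbers, here is a minimal Python sketch (the helper name to_base36 is my own) that converts a random UUID’s 128-bit integer value to base36:

    import uuid

    ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"

    def to_base36(n):
        # repeated divmod collects digits least-significant first, hence the reversal
        digits = []
        while True:
            n, r = divmod(n, 36)
            digits.append(ALPHABET[r])
            if n == 0:
                break
        return "".join(reversed(digits))

    u = uuid.uuid4()
    print(len(str(u)))                            # 36: hex digits plus dashes
    print(len(to_base36(u.int).rjust(25, "0")))   # 25: the same 128 bits in base36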
What about 64 bits?
If we went with 64-bit numbers, we’d sacrifice quite a bit of collision-prevention, but maybe not so much that it is scary on a per-application basis: by the birthday approximation, you’d expect a 50% chance of a collision only after generating roughly 2^32 (about 4 billion) ids.
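To put rough numbers on that, a back-of-the-envelope sketch (my own, using the standard n²/2N birthday approximation):

    # birthday approximation: p(collision) ~= n^2 / (2 * 2^bits)
    def collision_probability(n, bits=64):
        return n * n / float(2 ** (bits + 1))

    print(collision_probability(10 ** 6))   # ~2.7e-08 for a million ids
    print(collision_probability(2 ** 32))   # ~0.5 around 4 billion ids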
What is also interesting is that lots of software supports operations on 64-bit numbers much better than on 128-bit ones. We would end up with 13 characters in base36 (11 in base64). In base36 that looks like this:
http://app.example.com/3w5e11264sgsf
That seems kind-of good enough, for now. Letting collisions surface as failed inserts into the database (via a unique key) seems like a reasonable way to handle them, especially if our application is rather neat REST (so a failed PUT can be retried pretty safely).
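A minimal sketch of that retry loop in Python, assuming the gen_id() helper defined in the code sections below plus a hypothetical insert_thing() database call that raises a (likewise hypothetical) DuplicateKeyError on a unique-key violation:

    def create_thing(data, max_attempts=3):
        # collisions are astronomically rare, so a couple of retries is plenty
        for attempt in range(max_attempts):
            new_id = gen_id()
            try:
                insert_thing(new_id, data)  # hypothetical DB call; fails on duplicate key
                return new_id
            except DuplicateKeyError:
                continue
        raise RuntimeError("could not generate a unique id")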
Moving to a distributed system safely is possible if we have some reasonable identifier versioning scheme (13 characters = version 0, 14 characters = scheme versions 1-10, more characters = TBD). Then in our app we match our identifiers using ^[0-9a-z][0-9a-z-]{11,34}[0-9a-z]$ (a regex which will also match UUIDs).
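A quick check in Python that the pattern accepts both id styles (using the example ids from above):

    import re

    ID_RE = re.compile(r'^[0-9a-z][0-9a-z-]{11,34}[0-9a-z]$')

    print(bool(ID_RE.match('3w5e11264sgsf')))                          # True: 13-char base36 id
    print(bool(ID_RE.match('68ff9b72-7b6a-4ea4-b35f-77ff50f938fb')))   # True: lowercase UUID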
Some Ruby
    def encode_id(n)
      # a 64-bit number is at most 13 characters in base36; zero-pad to a fixed width
      return n.to_s(36).rjust(13, '0')
    end

    def decode_id(s)
      return s.to_i(36)
    end

    def gen_id()
      # rand(2**64) covers the full 0..2**64-1 range
      # (rand(n) excludes n itself, so rand(18446744073709551615) would miss the top value)
      return encode_id(rand(2**64))
    end
Some MySQL
Besides the equivalents of the above functions, here are some ideas on how to maintain consistency for ids across data types (tables).
    CREATE FUNCTION encode_id (n BIGINT) RETURNS char(13) NO SQL
        RETURN LPAD(LOWER(CONV(n, 10, 36)), 13, '0');

    CREATE FUNCTION decode_id (n char(13)) RETURNS BIGINT NO SQL
        RETURN CONV(n, 36, 10);

    -- RAND() returns a double, so this yields far fewer than 64 random bits;
    -- the constant is about 2^64/100, which keeps the result within signed BIGINT range
    CREATE FUNCTION gen_num_id () RETURNS BIGINT NO SQL
        RETURN FLOOR(RAND() * 184467440737095516);

    CREATE FUNCTION gen_id () RETURNS char(13) NO SQL
        RETURN encode_id( gen_num_id() );

    CREATE TABLE ids (
        -- this table should not be updated directly by apps,
        -- though they are expected to read from it
        numid BIGINT unsigned NOT NULL PRIMARY KEY,
        id char(13) NOT NULL UNIQUE,
        prettyid varchar(64) DEFAULT NULL UNIQUE
    ) ENGINE=InnoDB;

    CREATE TABLE mythings (
        numid BIGINT unsigned NOT NULL PRIMARY KEY,
        id char(13) NOT NULL UNIQUE,
        prettyid varchar(64) DEFAULT NULL UNIQUE,
        something varchar(255) DEFAULT NULL
    ) ENGINE=InnoDB;

    CREATE TABLE mythings2ids (
        -- this table should not be updated directly by apps,
        -- though it's ok if they read from it
        numid BIGINT unsigned NOT NULL PRIMARY KEY,
        CONSTRAINT FOREIGN KEY (numid) REFERENCES ids (numid)
            ON DELETE cascade ON UPDATE cascade,
        CONSTRAINT FOREIGN KEY (numid) REFERENCES mythings (numid)
            ON DELETE cascade ON UPDATE cascade
    ) ENGINE=InnoDB;

    DELIMITER |

    CREATE TRIGGER mythings_before_insert BEFORE INSERT ON mythings
    FOR EACH ROW BEGIN
        INSERT INTO ids (numid, id, prettyid)
            VALUES (NEW.numid, NEW.id, NEW.prettyid);
    END |

    CREATE TRIGGER mythings_after_insert AFTER INSERT ON mythings
    FOR EACH ROW BEGIN
        INSERT INTO mythings2ids (numid) VALUES (NEW.numid);
    END |

    CREATE TRIGGER mythings_before_update BEFORE UPDATE ON mythings
    FOR EACH ROW BEGIN
        -- calling an undefined procedure raises an error, aborting the update
        IF NEW.numid != OLD.numid THEN
            CALL CANNOT_CHANGE_NUMID_AFTER_CREATION;
        END IF;
        IF NEW.id != OLD.id THEN
            CALL CANNOT_CHANGE_ID_AFTER_CREATION;
        END IF;
        IF NEW.prettyid != OLD.prettyid THEN
            IF OLD.prettyid IS NOT NULL THEN
                CALL CANNOT_CHANGE_PRETTYID_AFTER_INIT;
            ELSE
                UPDATE ids SET prettyid = NEW.prettyid
                    WHERE numid = NEW.numid LIMIT 1;
            END IF;
        END IF;
    END |

    CREATE TRIGGER mythings_after_delete AFTER DELETE ON mythings
    FOR EACH ROW BEGIN
        DELETE FROM ids WHERE numid = OLD.numid LIMIT 1;
    END |

    DELIMITER ;

    -- SELECT gen_id() INTO @nextid;
    -- INSERT INTO mythings (numid, id, prettyid, something)
    --     VALUES (decode_id(@nextid), @nextid,
    --         '2009/03/22/safe-id-names2', 'blah blah blah');
Some Python
Python lacks built-in base36 encoding. The code below is based on a sample I found, nicer than my own attempts that used recursion…
    import string
    import random

    __ALPHABET = string.digits + string.ascii_lowercase
    __ALPHABET_REVERSE = dict((c, i) for (i, c) in enumerate(__ALPHABET))
    __BASE = len(__ALPHABET)
    __MAX = 18446744073709551615L  # 2**64 - 1
    __MAXLEN = 13

    def encode_id(n):
        s = []
        while True:
            n, r = divmod(n, __BASE)
            s.append(__ALPHABET[r])
            if n == 0:
                break
        # zero-pad to the fixed 13-character width
        while len(s) < __MAXLEN:
            s.append('0')
        return ''.join(reversed(s))

    def decode_id(s):
        n = 0
        for c in s.lstrip('0'):
            n = n * __BASE + __ALPHABET_REVERSE[c]
        return n

    def gen_id():
        # randint is inclusive on both ends, so this covers the full 64-bit range
        return encode_id(random.randint(0, __MAX))
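And a quick round-trip check, assuming the functions above are loaded:

    n = 1234567890
    s = encode_id(n)           # '0000000kf12oi'
    assert decode_id(s) == n
    assert len(gen_id()) == 13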