Let’s say you’re building a central-database system that you may want to make into a distributed system later. Then you don’t want to tie yourself to serial numeric identifiers (i.e. the auto-increment ids that Ruby on Rails is full of).
What do distributed platforms do?
They leave the id-generation problem to the user (though they will provide defaults based on some sufficiently unique number). IDs are strings (UTF-8 or ASCII-safe) and can be quite long:
- Amazon SimpleDB allows an identifier of up to 1024 bytes.
- Google AppEngine allows up to 500 bytes according to rumor.
- memcached allows up to 250 ASCII-safe characters, according to the protocol spec.
250 characters seems like a pretty large upper limit.
128 random bits should be unique enough for anybody
UUIDs are 128 bits and are encoded as 36 characters (32 base16 digits plus 4 dashes). The probability of an identifier collision is really, really tiny (random version 4 UUIDs have 122 random bits).
Unfortunately, UUIDs are ugly:
http://example.com/68ff9b72-7b6a-4ea4-b35f-77ff50f938fb
It is just not a nice URL. It would be nice if we could take 128-bit numbers and encode them as base64, or maybe base62 or url-safe base64, or maybe even as base36 for increased compatibility. A 128-bit number is 22 characters in base64 and 25 characters in base36. You end up with:
http://example.com/f5lxx1zz5pnok6cyejdnd7ri9
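As a quick sanity check on those numbers, here is a minimal Python sketch (the helper name to_base36 is my own) that converts a random UUID’s 128-bit integer value to base36:

    import uuid

    ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"

    def to_base36(n):
        # repeated divmod collects digits least-significant first, hence the reversal
        digits = []
        while True:
            n, r = divmod(n, 36)
            digits.append(ALPHABET[r])
            if n == 0:
                break
        return "".join(reversed(digits))

    u = uuid.uuid4()
    print(len(str(u)))                            # 36: hex digits plus dashes
    print(len(to_base36(u.int).rjust(25, "0")))   # 25: the same 128 bits in base36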
What about 64 bits?
If we went with 64-bit numbers, we’d sacrifice quite a bit of collision-prevention, but maybe not so much that it is scary on a per-application basis: by the birthday approximation, you’d expect a 50% chance of a collision only after generating roughly 2^32 (about 4 billion) ids.
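To put rough numbers on that, a back-of-the-envelope sketch (my own, using the standard n²/2N birthday approximation):

    # birthday approximation: p(collision) ~= n^2 / (2 * 2^bits)
    def collision_probability(n, bits=64):
        return n * n / float(2 ** (bits + 1))

    print(collision_probability(10 ** 6))   # ~2.7e-08 for a million ids
    print(collision_probability(2 ** 32))   # ~0.5 around 4 billion ids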
What is also interesting is that lots of software supports operations on 64-bit numbers much better than on 128-bit ones. We would end up with 13 characters in base36 (11 in base64). In base36 that looks like this:
http://app.example.com/3w5e11264sgsf
That seems kind-of good enough, for now. Letting collisions surface as failed inserts into the database (via a unique key) seems like a reasonable way to handle them, especially if our application is rather neat REST (so a failed PUT can be retried pretty safely).
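A minimal sketch of that retry loop in Python, assuming the gen_id() helper defined in the code sections below plus a hypothetical insert_thing() database call that raises a (likewise hypothetical) DuplicateKeyError on a unique-key violation:

    def create_thing(data, max_attempts=3):
        # collisions are astronomically rare, so a couple of retries is plenty
        for attempt in range(max_attempts):
            new_id = gen_id()
            try:
                insert_thing(new_id, data)  # hypothetical DB call; fails on duplicate key
                return new_id
            except DuplicateKeyError:
                continue
        raise RuntimeError("could not generate a unique id")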
Moving to a distributed system safely is possible if we have some reasonable identifier versioning scheme (13 characters = version 0, 14 characters = scheme versions 1-10, more characters = TBD). Then in our app we match our identifiers using ^[0-9a-z][0-9a-z-]{11,34}[0-9a-z]$ (a regex which will also match UUIDs).
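A quick check in Python that the pattern accepts both id styles (using the example ids from above):

    import re

    ID_RE = re.compile(r'^[0-9a-z][0-9a-z-]{11,34}[0-9a-z]$')

    print(bool(ID_RE.match('3w5e11264sgsf')))                          # True: 13-char base36 id
    print(bool(ID_RE.match('68ff9b72-7b6a-4ea4-b35f-77ff50f938fb')))   # True: lowercase UUID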
Some Ruby
    def encode_id(n)
      # a 64-bit number is at most 13 characters in base36; zero-pad to a fixed width
      return n.to_s(36).rjust(13, '0')
    end

    def decode_id(s)
      return s.to_i(36)
    end

    def gen_id()
      # rand(2**64) covers the full 0..2**64-1 range
      # (rand(n) excludes n itself, so rand(18446744073709551615) would miss the top value)
      return encode_id(rand(2**64))
    end
Some MySQL
Besides the equivalents of the above functions, here are some ideas on how to maintain consistency for ids across data types (tables).
    CREATE FUNCTION encode_id (n BIGINT) RETURNS char(13) NO SQL
        RETURN LPAD(LOWER(CONV(n, 10, 36)), 13, '0');

    CREATE FUNCTION decode_id (n char(13)) RETURNS BIGINT NO SQL
        RETURN CONV(n, 36, 10);

    -- RAND() returns a double, so this yields far fewer than 64 random bits;
    -- the constant is about 2^64/100, which keeps the result within signed BIGINT range
    CREATE FUNCTION gen_num_id () RETURNS BIGINT NO SQL
        RETURN FLOOR(RAND() * 184467440737095516);

    CREATE FUNCTION gen_id () RETURNS char(13) NO SQL
        RETURN encode_id( gen_num_id() );

    CREATE TABLE ids (
        -- this table should not be updated directly by apps,
        -- though they are expected to read from it
        numid BIGINT unsigned NOT NULL PRIMARY KEY,
        id char(13) NOT NULL UNIQUE,
        prettyid varchar(64) DEFAULT NULL UNIQUE
    ) ENGINE=InnoDB;

    CREATE TABLE mythings (
        numid BIGINT unsigned NOT NULL PRIMARY KEY,
        id char(13) NOT NULL UNIQUE,
        prettyid varchar(64) DEFAULT NULL UNIQUE,
        something varchar(255) DEFAULT NULL
    ) ENGINE=InnoDB;

    CREATE TABLE mythings2ids (
        -- this table should not be updated directly by apps,
        -- though it's ok if they read from it
        numid BIGINT unsigned NOT NULL PRIMARY KEY,
        CONSTRAINT FOREIGN KEY (numid) REFERENCES ids (numid)
            ON DELETE cascade ON UPDATE cascade,
        CONSTRAINT FOREIGN KEY (numid) REFERENCES mythings (numid)
            ON DELETE cascade ON UPDATE cascade
    ) ENGINE=InnoDB;

    DELIMITER |

    CREATE TRIGGER mythings_before_insert BEFORE INSERT ON mythings
    FOR EACH ROW BEGIN
        INSERT INTO ids (numid, id, prettyid)
            VALUES (NEW.numid, NEW.id, NEW.prettyid);
    END |

    CREATE TRIGGER mythings_after_insert AFTER INSERT ON mythings
    FOR EACH ROW BEGIN
        INSERT INTO mythings2ids (numid) VALUES (NEW.numid);
    END |

    CREATE TRIGGER mythings_before_update BEFORE UPDATE ON mythings
    FOR EACH ROW BEGIN
        -- calling an undefined procedure raises an error, aborting the update
        IF NEW.numid != OLD.numid THEN
            CALL CANNOT_CHANGE_NUMID_AFTER_CREATION;
        END IF;
        IF NEW.id != OLD.id THEN
            CALL CANNOT_CHANGE_ID_AFTER_CREATION;
        END IF;
        IF NEW.prettyid != OLD.prettyid THEN
            IF OLD.prettyid IS NOT NULL THEN
                CALL CANNOT_CHANGE_PRETTYID_AFTER_INIT;
            ELSE
                UPDATE ids SET prettyid = NEW.prettyid
                    WHERE numid = NEW.numid LIMIT 1;
            END IF;
        END IF;
    END |

    CREATE TRIGGER mythings_after_delete AFTER DELETE ON mythings
    FOR EACH ROW BEGIN
        DELETE FROM ids WHERE numid = OLD.numid LIMIT 1;
    END |

    DELIMITER ;

    -- SELECT gen_id() INTO @nextid;
    -- INSERT INTO mythings (numid, id, prettyid, something)
    --     VALUES (decode_id(@nextid), @nextid,
    --         '2009/03/22/safe-id-names2', 'blah blah blah');
Some Python
Python lacks built-in base36 encoding. The code below is based on a sample I found, nicer than my own attempts that used recursion…
    import string
    import random

    __ALPHABET = string.digits + string.ascii_lowercase
    __ALPHABET_REVERSE = dict((c, i) for (i, c) in enumerate(__ALPHABET))
    __BASE = len(__ALPHABET)
    __MAX = 18446744073709551615L  # 2**64 - 1
    __MAXLEN = 13

    def encode_id(n):
        s = []
        while True:
            n, r = divmod(n, __BASE)
            s.append(__ALPHABET[r])
            if n == 0:
                break
        # zero-pad to the fixed 13-character width
        while len(s) < __MAXLEN:
            s.append('0')
        return ''.join(reversed(s))

    def decode_id(s):
        n = 0
        for c in s.lstrip('0'):
            n = n * __BASE + __ALPHABET_REVERSE[c]
        return n

    def gen_id():
        # randint is inclusive on both ends, so this covers the full 64-bit range
        return encode_id(random.randint(0, __MAX))
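And a quick round-trip check, assuming the functions above are loaded:

    n = 1234567890
    s = encode_id(n)           # '0000000kf12oi'
    assert decode_id(s) == n
    assert len(gen_id()) == 13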