The hardest bit in the web application platform challenge is making reasonable choices. Here’s a stab at some of them…
Hosting models
I see these basic choices:
- LAMP virtual hosting. If you can build everything you need with mysql+php and you have few enough users that you need only one database server, this is by far the easiest and cheapest option.
- Application hosting. Code on github, project management with basecamp or hosted jira, build on AppEngine or Heroku or force.com. You don’t have to do your own infrastructure, but you’re limited in what you can build, and there’s a significant risk of lock-in.
- Managed hosting. Rent (virtual) servers with pre-installed operating systems and managed networking. Expensive for large deployments, but you don’t need a full web operations team and you have a lot of flexibility (famously, Twitter does this).
- Dedicated hosting. Buy or rent servers, rent rackspace or build your own data center. You need network engineers and people that can handle hardware. Usually the only cost-effective option beyond a certain size.
Given our stated requirements, we are really only talking about option #4, but I wanted to mention the alternatives because they will make sense for a lot of people. Oh, and I think all the other options are these days called cloud computing :)
Hardware platform
I’m not really a hardware guy, normally I leave this kind of stuff to others. Anyone have any good hardware evaluation guides? Some things I do know:
- Get at least two of everything.
- Get quality switches. Many of the worst outages have something to do with blown-up switches, and since you usually have only a few, losing one during a traffic spike is uncool.
- Get beefy database boxes. Scaling databases out is hard, but they scale up nicely without wasting resources.
- Get beefy (hardware) load balancers. Going to more than 2 load balancers is complicated, and while the load balancers have spare capacity they can help with SSL, caching, etc.
- Get beefy boxes to run your monitoring systems (remember, two of everything). In my experience most monitoring systems suffer from pretty crappy architectures, and so are real resource hogs.
- Get hardware RAID (RAID 5 seems common) with a battery-backed write cache, for all storage systems. That is, unless you have some other redundancy architecture and you don’t need RAID for redundancy.
- Don’t forget about hardware for backups. Do you need tape?
Other thoughts:
- Appliances. I really like the idea. Things like the schooner appliances for mysql and memcache, or the kickfire appliance for mysql analytics. I have no firsthand experience with them (yet) though. I’m guessing oracle+sun is going to be big in this space.
- SSD. It is obviously the future, but right now they seem to come with limited warranties, and they’re still expensive enough that you should only use them for data that will actually get hot.
Operating system
Choice #1: unix-ish or windows or both. The Microsoft Web Platform actually looks pretty impressive to me these days but I don’t know much about it. So I’ll go for unix-ish.
Choice #2: ubuntu or red hat or freebsd or opensolaris.
I think Ubuntu is currently the best of the debian-based linuxes. I somewhat prefer ubuntu to red hat, primarily because I really don’t like RPM. Unfortunately red hat comes with better training and certification programs, better hardware vendor support and better available support options.
FreeBSD and solaris have a whole bunch of advantages (zfs, zones/jails, smf, network stack, many-core, …) over linux that would make linux seem like a useless toy, if it weren’t for the fact that linux sees so much more use. This is important: linux has the largest array of pre-packaged software that works on it out of the box, linux runs on more hardware (like laptops…), and many more developers are used to linux.
One approach would be solaris for database (ZFS) and media (ZFS!) hosting, and linux for application hosting. The cost of that, of course, would be the complexity in having to manage two platforms. The question then is whether the gain in manageability offsets the price paid in complexity.
And so, red hat gains another (reluctant) customer.
Database
As much sympathy as I have for the NoSQL movement, the relational database is not dead, and it sure as hell is easier to manage. When dealing with a wide variety of applications by a wide variety of developers, and a lot of legacy software, I think a SQL database is still the default model to go with. There’s a large range of options there.
Choice #1: clustered or sharded. At some point some application will have more data than fits on one server, and it will have to be split. Either you use a fancy database that supports clustering (like Oracle or SQL Server), or you use some fancy clustering middleware (like continuent), or you teach your application to split up the data (using horizontal partitioning or sharding) and you use a more no-frills open source database (mysql or postgres).
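To make the sharding option concrete, here is a minimal python sketch of routing a user’s data to one of several mysql shards. The shard list, table layout and MySQLdb usage are made-up assumptions; it shows the routing idea, not a production implementation.

    # Minimal application-level sharding sketch: pick a shard from a stable
    # hash of the user id, then query that shard. All names are illustrative.
    import MySQLdb

    SHARDS = [
        {"host": "db-shard-0.internal", "db": "app_shard_0"},
        {"host": "db-shard-1.internal", "db": "app_shard_1"},
        {"host": "db-shard-2.internal", "db": "app_shard_2"},
    ]

    def shard_for_user(user_id):
        """Pick a shard deterministically from the user id."""
        return SHARDS[user_id % len(SHARDS)]

    def get_profile(user_id):
        cfg = shard_for_user(user_id)
        conn = MySQLdb.connect(host=cfg["host"], user="app", passwd="secret",
                               db=cfg["db"])
        try:
            cur = conn.cursor()
            cur.execute("SELECT display_name, bio FROM profiles WHERE user_id = %s",
                        (user_id,))
            return cur.fetchone()
        finally:
            conn.close()

Plain modulo hashing makes adding shards painful later; a directory table or consistent hashing is more flexible, but the routing idea stays the same.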
I suspect that the additional cost of operating an oracle cluster may very well be worth paying for – besides not having to do application-level clustering, the excellent management and analysis tools are worth it. I wish someone would build a model/spreadsheet to prove it. Anyone?
However, it is much easier to find developers skilled with open source databases, and it is much easier for developers to run a local copy of their database for development. Again there’s a tradeoff.
The choice between mysql and postgres has a similar tradeoff. Postgres has a much more complete feature set, but mysql is slightly easier to get started with and has significantly easier-to-use replication features.
And so, mysql gains another (reluctant) customer.
With that choice made, I think it’s important to invest early on in providing some higher-level APIs, so that while the storage engine might be InnoDB and the access to that storage engine might be MySQL, many applications are coded to talk to a more constrained API. Things like Amazon’s S3, SimpleDB and the Google AppEngine data store provide good examples of constrained APIs that are worth emulating.
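To illustrate what I mean by a constrained API: a minimal sketch of a put/get/delete store that happens to sit on MySQL. The table name, schema and MySQLdb usage are assumptions for illustration only.

    # A deliberately constrained data API: applications get put/get/delete on
    # opaque values keyed by (bucket, key); the MySQL/InnoDB details stay hidden.
    import MySQLdb

    class SimpleStore(object):
        def __init__(self, **conn_args):
            self.conn = MySQLdb.connect(**conn_args)

        def put(self, bucket, key, value):
            cur = self.conn.cursor()
            cur.execute("REPLACE INTO kv_store (bucket, k, v) VALUES (%s, %s, %s)",
                        (bucket, key, value))
            self.conn.commit()

        def get(self, bucket, key):
            cur = self.conn.cursor()
            cur.execute("SELECT v FROM kv_store WHERE bucket = %s AND k = %s",
                        (bucket, key))
            row = cur.fetchone()
            return row[0] if row else None

        def delete(self, bucket, key):
            cur = self.conn.cursor()
            cur.execute("DELETE FROM kv_store WHERE bucket = %s AND k = %s",
                        (bucket, key))
            self.conn.commit()

Because applications never see joins or ad-hoc SQL through an API like this, you can later move a bucket to another shard, a cache, or an S3-like service without touching application code.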
HTTP architecture
Apache HTTPD. Easiest choice so far. Its swiss-army-knife characteristic is quite important. It’s what everyone knows. Things like nginx are pretty cool and can be used as the main web server, but I suspect most people that switch to them should’ve spent some time tuning httpd instead. Since I know how to do that…I’ll stick with what I know.
As easy as that choice is, the choice of what to put between HTTPD and the web seems to be harder than ever. The basic sanctioned architecture these days seems to use BGP load sharing to have the switches direct traffic at some fancy layer 7 load balancers where you terminate SSL and KeepAlive. Those fancy load balancers then may point at a layer of caching reverse proxies, which then point at the (httpd) app servers.
I’m going to assume we can afford a pair of F5 Big-IPs per datacenter. Since they can do caching, too, we might avoid building that reverse proxy layer until we need it (at which point we can evaluate squid, varnish, HAProxy, nginx and perlbal, with that evaluation showing we should go with Varnish :) ).
Application architecture
Memcache is nearly everywhere, obviously. Or is it? If you’re starting mostly from scratch and most stuff can be AJAX, http caching in front of the frontends (see above) might be nearly enough.
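Where memcache is warranted, the usual pattern is a read-through cache. A minimal sketch with python-memcached follows; the key scheme, the TTL and the load_user_from_db() helper are illustrative assumptions.

    # Read-through cache sketch: try memcache first, fall back to the database,
    # then populate the cache for subsequent requests.
    import memcache

    mc = memcache.Client(["10.0.0.1:11211", "10.0.0.2:11211"])

    def get_user(user_id):
        key = "user:%d" % user_id
        user = mc.get(key)
        if user is None:
            user = load_user_from_db(user_id)   # hypothetical DB helper
            mc.set(key, user, time=300)         # cache for 5 minutes
        return user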
Assuming a 3-tier (web, middleware, db) system, reasonable choices for the front-end layer might include PHP, WSGI+Django, and mod_perl. I still can’t see myself rolling out Ruby on Rails on a large scale. Reasonable middleware choices might include java servlets, unix daemons written in C/C++ and more mod_perl. I’d say Twisted would be an unreasonable but feasible choice :)
Communication between the layers could be REST/HTTP (probably going through the reverse proxy caches) but I’d like to try and make use of thrift. Latency is a bitch, and HTTP doesn’t help.
I’m not sure whether considering a 2-tier system (i.e. PHP direct to database, or perhaps PHP link against C/C++ modules that talk to the database) makes sense these days. I think the layered architecture is usually worth it, mostly for organizational reasons: you can have specialized backend teams and frontend teams.
If it was me personally doing the development, I’m pretty sure I would go 3-tier, with (mostly) mod_wsgi/python frontends using (mostly) thrift to connect to (mostly) daemonized python backends (to be re-written in faster/more concurrent languages as usage patterns dictate) that connect to a farm of (mostly) mysql databases using raw _mysql, with just about all caching in front of the frontend layer. I’m not so sure it’s easy to teach a large community of people that pattern; it’d be interesting to try :)
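A minimal sketch of that frontend-to-backend hop: the UserService interface stands in for hypothetical thrift-generated code, and the host and port are made up.

    # WSGI frontend calling a daemonized python backend over thrift.
    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from userservice import UserService   # hypothetical thrift-generated module

    def fetch_user(user_id):
        socket = TSocket.TSocket("backend.internal", 9090)
        transport = TTransport.TBufferedTransport(socket)
        protocol = TBinaryProtocol.TBinaryProtocol(transport)
        client = UserService.Client(protocol)
        transport.open()
        try:
            return client.get_user(user_id)
        finally:
            transport.close()

    def application(environ, start_response):
        # trivial WSGI handler; a real frontend would parse the request properly
        user = fetch_user(42)   # hard-coded id, just for the sketch
        start_response("200 OK", [("Content-Type", "text/plain")])
        return [str(user)]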
As for the more boring choice…PHP frontends with java and/or C/C++ backends with REST in the middle seems easier to teach and evangelize, and it’s also easier to patch up bad apps by sticking custom caching stuff (and, shudder, mod_rewrite) in the middle.
Messaging
If there’s anything obvious in today’s web architecture it is that deferred processing is absolutely key to low-latency user experiences.
The obvious way to do asynchronous work is by pushing jobs on queues. One hard choice at the moment is what messaging stack to use. Obvious contenders include:
- Websphere MQ (the expensive incumbent)
- ActiveMQ (the best-known open source option, though with stability issues)
- OpenAMQ (AMQP backed by interesting startup)
- 0MQ (AMQP bought up by same startup)
- RabbitMQ (AMQP by another startup; erlang yuck)
- MRG (or QPid, AMQP by red hat which is not exactly a startup).
A less obvious way to do asynchronous work is through a job architecture such as gearman, app engine cron or quartz, where the queue is not explicit but rather exists as a “pending connections” set of work.
I’m not sure what I would pick right now. I’d probably still stay safe and use AMQ with JMS and/or STOMP with JMS semantics. 2 months from now I might choose differently.
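Whatever broker wins, the application-side pattern is the same: the request path enqueues a small job description and returns immediately, and a pool of worker daemons consumes the queue. A minimal sketch over AMQP, using the pika client purely as an example (the queue name and payload are made up):

    # Defer slow work by publishing a persistent job message to an AMQP queue.
    import json
    import pika

    def enqueue_resize_job(image_id):
        conn = pika.BlockingConnection(pika.ConnectionParameters(host="mq.internal"))
        channel = conn.channel()
        channel.queue_declare(queue="image_resize", durable=True)
        channel.basic_publish(
            exchange="",
            routing_key="image_resize",
            body=json.dumps({"image_id": image_id}),
            properties=pika.BasicProperties(delivery_mode=2),  # persistent message
        )
        conn.close()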
Interesting stuff, but clearly from the perspective of a developer :)
There’s a big gap between the likes of $big_business (who buy Oracle for 1000 cores without blinking, and look at open-source solutions like they just crawled out of a sewer), and the young upstarts who have even more traffic but insufficient revenue to just pour a load of bucks into piles of scary hardware and software to make it work off the peg, so have to make do with what they can scrape together.
I’m not sure where between them you sit as one of the top 20 websites; logic dictates that you can produce a solution without relying on commercial software or specialised hardware, but at the potential cost of sitting monkeys at typewriters until it’s too late, and with poorer resilience options when it comes to things like database failover (or just failure).
I’d hypothesize that it’s determined by those annoying external business requirements; how fast can this be made to work, how much will it cost to do it ourselves vs. getting someone else to do $x, how many 9s of uptime are required, how much traffic can we swallow when someone flies a plane into St. Paul’s Cathedral without falling over. Can you afford to be offline for an hour because of an avoidable failure? (although you’re possibly in the enviable position of not actually losing revenue as a result of it :P)
If you have the luxury of a CDN, then the likes of varnish et al. for caching of static objects may be a pointless optimisation.
Do you need a front-end layer? If you use Java app servers, application (i.e. L7) load balancers directly in front of them will obviate the need for a proxy layer (can you realistically cache your dynamic content anyway?)
Get someone else to do the hardware :)
Lots of good points!
Yes, it’s all very developer-centric…that’s not just because of me being a java weenie (though obviously it is part of the problem)…it’s also implicitly part of the problem statement I came up with – you didn’t think 100s of developers implied similar amounts of operations people, did you? :). Given a lot of typewriters and a lot of monkeys, a large part of the challenge is keeping the monkey poo from clogging up the system. More on that later…
For those developers that are slightly better than monkeys, I tend to subscribe to the general flickr/facebook idea where you try and train developers enough to not make a complete mess, so that you can make do with a relatively small operations team who spend most of their time automating stuff (which then boils down to writing crap perl scripts which lack decent error handling?).
Caching dynamic content…hell yes, I’d say, the cost difference between 60ms and 60s latency is huge. Static content and CDNs…if you have kickass load balancers and multiple data centers (say east coast, west coast, europe), is it still worth the hassle of serving CSS, JS, JPG from a CDN? What CDN would you pick for such a purpose?
Oh, and yes you definitely need a frontend. Not for tech reasons but for culture reasons – I think the PHP monkeys are better at listening to the designers than are the java weenies (who listen to no-one, really…). The question is whether you need a backend…can you find enough good engineers willing to code PHP?
With hardware, having two of everything is one solution, but from experience it’s harder to get right than it seems. A good host will have lots of spare servers to replace yours with if it dies. (Pop out RAID disks, and plug that brain into a fresh box.) If you can have them agree to a 1 hour replacement SLA, that might be much better than straight co-loc.
Varnish is definitely worth having even if you have a CDN. Varnish can assemble dynamic pages from pre-cached fragments with EdgeSideIncludes (and Nginx can do something similar with SSI). The key reason to use a CDN is to conserve your main server’s bandwidth. Getting a 1Gbps uplink is MUCH more expensive than a 10Mbps one. (An actual uplink, not just a server with a GB network card.)
Add cloud hosting: the ability to start small and scale to big loads with Amazon EC2, Microsoft Azure, etc.