Building scalable systems is a deliberate, purposeful and complex undertaking. Architected correctly, following certain design principles, a system will scale (handle a high level of concurrency and throughput), albeit with some effort. On the other hand, systems which were not designed to scale often end up being expensive and unreliable when stressed. So unless you are building a small departmental application, it’s often best to sow the seed of scalability right into the core technologies, the base architecture of your application.
So how do you scale?
1. Identify a few design principles and guidelines before you start architecting. Google made the assumption that no single piece of the application could be deemed 100% reliable. What you think is bullet-proof will eventually fail. Knowing this forces the architect to handle failures gracefully and build fall-back systems. This principles applies to software, hardware and network components.
2. Avoid single points of failures. Application serves are load-balanced. Load balancers and switches are paired. Databases are setup using replication/pooling and/or partioning. Should any given machine fail, the traffic is automatically rerouted to other components.
3. Build your system as a set of discreet components. Each components hides its complexity (and error handling!) to the rest of the system. A component can run on all or some machines. For example, background jobs can be handled by a thread pool server(s) instead of bogging down an application server used to serve user requests. Please note that using property and descriptor files, a single application can be instantiated in a number of ways declaratively so the developer does not have to build several applications.
4. Automate your unit and load testing.
5. Build introspection capabilities into your production system. The worst thing about a bug is not being able to understand it. Threshold based introspection capabilities must be built into systems so a developer can understand what is happening, who is using the system, what resources are being used, etc…