This article is part of an ongoing series. See the blog entry Delivering Your Enterprise: The ilities for an overall discussion.

Today, girls and boys, we’ll talk about Availability, Reliability, and Scalability. Is this simply a way to keep your boss and your customers from getting upset with you? Sure, but there’s a lot more.

  • Availability – Is the system up and running when the users expect it to be available? Do you have to take the system down for maintenance? Will your CEO be able to do a successful demo at the busiest part of the day? If part of your system has a hardware failure, what happens to users actively using the system?
  • Reliability – Does the system perform badly or crash, and if so, does it go dark for all users? Have your users forgotten the last time they saw the system exhibit any sort of problem?
  • Scalability – What if a new contract is signed today and the number of users doubles tomorrow – can your system handle the new load? How long does it take you to increase capacity?

These are just a few of the questions you and your team should be asking about your enterprise systems. While robustness and disaster recovery are related to these topics, we’ll talk about them in a separate posting.

While we’re not trying to supplant a detailed book on the subject or best practices gleaned from the Web, these questions really boil down to some simple techniques:

  • Avoid single points of failure – provide redundant mechanisms to store your database, serve web requests, etc.
  • Distribute computing requests over multiple processing units – sending a database query to a cluster of servers on real or virtual machines is generally cheaper than configuring one large server, provides redundancy, and allows easy configuration of additional capacity.
  • Use flexible configuration techniques to arrange your system – use hardware or software (e.g., IIS or Apache HTTP Server) load balancers to distribute load, so that processing units can be added or removed as demand or maintenance requires.
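To make the last two techniques concrete, here is a minimal Python sketch of a round-robin balancer that skips failed nodes and lets you add or remove capacity on the fly. It is a toy illustration only, not how IIS, Apache HTTP Server, or any hardware balancer is implemented; the class name `RoundRobinBalancer` and the node names are hypothetical.

```python
class RoundRobinBalancer:
    """Toy round-robin balancer that skips nodes marked as down (hypothetical)."""

    def __init__(self, nodes):
        self.nodes = list(nodes)        # pool of processing units
        self.healthy = set(self.nodes)  # nodes currently accepting requests
        self._next = 0                  # index of the next node to try

    def add(self, node):
        """Add capacity without stopping the service."""
        self.nodes.append(node)
        self.healthy.add(node)

    def remove(self, node):
        """Take a node out of the pool, e.g. for maintenance."""
        self.nodes.remove(node)
        self.healthy.discard(node)

    def mark_down(self, node):
        """Record a failed health check; the node stays in the pool."""
        self.healthy.discard(node)

    def mark_up(self, node):
        """Record a recovered node."""
        if node in self.nodes:
            self.healthy.add(node)

    def pick(self):
        """Return the next healthy node, trying each node at most once."""
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next % len(self.nodes)]
            self._next = (self._next + 1) % len(self.nodes)
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes available")


balancer = RoundRobinBalancer(["app1", "app2", "app3"])
balancer.mark_down("app2")   # simulate a hardware failure
balancer.add("app4")         # add capacity while the system is running
targets = [balancer.pick() for _ in range(6)]  # app2 never appears here
```

Requests keep flowing as long as one node is healthy, which is the point of avoiding a single point of failure. Note that the balancer itself would be a single point of failure unless it is also deployed redundantly.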

From the outside looking in, the black box delivering your enterprise systems will not look any different, whether it’s a single server running both Tomcat and your database, or tens of servers with different roles. The difference will be in how that system responds to load and component failure.

If we open a black box that provides a fault-tolerant, high-availability view into your enterprise, what would we find? Actually, there is a surprising number of technologies and components deployed across multiple clusters of processors. Let’s take a number of examples from a typical industrial-strength Liferay portal deployment:

  • Authentication and authorization – Organizations often use components and technologies like Active Directory, LDAP, Tivoli, SiteMinder, or SAML to store information about users, authentication information (e.g., username and password), and authorization (what they can access). Using a centralized mechanism simplifies administration, and also makes it possible to implement Single Sign On (SSO) within the enterprise or across enterprises (Identity Federation). These systems are typically robust and scalable, but you need to make sure they work for you.
  • Web page requests – Web page requests are distributed for processing by hardware load balancers or via software like IIS or Apache HTTP Server. The software approach can also optimize access to static resources like images or icons that rarely change.
  • Liferay Portal tier-based processing – As a Java application, Liferay benefits from industry best practices for availability, reliability, and scalability. In addition, Liferay allows tier-based distribution of processing, which means the processing can be passed to clusters of dedicated systems. For more details on getting started, see Liferay Clustering.
    • Application server – Liferay runs on standard application servers like Tomcat and JBoss, so it benefits from distributing load across clustered application servers. This can be accomplished through hardware or software load balancing.
    • Shared object caching – To ensure that when node A creates an object, node B (and others) also see that object, Liferay uses shared object caching across the application server cluster. By default Ehcache is used, but Terracotta can also be configured.
    • Search engine – Searching, as well as maintaining the index used for searching, can be an expensive proposition, so it makes sense to offload this processing to a separate cluster or service. Liferay makes it fairly easy to use Solr, a search server built on Lucene that can be clustered, to handle this processing. Other packages like Elasticsearch can be used as well (for more information on using Elasticsearch, see these blog posts).
    • Shared storage – If you have uploaded documents or images that are not managed as web content (it’s rare to not have any), you need to place these items in a shared location across all application servers in the cluster. Management of this shared location is best handled by NAS. NAS can improve redundancy through RAID and clustering.
    • Document repositories – While Liferay provides a fast embedded document manager, your enterprise may choose another system. Liferay can access multiple repositories, including SharePoint, FileNet, and Documentum, using the CMIS protocol.
  • Database – This is often the first stop for enterprises concerned with availability and reliability. Most databases can be configured as a cluster of servers for increased capacity and responsiveness. In addition, databases can perform active or passive replication to protect your data in case of downtime or catastrophe.
  • Legacy system access – Existing enterprise capabilities and applications can be accessed via web services using SOAP or REST, or via web browser iframes. These services may have availability or scalability issues that your organization is unwilling or unable to address. This weakest link will have a major impact on what you can do with the rest of the system.
  • Ajax – Ajax certainly improves the ability of an enterprise to scale systems by offloading expensive display processing to browsers and smart devices. It’s vital you consider these devices an integral part of your system.
  • Web traffic encryption – SSL encryption can be handled at several places in your system, but the placement affects ease of configuration, processing load, and security. For instance, if the SSL certificates are handled by the load balancer when the web page request enters your system, the requests are converted to straight HTTP for the application servers. While this requires less processing on the application servers (no encryption), traffic behind the load balancer is unencrypted, which has security implications.
  • Other services – Your system may depend on other services such as payment processing, GIS, CRM, ETL, address verification, etc. The reliability and scalability of these services may determine whether you can meet your goals, or if you need to start looking for a different provider.
  • Monitoring – You must be able to monitor your system effectively, especially as you introduce complexity to achieve availability and reliability goals. Organizations often start out with separate tools for each subsystem (database, application servers, etc.) and move to more comprehensive tools as complexity increases. Recognize that this monitoring has a cost in performance, scalability, licensing, training, and management, and that you can’t avoid that cost.
  • Analytics – The good news is that analyzing traffic on your web sites is much easier today with systems like WebTrends and Google Analytics. These systems have been highly tuned for common web browsers and make approaches like A/B testing feasible. Remember, however, that even though the code is loaded into the user’s web browser, it still has an impact on the availability and reliability of your overall system.
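One way to reason about how these pieces combine is a back-of-the-envelope availability estimate. The Python sketch below assumes failures are independent, that a redundant cluster is up as long as one member is up, and that a request needs every tier in the chain. The per-component figures are made up for illustration and are not measurements of any real product.

```python
from math import prod

HOURS_PER_YEAR = 365 * 24


def redundant(availability, copies):
    """Availability of `copies` independent replicas when one survivor is enough."""
    return 1 - (1 - availability) ** copies


def serial(*availabilities):
    """Availability of a chain in which every link must be up."""
    return prod(availabilities)


def downtime_hours_per_year(availability):
    """Expected hours of downtime per year for a given availability."""
    return (1 - availability) * HOURS_PER_YEAR


# Illustrative (assumed) numbers for each tier of the black box.
load_balancers = redundant(0.999, copies=2)
app_servers = redundant(0.99, copies=3)
database = redundant(0.995, copies=2)
legacy_service = 0.98  # a single legacy system you cannot cluster

system = serial(load_balancers, app_servers, database, legacy_service)
print(f"end-to-end availability: {system:.4f}")
print(f"expected downtime: {downtime_hours_per_year(system):.0f} hours/year")
```

Under these assumptions the chain can never be more available than its least available link, so the unclustered legacy service dominates the result no matter how much redundancy the other tiers have. Real failures are often correlated (shared power, network, or storage), so treat numbers like these as an optimistic upper bound.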

If you’ve made it this far, you may be thinking “Interesting list, but so what?” The point is that delivering your enterprise with availability, reliability, and scalability is complex. There are a lot of technologies, a lot of connections between the pieces of technology, and a lot of people involved. And since it’s not going to happen overnight, be ready to cope with continual change. But unless your business is quite small, you have no choice but to plan these changes for your enterprise.