One of the hot new growth areas in data centers is what I call “pooled infrastructure”. The data center is a pool of interchangeable servers that can be rented out at a moment’s notice. For example, a customer can rent 30 servers during peak usage and 10 servers during off-peak usage. Allowing customers to share a pool of servers greatly increases hardware utilization.
Along with the push towards renting out server capacity, infrastructure providers are bundling value-added software with their infrastructure to save customers time setting up and administering it.
Cloud and bandwagon hype
Nowadays, every technology company seems to be slapping the word cloud onto everything. I don’t like to use the word cloud because it is not specific and can be applied to a lot of things.
In the case of pooled infrastructure, there are many variations on the concept. A single server can be sliced up to handle many different customers. For example, a web hosting server might host 3,000 different websites. Or, a server can be sliced up into multiple Virtual Private Servers (VPSes). Both of these can technically be considered a form of “cloud” “infrastructure as a service”.
The current paradigm shift is the ability to offer customers a way to rapidly rent out additional servers as their workload fluctuates:
- If virtualization software is used, it can take as little as a few minutes to spin up a new server. This allows for very rapid response to workload fluctuations. This process can be automated via APIs and does not require human intervention.
- Some companies rent out “bare metal” servers without virtualization software. (Sometimes virtualization software is undesirable.) It can take up to a few hours for a customer to get a new server.
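The automated, API-driven scaling described above amounts to a simple control loop: measure utilization, then grow or shrink the fleet toward a target. Here is a toy sketch of such a policy — the 60% target and the server limits are hypothetical numbers of my own choosing, not any provider’s actual algorithm:

```python
import math

def desired_server_count(current_servers, avg_utilization,
                         target=0.60, min_servers=1, max_servers=100):
    """Toy autoscaling policy: size the fleet so that average utilization
    moves back toward the target. All thresholds are hypothetical."""
    if avg_utilization <= 0:
        return min_servers
    # Scale the fleet proportionally to how far utilization is from target.
    needed = math.ceil(current_servers * avg_utilization / target)
    return max(min_servers, min(max_servers, needed))

# A pool of 10 servers running hot at 90% grows to 15; a pool of 30
# servers idling at 30% shrinks to 15.
print(desired_server_count(10, 0.90))
print(desired_server_count(30, 0.30))
```

In a real deployment this decision function would be wired to the provider’s monitoring and provisioning APIs rather than called by hand.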
For most workloads, server utilization is uneven. A business-oriented workload will see most of its use during business hours, while a consumer-oriented workload like Netflix (which uses Amazon Web Services) will see most of its use outside business hours. One major efficiency gain comes from mixing complementary workloads across the server pool so that server capacity is utilized throughout the day. When most human beings are asleep, the infrastructure provider can schedule workloads that aren’t time-sensitive and can run at any time of day.
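The arithmetic behind this efficiency gain is easy to see with two anti-correlated demand curves. The hourly numbers below are made up for illustration; the point is that a shared pool only has to cover the combined peak, not the sum of the individual peaks:

```python
# Hourly server demand over a 24-hour day (hypothetical numbers).
business = [2] * 8 + [10] * 9 + [2] * 7    # peaks during business hours
consumer = [10] * 8 + [3] * 9 + [10] * 7   # peaks outside business hours

# Separate pools must each cover their own peak...
separate = max(business) + max(consumer)
# ...while a shared pool only covers the combined hourly peak.
pooled = max(b + c for b, c in zip(business, consumer))

print(separate)  # 20 servers
print(pooled)    # 13 servers
```

With these made-up curves, pooling the two workloads needs 13 servers instead of 20 — a 35% reduction in hardware for the same service.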
Seasonal fluctuations and one-time server needs
A service like Reddit might see a huge one-time traffic surge due to events like a Q&A session (or iAMA) with the President of the United States. The goal is to find many non-correlated workloads to keep the utilization of the pool high.
Around Christmas, Amazon will see its server demand increase due to holiday shopping. During the rest of the year, it will have excessive server capacity. The dream is to sell the excess capacity to other seasonal workloads such as tax filing.
This feature is marketed to clients under the buzzword elasticity.
Utilizing all types of hardware efficiently
A server has different resources such as CPU, memory, and disk space, and most workloads use one resource more heavily than the others. An online backup solution like Rackspace’s Jungle Disk or Dropbox (an Amazon client) will obviously use lots of disk space, while websites and online forums tend to use a lot of CPU and memory. By putting complementary workloads on a single server, that server’s resources are used more efficiently.
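Placing workloads this way is a bin-packing problem over resource vectors. A minimal sketch of a greedy first-fit placer (the server capacities and job demands are hypothetical, and real schedulers are far more sophisticated):

```python
def first_fit(workloads, capacity):
    """Greedy first-fit: place each (cpu, mem, disk) demand on the first
    server with room; open a new server otherwise. Returns per-server loads."""
    servers = []  # each entry is the summed (cpu, mem, disk) load on one server
    for w in workloads:
        for load in servers:
            if all(l + d <= c for l, d, c in zip(load, w, capacity)):
                for i, d in enumerate(w):
                    load[i] += d
                break
        else:
            servers.append(list(w))
    return servers

# A disk-heavy backup job and a CPU-heavy web app are complementary,
# so they pack onto a single server.
cap = (8, 32, 1000)                # cores, GB RAM, GB disk (hypothetical)
jobs = [(1, 4, 800), (6, 16, 50)]  # (backup job, web app)
print(len(first_fit(jobs, cap)))   # 1 server
```

Two disk-heavy jobs, by contrast, would exceed the 1,000 GB disk and force a second server — which is exactly why mixing workload types raises utilization.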
Some servers are more power-efficient than others. With pooled infrastructure, the infrastructure provider can try to allocate as much work as possible onto the most power-efficient computers first.
A small number of server applications can be put on slow CPUs designed for power efficiency (“microservers”). Intel, AMD, ARM, AMCC, and other semiconductor companies design such CPUs. I do not believe major pooled infrastructure companies offer such specialized hardware at the moment.
How the technology works
To make it all work, the infrastructure provider has to add layers of software on top of the underlying hardware. Not all of a data center’s hardware is the same. Virtualization software makes all of the underlying hardware appear the same to the workloads running on the server. This allows the server hardware to effectively be interchangeable. Automation software and control panels allow the clients to start/stop new servers.
The biggest infrastructure providers own private fiber optic networks that connect their data centers. These networks are much faster than the public Internet for various reasons. One of the interesting things that can be done with these networks is that systems can be duplicated onto other data centers for redundancy. Because each data center is itself a potential point of failure, replicating a software system onto other data centers can improve availability/uptime. Private networks have dramatically lower network latency, which makes this mirroring viable across a wider range of applications.
The problem software developers face is that it takes time to set up and configure infrastructure. Infrastructure companies can offer software to help automate these tasks, reducing the size of the IT staff that a software developer needs. I believe that a lot of value creation will happen in this area because it will save software developers time. For example, Amazon’s DynamoDB allows software companies to quickly set up a database that is replicated across 3 data centers. This is technically challenging because it requires the low latency of a private fiber optic network and ways around the “virtualization tax” (virtualization greatly increases networking latency). The benefit of such a database is that it can theoretically survive 2 data center failures. In practice, entire Amazon “regions” have gone down, simultaneously taking out multiple groups of data centers.
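The “survives 2 failures” claim is worth unpacking, because durability and availability are different guarantees. DynamoDB’s internals are not public, so this is only the general replica arithmetic, not Amazon’s actual design:

```python
def surviving_failures_durability(replicas):
    """The data itself survives as long as at least one replica remains."""
    return replicas - 1

def surviving_failures_quorum(replicas):
    """Majority-quorum reads/writes stay available only while a majority
    of replicas is reachable."""
    return (replicas - 1) // 2

# With 3 replicas: the data can survive 2 data center failures,
# but strongly consistent (quorum) operations tolerate only 1.
print(surviving_failures_durability(3))  # 2
print(surviving_failures_quorum(3))      # 1
```

So a 3-way replicated database can theoretically keep your data through 2 data center losses, even though it may not stay fully writable through both.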
Amazon also offers software for load balancing, both relational and noSQL databases, content delivery network (CDN) integration, and many other features.
For both the customer and the infrastructure provider, the economic calculations are complicated. Ironically, a number of new startups are helping AWS customers find ways of saving money.
The infrastructure provider wants to simplify its billing structures to make life easier for existing and potential clients trying to figure out how much money they can save. Simplified billing structures and metering mean that margins aren’t constant across customers. Where margins are too low, the infrastructure provider may accidentally be selling its infrastructure below its rate-of-return threshold. Charging too much will push away customers that would otherwise be highly profitable.
From the client’s perspective, the natural goal is to reduce costs. Clients have to understand the quality/performance of the various options presented to them so that they can figure out which options lower their total costs. Evaluating the options is complicated because:
- With pooled infrastructure, performance is difficult to measure. This is partly because of multi-tenancy: because many tenants share the same server, each tenant affects the performance of other tenants’ workloads (the “noisy neighbour” problem). There are also differences in how pooled infrastructure providers configure their storage, as well as performance differences in the underlying hardware. These factors make it more difficult to compare the performance/quality of the different options out there. Here is one example from the Scalyr blog.
- Infrastructure providers often also sell software that automates certain processes for the client. Because they effectively bundle the hardware with value-added software, it is difficult to compare their pricing versus unbundled hardware.
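The noisy-neighbour problem is one reason average benchmark numbers mislead: two providers can have identical mean latency while one has a much worse tail. The sample data below is fabricated purely to illustrate why tail latency must be measured alongside the mean:

```python
def summarize(latencies_ms):
    """Mean and worst-case latency of a sample of request timings."""
    return sum(latencies_ms) / len(latencies_ms), max(latencies_ms)

# Two hypothetical providers: same mean latency, very different tails.
steady = [10.0] * 99 + [12.0]    # consistent performance
noisy = [8.0] * 99 + [210.0]     # faster on average, but one tenant's
                                 # "noisy neighbour" spike ruins the tail

print(summarize(steady))
print(summarize(noisy))
```

Both samples average 10.02 ms, but the second has a 210 ms worst case — the kind of difference that only shows up when you benchmark percentiles on the actual shared hardware.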
Over time, the infrastructure providers will likely tinker with their business processes, billing structures, and software to make this process less complicated for the end users.
New billing structures will also likely open up the pooled infrastructure providers to new markets. Amazon for example pioneered a “spot” market for tasks that do not have to be run right away (e.g. analytics reports). Amazon auctions off excess capacity for low-priority uses. However, this capacity can be pulled from the customer on a very short notice. This breaks up the market into high-priority and low-priority workloads. If for some reason Amazon runs short on capacity, it can defer the low-priority workloads to a later time to avoid running out of capacity.
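The spot mechanism splits capacity exactly as described: high-priority on-demand work is served first, and leftover capacity is auctioned to the highest bidders, with the rest deferred. A toy sketch of that allocation (the prices and server counts are invented; Amazon’s real auction is more involved):

```python
def allocate(capacity, on_demand, spot_bids):
    """Fill capacity with high-priority on-demand work first, then auction
    the remaining capacity to spot bids, highest bid first. Returns the
    bids that run now; everything else is deferred."""
    free = capacity - on_demand
    running = []
    for bid_price, size in sorted(spot_bids, reverse=True):
        if size <= free:
            running.append((bid_price, size))
            free -= size
    return running

# 100 servers, 70 occupied by on-demand work.
# Spot bids are (price per server-hour, servers wanted) -- hypothetical.
bids = [(0.05, 40), (0.12, 20), (0.08, 10)]
print(allocate(100, 70, bids))  # the 40-server bid waits for a later slot
```

If on-demand usage spikes to 90 servers, only the small 10-server bid still fits — which is the sense in which spot capacity “can be pulled from the customer on very short notice.”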
Because pooled infrastructure companies need to charge for their excess capacity, their services are priced too high for customers who only need a baseline level of capacity. Customers can get around this by owning servers in a co-located data center and renting pooled infrastructure only for their peaks. There is some overhead to doing this, since network speeds between data centers will not be quite as fast as those within a data center. It therefore makes sense for the infrastructure company to offer some form of pricing that resembles the economics of buying and owning servers.
Amazon’s Reserved Instances pricing allows customers to pay an upfront fee to reserve capacity for 1 or 3 years. In theory, such pricing structures could mimic the economics of buying a server. In practice, Amazon’s Reserved Instances are far more expensive than buying servers and colocating them. In the future this gap may narrow if Amazon prices more aggressively and/or finds ways to squeeze more economies of scale and efficiency out of its data centers. I suspect that the reason why Amazon’s current pricing is so high is that Amazon cannot build data centers fast enough. Amazon presumably makes very high margins from renting out its servers and selling value-added software (that must be bundled with infrastructure). Its Reserved Instances may not be priced sensibly because Amazon wants to allocate its capacity to higher-margin revenue streams.
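The comparison comes down to simple 3-year total-cost arithmetic. Every dollar figure below is a made-up round number for a mid-range server, not a real AWS or colocation quote, but the structure of the calculation is the point:

```python
HOURS_PER_YEAR = 24 * 365

def three_year_on_demand(hourly_rate):
    """Pay only the metered hourly rate, 24/7 for 3 years."""
    return hourly_rate * HOURS_PER_YEAR * 3

def three_year_reserved(upfront_fee, hourly_rate):
    """Pay an upfront reservation fee plus a discounted hourly rate."""
    return upfront_fee + hourly_rate * HOURS_PER_YEAR * 3

def three_year_colo(server_price, monthly_colo_fee):
    """Buy the server outright and pay monthly colocation fees."""
    return server_price + monthly_colo_fee * 12 * 3

# Hypothetical prices: $0.50/hr on-demand; $2,000 down + $0.20/hr reserved;
# $2,500 server + $75/month colocation.
print(three_year_on_demand(0.50))       # 13140.0
print(three_year_reserved(2000, 0.20))  # 7256.0
print(three_year_colo(2500, 75))        # 5200
```

With these assumed numbers a 24/7 workload costs roughly $13,100 on-demand, $7,300 reserved, and $5,200 colocated over 3 years — consistent with the observation that reserved pricing still doesn’t match owning the hardware. (Colocation does, of course, add staffing and maintenance costs not modeled here.)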
In practice, many of the major pooled infrastructure providers have had major outages for various reasons (e.g. software memory leaks / bad programming, misconfigurations, etc.).
Virtualization software adds a point of failure that may be undesirable for applications where extreme reliability is desired. Amazon, for example, had to reboot some of its servers to patch a security hole in the Xen hypervisor.
Virtualization can hurt performance. Amazon had to put a lot of R&D into solving the problems that virtualization caused for network latency (see James Hamilton’s presentation slides).
Some customers want an extremely high level of reliability, security, and/or performance. Others have unique compliance needs. These customers often need a solution that is specifically designed for their unusual needs.
Some workloads are more cost-effective when they “scale up” by placing the workload on a more powerful server rather than “scaling out” across a larger number of servers. The two approaches are largely different markets. However, there is some overlap because there are some workloads that could choose either the “scale up” approach or the “scale out” approach.
One of the concerns with pooled infrastructure is that there is a degree of vendor lock-in. It is more difficult for clients to leave because they have to design their software to work well with the hardware choices available to them. As well, the software has to work well with the layer of proprietary software on top of that hardware. With traditional colocation, it is easier for clients to leave. One potential solution to vendor lock-in is through open source hardware designs and open source software.
To date, many of the pooled infrastructure solutions have had availability problems. Amazon Web Services and its competitors have had numerous major outages. Some of these outages are related to software bugs. The conflict here is between putting resources into better availability or into creating more value-added software.
With current pricing structures, pooled infrastructure can be far more expensive than alternative solutions. For workloads that run 24/7, pricing is not competitive with buying hardware for colocation.
Facebook, Rackspace, and a number of other companies are trying to open source the design of their data centers, the hardware inside them, and the layer of software on top of the underlying hardware. From a business perspective, they are giving away their research and development for free in the hopes that other companies will reciprocate and contribute R&D towards their open source projects. The obvious downside with giving away your R&D is that you can’t make money from selling it.
For Facebook, the rationale for open source is clear. Data centers are a big expense for Facebook and Facebook wants to drive its expenses down. Because Facebook essentially sells software (monetized through advertising), commoditizing the data center does not lessen their competitive advantage (software). Facebook will make money on its software rather than data center R&D.
For Rackspace, the situation is more complicated. Lowering their data center expenses will help them compete against Google and Amazon. However, it will also help Rackspace competitors. Rackspace will try to differentiate themselves based on service and support. A comparison to Red Hat (RHT) is likely appropriate here. Red Hat’s software (server operating systems) is open source. Customers can (mostly) get Red Hat software for free without having to pay money for it. However, Red Hat is able to make money through services and support. To some degree Red Hat is selling its reputation. Clients who purchase IT services want to know that their vendor won’t screw up and will stand behind what they sell. Reliability and availability can be far more important to customers than saving small amounts of money. Rackspace will presumably try to do something similar to Red Hat.
Downsides to Amazon Web Services
AWS: the good, the bad and the ugly – This blog post talks about Awe.sm’s decision to leave AWS. (They would use AWS again.)
Moz (formerly SEOMoz.com)
Moz’s 2013 Year in Review – This blog post from Moz’s CEO talks briefly about their decision to take most of their workload off AWS. Doing so reduced their costs by more than half.
A YouTube video tour of Google’s data center – This is the only technology link here that is in layman’s terms.
Reddit IAMA with a former Amazon Web Services engineer.
Warehouse Scale Computing – This book takes a deep dive into cutting-edge data center design practices and the connections between data center design and software development.
The website of Amazon’s data center guru James Hamilton – There are many papers on Amazon’s data center technology.