Having spent a week playing with various features of XenServer (with a focus on automation), I’ve only scratched the surface of what it can do. I have a sense, though, for the power of the system, and what it would take to get it to work at my company. It has all the features that we need to start, some nice-to-have features that we’ll probably use, and then extras beyond that which we probably won’t play with for the foreseeable future.
Our must-have features are pretty simple:
- Create and manage long-running Linux VMs
- Have pretty good disk & network performance
- Have a strong CLI environment
- Have a strong API for automating as much as possible (especially the day-to-day tasks like spinning up a new VM)
- Have a good (and preferably long) history of being run in production at scale
All of these requirements are met by XenServer. In our environment, everything that needs particularly fast disks or network is running on dedicated hardware, so our performance requirements aren’t terribly restrictive. XenServer does add a layer of latency when accessing resources (disk/network), but it isn’t much (especially when using the paravirtualization drivers). We’ll probably be using a RAID10 array with lots of disks, which will help with disk speeds, too. Using the current standard toolstack (XAPI), we get a powerful (but not terribly user-friendly) CLI, and the same is true for the XML-RPC API. The architecture that must be understood to work with Xen is complex, but once you know all the acronyms I don’t find it hard to navigate. The Windows-only management GUI is pretty slick, but we probably wouldn’t use it. We might look at an alternative that runs on Linux. As for having a history of production use, what I look for is successful use by large organizations for important parts of their infrastructure. That kind of use puts security, performance, and reliability through their paces, and problems get fixed. Some of the largest clouds (AWS & SoftLayer, for instance) run on Xen or XenServer, which gives me a lot of confidence.
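To give a feel for the XML-RPC API mentioned above, here is a minimal Python sketch that logs in and lists VM names. It assumes the XAPI convention, as I understand it, that every call returns a struct with a `Status` field plus either `Value` or `ErrorDescription`. The helper names (`unwrap`, `list_vm_names`, `XapiError`) are my own, and the two-argument form of `session.login_with_password` is an assumption that may not hold on every XenServer version.

```python
class XapiError(Exception):
    """Raised when a XAPI call reports Status != Success."""


def unwrap(response):
    # XAPI wraps results as {"Status": "Success", "Value": ...}
    # or {"Status": "Failure", "ErrorDescription": [...]}.
    if response.get("Status") == "Success":
        return response.get("Value")
    raise XapiError(response.get("ErrorDescription", response))


def list_vm_names(proxy, user, password):
    """Return sorted names of real VMs (no templates, no control domain)."""
    session = unwrap(proxy.session.login_with_password(user, password))
    try:
        records = unwrap(proxy.VM.get_all_records(session))
        return sorted(
            rec["name_label"]
            for rec in records.values()
            if not rec.get("is_a_template") and not rec.get("is_control_domain")
        )
    finally:
        # Always release the session, even if the listing fails.
        proxy.session.logout(session)
```

Against a real host, `proxy` would be something like `xmlrpc.client.ServerProxy("https://xenserver.example")` (a hypothetical hostname); passing the proxy in as a parameter also makes the function easy to exercise with a stub.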
There are a number of features that we’ll probably build into our workflow, but that aren’t really requirements for picking a solution:
- Live (online) migration of VMs
- High availability of VMs
- Snapshotting VMs
- Copy-on-Write Clones
High availability requires shared storage (SAN, NFS, etc.), which brings its own availability and performance concerns. Most of our services are inherently distributed and can handle a server going down (without degrading the overall performance of the service), so this isn’t a huge deal for us. The related “live migration” feature, however, lets you move a running VM from one hypervisor to another, and it doesn’t require shared storage. I could see this being very useful in the event that we need to do maintenance on a hypervisor or storage array. Snapshotting VMs and Copy-on-Write clones work together to give us quick and cheap new VMs.
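As a rough sketch of the "quick and cheap new VMs" idea, the function below looks up a source VM by name and asks XAPI to clone it. `VM.get_by_name_label` and `VM.clone` are XAPI calls; everything else (the function name, the `unwrap` helper, the error handling) is illustrative. My understanding is that `VM.clone` expects a halted VM or template and only uses copy-on-write where the storage backend supports it, so treat both points as assumptions.

```python
def unwrap(response):
    # XAPI results arrive as {"Status": ..., "Value": ...} structs.
    if response.get("Status") != "Success":
        raise RuntimeError(response.get("ErrorDescription", response))
    return response["Value"]


def clone_vm(proxy, session, source_name, new_name):
    """Clone the VM or template called source_name; return the new VM ref."""
    refs = unwrap(proxy.VM.get_by_name_label(session, source_name))
    if not refs:
        raise LookupError("no VM or template named %r" % source_name)
    if len(refs) > 1:
        # Name labels are not unique in XAPI, so refuse to guess.
        raise LookupError("%d VMs share the name %r" % (len(refs), source_name))
    return unwrap(proxy.VM.clone(session, refs[0], new_name))
```

Here `session` is the opaque reference returned by a successful login, and `proxy` is an XML-RPC proxy for the pool master.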
XenServer would fit our needs just fine, and at this point I would not have a problem choosing it. My biggest concerns are the complex architecture that must be understood (a barrier to users who aren’t familiar with it), and the unfriendly XML-RPC API (for instance, at the CLI lots of arguments are optional; at the API, every argument must be specified). Both of these problems can be mitigated with good tooling.
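One form that tooling could take is a thin wrapper that fills in defaults, so callers only spell out what differs. The sketch below is hypothetical: the field names are a small subset of the VM record as I remember it, not the complete set a real `VM.create` call needs, and the default values are placeholders. Integers are converted to strings because, as far as I know, XAPI passes 64-bit integers as strings over XML-RPC.

```python
GIB = 1024 ** 3

# Illustrative defaults only; a real VM record has many more fields.
DEFAULT_VM_FIELDS = {
    "name_description": "",
    "VCPUs_max": "1",
    "VCPUs_at_startup": "1",
    "memory_static_max": str(GIB),
    "memory_dynamic_max": str(GIB),
    "memory_dynamic_min": str(GIB),
}


def build_vm_record(name_label, **overrides):
    """Return a full record dict, applying overrides on top of defaults."""
    unknown = sorted(set(overrides) - set(DEFAULT_VM_FIELDS))
    if unknown:
        # Fail early on typos instead of sending them to the server.
        raise KeyError("unknown VM fields: %s" % ", ".join(unknown))
    record = dict(DEFAULT_VM_FIELDS)
    record["name_label"] = name_label
    record.update({key: str(value) for key, value in overrides.items()})
    return record
```

For example, `build_vm_record("web1", VCPUs_max=4, VCPUs_at_startup=4)` changes only the CPU counts and leaves the memory fields at their defaults.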
Up next, I’ll look into a KVM solution as an alternative, and put it through the same steps I’ve explored with XenServer.