Is Windows Clustering Virtually a Thing of the Past?
Written by Charles Roberts   
Wednesday, 19 November 2008

Introduction

Way back in the halcyon days of Windows NT4, Microsoft introduced both Clustering and Load Balancing technologies to the unsuspecting Windows engineer, and a whole new range of technology and terminology was born.

So here we are, a decade or so and several editions of Windows Server down the line. Clustering has matured and is widely used in all areas of our industry, but despite its wide use and years of refinement, is it all it’s cracked up to be? Is it really the silver bullet to keep your applications up and running? Or is Windows clustering virtually a thing of the past?

The History of Clustering

Microsoft released its first clustering technology in 1995 in the Windows NT4 Enterprise Resource Kit. Codenamed “Wolfpack”, it was developed along with its Load Balancing counterpart (codenamed “Convoy”) in conjunction with Tandem Computers and Digital, with the aim of allowing two Windows NT4 Enterprise Servers to share resources and automatically fail over in the event of a terminal issue.

While clustering had been around for many years, on Unix and VAX systems in particular, the rapid growth in Windows server usage had created a demand for high availability for which there simply wasn’t a solution. After the loud introductory fanfares and press releases subsided, it was plain to those in the industry that Microsoft Cluster Services wasn’t going to be the solution either. In those early days Windows Clustering was, to say the least, bleeding edge, and although easy to implement it was hideously unreliable, not helped by Windows NT4’s penchant for blue screening or spuriously rebooting. It’s fair to say that a large portion of the blame also lies with the hardware available at the time: SCSI switching between nodes and elementary fibre arrays did little to help the brave new world come to fruition.

But come to fruition it did. Windows 2000 Advanced Server saw a reworked and patched clustering solution that finally gave the enterprise community a native Microsoft clustering solution that could be relied upon. Further advances in Windows 2003, both in the underlying stability of the OS itself and in the clustering technology, produced something that could be regarded as a run-of-the-mill, out-of-the-tin solution for high availability and resilience. Additional solutions from other vendors such as Veritas and EMC added layers of functionality, such as stretch (geographically dispersed) clusters, to the list of options available to the systems architect.

Why Cluster?

Typically the demand for clustering stems from the need, or perceived need, for a business to have its services available for as much of the time as possible. Clusters are therefore typically implemented to improve reliability and resilience, to achieve high or 100% uptime, to share application load in an active/active configuration, or to provide an elegant BCP/DR solution. Fast failover and failback are assumed to come as standard.

Let’s look at these requirements and how they really apply to the real world of clustering:

100% uptime, better reliability and resilience. The first thing to remember when either faced with or quoting uptime figures is that they are largely subjective and rarely accurate or useful. This is almost invariably because they are measured at different levels by different people.

To give some examples: an application runs on a cluster and is accessed every second by the user base. The cluster has a failover time of three seconds. The IT operator fails the cluster over from the A node to the B node for maintenance, everything works perfectly, and the application now runs on the B node. From the IT operator’s point of view there has been no outage; the cluster is performing normally and well. From the application user’s point of view, however, there has just been a three second outage. Conversely, suppose the A node fails and the application fails over to the B node. The users again experience their three second outage, but the A node takes three days to repair and put back into action, so the IT operator experiences a far greater outage.

The fact of the matter is that uptime and outage times mean one thing to one person and something else to another. Different levels of outage affect different users to a greater or lesser extent. If a user has spent half an hour inputting data that is lost when a cluster successfully fails over in half a millisecond, then our uptime figures will still be good, yet we should be more concerned with failure impact than with outage. If a cluster takes half an hour to fail over and there is no impact, is this an issue? Probably not, but the uptime figures will look dire. Uptime figures are therefore the first things to fail when trying to ascertain the level of failure of a system.
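To make the point concrete, here is a minimal sketch of how the same node failure produces two very different availability figures depending on who is measuring. The 30-day reporting window and the function name are assumptions made for illustration; only the three second failover and the three day repair come from the example above.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def availability(period_seconds: float, outage_seconds: float) -> float:
    """Return the fraction of the period during which the service was up."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    if not 0 <= outage_seconds <= period_seconds:
        raise ValueError("outage_seconds must lie within the period")
    return (period_seconds - outage_seconds) / period_seconds


# Assumed reporting window: a 30-day month (not stated in the article).
month = 30 * SECONDS_PER_DAY

# The user's view of the node A failure: a three second failover.
user_view = availability(month, 3)

# The operator's view of the same event: node A is out for three days.
operator_view = availability(month, 3 * SECONDS_PER_DAY)

print(f"user view:     {user_view:.6%}")
print(f"operator view: {operator_view:.2%}")
```

Neither figure says anything about impact: the user who lost half an hour of typing during a very short failover would still show up as an excellent availability number.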

Reliability and resilience. An area in which the Windows cluster will obviously excel. Really? Well, maybe not. If you’ve had experience of modern servers then you’ll have noticed that they are incredibly reliable, and have been for years. Hot plug disks, power supplies and memory all mean that a fault has to be pretty catastrophic these days to cause terminal hardware failure. A recent report on a large, and unnameable, server base of approximately 15,000 servers revealed an average of 65 hardware failures a month; of these, all but 9 were fixed without outage. It’s fairly plain, then, that whereas hardware failure is on everybody’s mind, the reality is far better than is perceived.

Operating system outage, of course, is more frequent these days, but should be renamed “planned OS outage”, as it is largely due to security patching, software calls and the like. Things have come a long way since Windows NT4, and while it’s still fashionable and fun to knock Windows, it’s actually a very stable platform these days. Where outages do occur they are largely in the application space, either with the application itself failing or with the application interacting badly with Windows. Given this fact, the application will fail whether it is clustered or on a standalone server, and no amount of money spent on cluster solutions will resolve the issue. For reliability, then, the focus of the organisation should be on ensuring applications are well coded, tested and as reliable as possible. In this way high reliability levels will be attained without the need for clustering.
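The figures from that report can be turned into rates with some simple arithmetic. The sketch below assumes the monthly average holds for a full year and that failures are spread evenly across the server base; neither assumption is stated in the report, and the function name is made up for illustration.

```python
def hardware_outage_stats(servers: int, failures_per_month: int,
                          failures_with_outage: int) -> dict:
    """Derive simple rates from monthly hardware failure counts."""
    fixed_without_outage = failures_per_month - failures_with_outage
    return {
        # Share of hardware failures repaired with no service outage.
        "share_fixed_without_outage": fixed_without_outage / failures_per_month,
        # Outage-causing hardware failures per server per year,
        # assuming the monthly average holds for twelve months.
        "outages_per_server_year": failures_with_outage * 12 / servers,
    }


stats = hardware_outage_stats(servers=15_000, failures_per_month=65,
                              failures_with_outage=9)
print(f"fixed without outage: {stats['share_fixed_without_outage']:.1%}")
print(f"outage-causing failures per server per year: "
      f"{stats['outages_per_server_year']:.2%}")
```

On those assumptions, roughly 86% of hardware failures cause no outage at all, and fewer than 1% of servers a year suffer an outage through hardware failure.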

DR/BCP using stretch clustering across sites is a commonly used technique to provide fast and efficient failover in times of application and site failure. In many ways it’s a solution that provides many benefits. Business owners like the idea as it seems that they’re using all the boxes all the time, rather than having a DR box sitting around doing nothing while they wait for an emergency to happen. Likewise application owners and DBAs like it because they have to do nothing to reconfigure their application, DNS or data in the event of issues. These benefits cannot be discounted in terms of simplifying the application, database and support, and of attaining consistent results on failover.

As with most things in life, there are some downsides. Firstly, the stretch clustering technologies commonly used are perhaps not among the most user friendly and solid systems available. Typically the time and effort saved by the application and DBA teams is transferred (and sometimes multiplied) to the Windows administrators. In addition, the issues faced when manually failing over applications tend to be fairly straightforward and well known, whereas the issues faced with stretch clustering tend to be complex and long lasting.

When we take into account that the main driver for DR/BCP is to maintain service regardless of failure, it would appear contrary to the basic concept to put a single point of failure into the equation. Yes, a cluster spreads your application across two sites, but what happens when the cluster itself has errors and drops offline with a costly, long lasting and difficult to fix issue? No matter how cleverly designed, every stretch cluster will have a single point of failure somewhere (or no synchronicity between nodes), whether that is the cluster itself or the storage. As long as the potential outage is known and you’re happy with that risk, then that’s fine. If you’re expecting 100% uptime and bullet-proof DR and BCP, then you need to look elsewhere.

While we’re on the subject of expectations, we should put to bed once and for all a common misconception about Windows clustering, so beloved by those in the business who hate the idea of a server sitting about doing nothing: Active/Active clustering. Some operating systems can do this very well; Windows is not one of them. This may come as a shock to many people, but there is no such thing as a Windows Active/Active cluster. An active/active cluster requires a system to run a single application, or instance thereof, sharing storage, state and so on across two or more nodes in real time. Windows cannot do this. Instead you can only run multiple instances in Active/Passive mode.

In a “standard” two node Windows cluster your application runs on Node A. Node B stays inactive until Node A has an issue and the application fails over to Node B: Active/Passive. What can be configured with a Windows cluster is that while Node A is running an application (and is ready to fail over to Node B), Node B runs a different application that can fail over to Node A if needed. Node A is active and Node B is passive for application 1; Node B is active and Node A is passive for application 2.

This supposedly has the benefit of saving loads of money, as all the servers are doing something and nothing sits idle. Well, not really. Yes, both nodes are busy day to day, but whereas with a true active/passive cluster each node needs to be sized to handle one application, as that’s all it has to do, with a multiple instance cluster each node needs to be sized to handle two applications. There is little point, if Node A is running at 75% utilisation, in failing over Node B’s application to it if doing so will push the utilisation to 100%. If you do this you lose both applications.

So to run multiple instance clusters you’ll typically need much larger servers, which will probably work out more expensive than your smaller individual solution (in tinware at least). N+1 means one box that’s able to cover your load, not one that’s already in use doing something else.
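The sizing argument can be checked with a few lines of arithmetic. This is only a sketch: the function name and the 80% utilisation ceiling are assumptions chosen for illustration, and real capacity planning would look at memory, I/O and network as well as a single utilisation figure.

```python
def can_absorb(own_load_pct: int, incoming_load_pct: int,
               ceiling_pct: int = 80) -> bool:
    """Return True if a node can take on a failed-over application.

    Loads are whole-number percentages of the node's capacity.
    ceiling_pct is an assumed safe maximum utilisation, not a figure
    taken from any vendor guidance.
    """
    return own_load_pct + incoming_load_pct <= ceiling_pct


# The case from the text: Node A already at 75%, and Node B's
# application would push it to 100%.
print(can_absorb(75, 25))   # False: both applications are at risk

# Nodes sized so that each normally runs at 35-40% can cover each other.
print(can_absorb(35, 40))   # True
```

Put another way, in a two node multiple instance cluster each node has to run at well under half its capacity day to day, which is exactly the "server doing nothing" the business was trying to avoid.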

So why cluster if there are so many issues and pitfalls and it doesn’t really do what everyone expects (especially “the business”)? Well, the answer is simple: despite its failings it is still a useful and relatively easy to implement solution that can, in the right conditions, give you added benefit. The key to this is expectations. If you (or your business users) expect by some miracle to implement a cluster and achieve 100% uptime or anything close, then somebody somewhere is going to be bitterly disappointed. Implement clustering within your architecture using the correct applications and with the right messages surrounding it, and it will be a useful addition to your armoury.

I still want to cluster so what’s the best way?

The first golden rule of clustering is the same for all implementations within (and probably outside) the IT world: keep it as simple and straightforward as you possibly can. Complexity is a particular danger within IT because, let’s face it, the main reason most people decided to carve out a career in this industry is that they like toys and gadgets and new “stuff”. The most important thing to do while deciding on your high availability solution is to identify your requirements honestly. I’m sure everyone has had the experience of asking “What availability do you need for this system?” and the owner replying that nothing short of 100% would be acceptable. This then sends budgets and imaginations soaring, and before you can say “unsupportable mistake” your new implementation spans five continents and allows whoever may be concerned to purchase teddy bears whenever they need with astounding reliability.

The first question to ask, then, is not the level of availability needed but “Can you define the amount of actual loss that the company will incur over 5 minutes, 1 hour and 8 hours?” This will give you an indication of the type of system you need to be specifying. At this stage the more astute business owner will spring to life and probably announce importantly that it’s not just actual costs but reputational loss that is important. This is a widely used argument for complicating systems and increasing project cost, and it is based on nothing but vapour. While ensuring the security of your systems to prevent defacement of web sites, or protecting your customers’ data from theft, is essential to maintain your company’s reputation, having a web site out of action for an afternoon in a way that causes no adverse monetary effect on your customers is very unlikely to have any effect on your company’s reputation either.

To reiterate the point above: establish the longest outage the system can suffer before the actual loss to the business becomes unacceptable, then create the simplest and most supportable system that meets that requirement. If you can better the requirement with no extra spend or ongoing support cost, then do so.
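As a rough illustration of turning that question into a requirement, the sketch below estimates the loss for the three outage lengths mentioned above and works back to the longest tolerable outage. It assumes, purely for simplicity, that loss accrues at a constant hourly rate; the rate, the acceptable loss and the function names are all hypothetical figures for the example, not numbers from any real system.

```python
def loss_for_outage(loss_per_hour: float, outage_minutes: float) -> float:
    """Estimated loss for an outage, assuming loss accrues linearly."""
    return loss_per_hour * outage_minutes / 60


def max_tolerable_outage_hours(loss_per_hour: float,
                               acceptable_loss: float) -> float:
    """Longest outage whose estimated loss stays within the acceptable figure."""
    if loss_per_hour <= 0:
        raise ValueError("loss_per_hour must be positive")
    return acceptable_loss / loss_per_hour


# Hypothetical figures supplied by the business owner.
LOSS_PER_HOUR = 1200.0
ACCEPTABLE_LOSS = 3000.0

for minutes in (5, 60, 480):  # 5 minutes, 1 hour, 8 hours
    print(f"{minutes:>3} min outage: {loss_for_outage(LOSS_PER_HOUR, minutes):.2f}")

limit = max_tolerable_outage_hours(LOSS_PER_HOUR, ACCEPTABLE_LOSS)
print(f"longest tolerable outage: {limit:.1f} hours")
```

With these made-up numbers the system needs to be recoverable within two and a half hours, a requirement that a well-rehearsed manual failover may meet without any clustering at all.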

The second golden rule of clustering is: if your application isn’t cluster aware, don’t cluster it. Unless your application benefits from the extra complexity by maintaining state and data when failed over, you’ll gain minimal benefit, if any, from clustering it. Use other methods instead: web site load balancing, scripted failover or even a standalone server.

So after establishing the need and suitability for clustering, the next step is to establish the topology you’ll implement. If you’re simply interested in a single site then your options are straightforward: a two node Windows cluster in either active/passive or multiple instance mode (see the comments above if you choose the latter).

For multiple site or Prod/DR configurations you have a few further options available to you.

Local Windows Cluster with Standalone DR

 

Fairly straightforward, easily supportable and reliable; you’ll find every Windows engineer able to support this type of system from the minute you hire them. Failover to DR is less elegant than with some of the later systems listed here, but it is very reliable. Typically DBAs and application owners do not like this solution, as there are some DR failover tasks that need to be performed by them rather than being automated. In addition, strict controls must be in place to ensure all servers are updated when new releases take place. This solution also fails the “but we’ve got a server doing nothing” management test, despite the fact that it’s the one that’s least susceptible to network and power issues and has the lowest support costs.

Stretch Clustering

There are a few options here: you can implement Multiple Node Segregation, or you can opt for a third party product from a supplier such as Veritas, or SRDF-CE (GeoSpan) from EMC.
Let’s look at Multiple Node Segregation first.

 

 

The first thing we notice with this configuration is that we have three nodes across three individual sites. All of the nodes are simply running Microsoft Windows Clustering. Notice, however, that only two of the nodes are connected to the SAN storage, which is mirrored across the sites to ensure consistency of data across nodes. These two nodes are the only ones running the application; the third node is there to ensure cluster integrity only. With a two node cluster, when the two nodes lose contact with the quorum and/or each other, the result can be a split brain condition in which each node thinks it is in charge and takes the resources, or neither thinks it is in charge and no node is active. Either way your cluster, and so your application, fails until your engineer sorts out the issue. Multiple Node Segregation attempts to address this by giving each node its own quorum and replicating the data between them. Two of the nodes run the application; the third server in the cluster compares notes with the other two and gives one of them a majority, making it the active node. Should the two application nodes both try to be active, i.e. to split brain, it is the third node that provides the additional vote to prevent this happening.
This solution provides automated failover to DR with minimal input from DBAs and application teams. However, it is expensive to implement, as mirroring your SAN will require hardware and very fast intersite links, and the quorum replication can be unreliable as well. In terms of support it is usually less reliable than the local cluster solution and demands more support and expertise from all of the Windows, Storage and Network teams.
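The role of the third node can be illustrated with a generic majority-vote check. This is a minimal sketch of the voting idea only, assuming one vote per node; it is not Microsoft's implementation, and the node names and function names are invented for the example.

```python
from typing import List, Set


def has_quorum(reachable_votes: int, total_votes: int) -> bool:
    """A partition may own the cluster only with a strict majority of votes."""
    return reachable_votes > total_votes // 2


def partitions_with_quorum(partitions: List[Set[str]],
                           total_votes: int) -> List[Set[str]]:
    """Return the partitions (groups of nodes that can still see each
    other) that hold a majority, assuming one vote per node."""
    return [p for p in partitions if has_quorum(len(p), total_votes)]


# Three node cluster: the link between the two application nodes fails,
# but the third (voting only) node can still see node A.
three_node_split = [{"node_a", "voter"}, {"node_b"}]
print(partitions_with_quorum(three_node_split, total_votes=3))

# Two node cluster with the same failure: neither side has a majority,
# so no node may safely take the resources.
two_node_split = [{"node_a"}, {"node_b"}]
print(partitions_with_quorum(two_node_split, total_votes=2))
```

Because a strict majority can exist in at most one partition, two nodes can never both believe they are entitled to the resources, which is the split brain protection the third site is paying for.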

Third Party Stretch Cluster

 


This solution can be implemented with several other third party products, not just GeoSpan as listed. Veritas has a topologically simpler solution, although it uses very different technologies. Here the stretch cluster works in exactly the same way as a local cluster, except that it is geographically dispersed. Just as a local cluster can be affected by split brain, so can this solution. However, with WAN links and SRDF replication there are far more links to go wrong, and so reliability is lower for this model. Hardware costs are low, although the third party software is not cheap. Support skills for third party solutions are rarer and will therefore be more expensive, and they will be exercised far less frequently.
As a final note, all these third party solutions could be used in Multiple Node Segregation mode, bringing the best (and worst) of the two solutions to the system.


Alternatives

As we have seen, there is no clustering solution that is the silver bullet in terms of uptime, reliability, simplicity and supportability; all have their weaknesses.
There are, however, some technologies that are overlooked when looking for high availability solutions. Boot from SAN, for instance, provides very good response and reliability. More promising, however, is VMware, or virtualisation in general.

Typically when VMware is mentioned we instantly think of getting all the cobwebby, large, slow servers in our data centre running in virtual format on a single medium sized server, and this is of huge benefit. Where high performance applications that use lots of CPU, network and memory are concerned, however, sharing resources can obviously cause performance issues. There is also an overhead to running VMware or any virtualisation layer itself, although with hardware performance increases this is becoming negligible.

Where the business case allows, it is possible to buy suitably sized servers and run those CPU intensive applications as single VMware instances, and so alleviate the performance concerns. If we ignore server consolidation and look at some VMware features, we uncover some interesting benefits. VMotion, a component of VMware, is a feature that allows a virtual server to be moved to any other VMware server within the farm with minimal, if any, downtime. This would allow not just the cluster type capability of maintaining an application and providing DR, but true flexibility of server usage. The production “failover” servers, for instance, could run the test instances of the application when not otherwise in use. When the application required failover, the test servers would be taken offline and the production instances brought up. This is a far cleaner and more efficient use of hardware and systems than traditional clustering.

Taking this one step further, grid modelling brings the option of running applications not only across many, sometimes hundreds, of physical servers, but across virtual servers as well. In this way extremely high availability and true reliability can be achieved.
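The dual use of a failover host can be sketched as a simple model. Nothing here calls a real VMware API; the class, the function and the virtual machine names are all hypothetical, and the sketch only shows the order of operations described above: stop the test instances, then start production in their place.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Host:
    """A hypothetical virtualisation host and the VMs it is running."""
    name: str
    running: List[str] = field(default_factory=list)


def fail_over(standby: Host, production_vms: List[str]) -> List[str]:
    """Stop whatever the standby host is running and start production there.

    Returns the names of the displaced (test) VMs so that they can be
    restarted once the original production host is repaired.
    """
    displaced = list(standby.running)
    standby.running = list(production_vms)
    return displaced


# Day to day, the DR host earns its keep by running test instances.
dr_host = Host("dr-host-01", running=["app-test-1", "app-test-2"])

# Production fails: the test instances give way to production.
displaced = fail_over(dr_host, ["app-prod-1", "app-prod-2"])
print(dr_host.running)   # ['app-prod-1', 'app-prod-2']
print(displaced)         # ['app-test-1', 'app-test-2']
```

Unlike the multiple instance cluster, the host only ever has to be sized for one workload at a time, because the test instances are expendable during a failover.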

So….Is Windows Clustering Virtually a Thing of the Past?

I hope that over the last few minutes I’ve established that some of the claims for cluster technology are not quite up to the levels that we’d like. Clustering still has a place in the environment, but it is one that will diminish very quickly as virtualisation becomes more widespread and the big OS vendors put their weight behind it. Likewise grid technology is hot on the heels of the VMware crew and will consign clustering to the history books. When this will happen is difficult to say, but with little improvement in Windows 7 clustering technology and SQL 2008 now supported on VMware, the direction in which things are moving is clear.

So….Is Windows Clustering Virtually a Thing of the Past? Not quite yet but I don’t think we’ll be buying it many more birthday cards.