Don’t Forget About DR for Your DR

Here’s a scenario:

Let’s say you’re a small- or medium-sized company with either an on-premises data center in your office/building or in a “regular” co-lo nearby in the same metro area. You’ve got a mission-critical online presence, so in order to handle either a large-scale disaster for your geographic area or one confined to your server room, you’ve written, implemented, and tested a disaster-recovery plan. Another co-lo a couple of states over is set up to step in if needed, and the failover process can even be completed by non-technical staff in a couple of hours.

Need a Plan C

This is a fairly sound plan. However, what’s Step 2 after Something Bad™ happens to the primary data center and everything fails over to the DR site? What if Something Bad™ is long-term? You’re back to square one, running out of a single data center. And where does the quorum file share for your AG live now?

Or, another situation: What if something happens to your DR site? Then what?

Almost Been There, Done That

One of our clients, who has a really good DR plan similar to the one described above, had a brush with this scenario earlier in the year. Their DR data center is in the Houston area, and in the aftermath of Hurricane Harvey, there were some concerns about its status. The data center itself was fine, but key support personnel would not have been able to get to the site for a number of days had there been such a need.

This near-miss did a good job of spurring conversations about what to do in that situation and what a Plan C might look like.

Now What?

The point of this post is mostly to get you thinking about this scenario. Getting DR in place can be enough of a battle by itself (I know), but planning for what happens next, after a disaster actually strikes, is another important step.

What this plan looks like will likely depend on what the “first stage” DR plan looks like. Not everyone can afford an additional site, especially a smaller company. And, let’s be honest: we could sit here all day playing what-if with burning data centers, but at some point, the return on that investment becomes very questionable.

Although this looks/smells like a shameless plug for cloud/Azure, the public cloud is an excellent option to consider here. Even if your company is 100% on-premises with a classic hardware/virtualization platform, keeping a copy of critical systems’ backups up-to-date and available in the cloud is relatively inexpensive. This “cold DR” process is a very easy-to-implement safeguard against a multi-phase or long-term disaster. In the event these backups are needed, there’s the option of spinning up a group of VMs in the cloud to restore to. At the very least, this cold backup will be more accessible than your current offsite backups if new on-prem servers are stood up somewhere to get the lights back on.
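
For SQL Server shops, one cheap way to implement this is native backup to Azure Blob Storage. Here’s a minimal sketch, using the SAS-credential form available in SQL Server 2016 and later; the storage account, container, and database names are all hypothetical:

    -- One-time setup: a credential named after the container URL
    CREATE CREDENTIAL [https://examplestorage.blob.core.windows.net/sqlbackups]
        WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
        SECRET = '<SAS token>';  -- generated on the storage account

    -- After that, backups can target the cloud directly
    BACKUP DATABASE AccountingDB
        TO URL = 'https://examplestorage.blob.core.windows.net/sqlbackups/AccountingDB.bak'
        WITH COMPRESSION, CHECKSUM;

Schedule something like that alongside (not instead of) the regular local backup jobs, and the cold DR copy stays current without anyone having to think about it.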

High Availability in SQL Server Standard Edition (or Semi-Lack Thereof)

SQL Server 2012 brought some major changes to the various High Availability schemes supported by the product. The biggest of these is the introduction of AlwaysOn Availability Groups. As described early in that MSDN article, these can be summed up, with some oversimplification, as “enterprise-level database mirroring.” This is not quite the same thing as the existing Failover Clustering (which is still available), although AGs do require and run on a cluster.
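
For the uninitiated, a minimal two-replica AG definition looks something like the sketch below. All names here are hypothetical, the underlying Windows cluster and database mirroring endpoints have to exist first, and the secondary still has to join afterward with ALTER AVAILABILITY GROUP … JOIN:

    -- Minimal two-replica Availability Group (Enterprise-only in 2012)
    CREATE AVAILABILITY GROUP AccountingAG
    FOR DATABASE AccountingDB
    REPLICA ON
        N'NODE1' WITH (
            ENDPOINT_URL = N'TCP://node1.corp.local:5022',
            AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
            FAILOVER_MODE = AUTOMATIC),
        N'NODE2' WITH (
            ENDPOINT_URL = N'TCP://node2.corp.local:5022',
            AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
            FAILOVER_MODE = AUTOMATIC);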

From a Business Intelligence perspective, it’s a somewhat different situation: Analysis Services is cluster-aware, so it can be used in a Failover Clustering setup. SSRS has scale-out capabilities which, if the deployment is architected with them in mind, can provide some form of redundancy. SSIS has nothing built in for high availability, which one could expect for an ETL solution (I could go on for a while about why HA ETL is dicey, but that’s not what we’re here for). AlwaysOn AGs don’t exist for any of these products, possibly because what the feature does doesn’t make sense for anything except, I would argue, SSAS. I’m mostly not here today to talk about BI HA, but I will come back to it briefly.

2012 Changes, Plural

With the introduction of AGs as “beefy mirroring”, it didn’t make sense to continue supporting multiple, awfully similar features. The result: Database Mirroring, introduced in SQL Server 2005, is deprecated as of SQL Server 2012. It’s not on the “will be removed in the next version” list, since this is the first release in which it has appeared, so there are at least two major version releases before it goes away entirely. (With SQL Server 2014 announced last week at TechEd North America, stay tuned for its documentation release to see whether Mirroring has moved closer to death.)
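
If you’re not sure whether Mirroring is still in play somewhere in your environment, each instance can report on it directly; this catalog view has been around since 2005:

    -- List databases participating in Database Mirroring on this instance
    SELECT d.name,
           m.mirroring_role_desc,
           m.mirroring_state_desc,
           m.mirroring_partner_instance
    FROM sys.database_mirroring AS m
    JOIN sys.databases AS d
        ON d.database_id = m.database_id
    WHERE m.mirroring_guid IS NOT NULL;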

The point is, it will be going away eventually. So, what to do? Logic suggests the intended migration path for DB Mirroring users is to move to AlwaysOn AGs. Sounds like a good enough idea: as mentioned, Microsoft themselves describes the feature as enterprise-grade mirroring, and Standard Edition does support two-node clustering, so let’s do that!

When They Said “Enterprise”, They Really Meant It

There is a potential problem with that logic. Specifically, if one has been using (or would like to start using) the synchronous-only flavor of DB Mirroring available in the Standard Edition of SQL Server, the available options have gotten realllly thin. See, AlwaysOn AGs aren’t available in the Standard Edition of SQL Server, at least not in 2012. This means that if a company is running a few mission-critical DBs in a mirroring setup with Standard Edition all around, that setup’s upgrade path is very limited: in order to keep it, they won’t be able to upgrade past whatever future version is the last one to include Mirroring. And for any other company that would like to deploy such a setup in the future, there will come a point when they can’t, because the feature won’t exist in their desired Edition of SQL Server.
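
A quick sanity check of where a given instance stands (IsHadrEnabled is new in 2012 and returns NULL on older versions):

    -- Edition and AlwaysOn status for this instance
    SELECT SERVERPROPERTY('Edition')        AS Edition,
           SERVERPROPERTY('ProductVersion') AS ProductVersion,
           SERVERPROPERTY('IsHadrEnabled')  AS IsHadrEnabled;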

Unless, of course, they want to upgrade to Enterprise. That’s…well…expensive. It always has been, but on most modern hardware, the jump from Standard to Enterprise is bigger than it used to be. There are plenty of other reasons worth spending the extra money on Enterprise, but just because a system or DB is nosebleed-mission-critical doesn’t mean it’s huge, requiring table partitioning or the like to run well. Especially at a small-to-midsize company, HA might be the only Enterprise Edition feature needed. Is it worth the money? Wouldn’t it be nice if things stayed closer to how they are now?

What Should it Look Like?

This is the whole reason I’m here: what do I want the HA situation in Standard Edition to look like?

I do not believe that High Availability options not named “Log Shipping” should be Enterprise-only. At least not entirely. I’m not saying Microsoft should make all four secondaries (eight in 2014) available in Standard. Nor am I 100% convinced that they should be readable in Standard like they are in Enterprise. I think that a single secondary, living on a second instance on the other node of that 2-node cluster allowed in Standard, usable for failover purposes only, would do the trick.
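
In the Enterprise syntax that exists today, the restriction I’m describing is already expressible per replica. Reusing the hypothetical AccountingAG from earlier, a failover-only secondary is just:

    -- Secondary exists purely for failover; no client connections allowed
    ALTER AVAILABILITY GROUP AccountingAG
        MODIFY REPLICA ON N'NODE2'
        WITH (SECONDARY_ROLE (ALLOW_CONNECTIONS = NO));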

This starts to look similar to the mirroring setup currently available in Standard, and that’s exactly the point. I don’t think we should get everything without having to pay for it (i.e., all of the nice fancy stuff in Standard). There are features that absolutely should be available only in Enterprise. Full-on readable secondaries, with SSRS reports or SSIS load jobs pointed at them, are one of those things that should require a fatter check to MSFT.
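
For contrast, that fat-check Enterprise flavor is the same replica opened up for reads, optionally with routing so clients can find it (the routing URL here is hypothetical):

    -- The Enterprise goodies: a fully readable secondary
    ALTER AVAILABILITY GROUP AccountingAG
        MODIFY REPLICA ON N'NODE2'
        WITH (SECONDARY_ROLE (
            ALLOW_CONNECTIONS = READ_ONLY,
            READ_ONLY_ROUTING_URL = N'TCP://node2.corp.local:1433'));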

Semi-Related BI Commentary

Since I’m filling out the SQL Server section of my Christmas List, I was going to say it would be nice to have AlwaysOn AGs for SSAS, too. After thinking about that for 15 more seconds, I realized that was dumb, since, due to the nature of SSAS, it would be pretty pointless—we would get the same thing out of some kind of scale-out architecture.

Such an architecture already exists, but I think it is terribly kludgey and almost has to be fragile in practice. So, why not make a “real” scale-out system based on the AG architecture? SSAS is cluster-aware already; all that’s needed is something to automate copying the freshly-processed data from the Primary (“Data Processing Server” in that article) to the Secondaries (“Data Access Servers”). Add some awareness of this process to the existing AG listener process/service, and boom! I’ve never had to deal with quite that big of an SSAS environment, so this might be a terrible idea, but it sounds good in my head!

Except…I would expect this to be Enterprise Edition-only functionality. Sooo…nevermind.

T-SQL Tuesday #11: The Misconception of Active/Active Clustering

October's TSQL2sDay: Hosted by Sankar Reddy

I was expecting my inaugural T-SQL Tuesday post not to happen this month… until just yesterday, when I found myself helping explain to someone what an Active/Active SQL cluster is and what it isn’t. Ah hah, a topic! I expected this to be covered fairly well already, and sure enough, I found a couple of really good articles/posts by others on it. I will reference those in a bit, but I can cover some more general ground first.

The misconception I was explaining yesterday morning is the same one I’ve explained before and have seen others tackle as well: an Active/Active SQL Server cluster is not a load-balancing system. It is, in fact, purely for High-Availability purposes. If what you are really after is true load-balancing, unfortunately, you will have to go elsewhere, to something like Oracle RAC.

When I say “true load-balancing”, I mean being able to load-balance a specific set of data across multiple pieces of hardware. What you can do with Active/Active clustering is load-balance a set of databases across multiple pieces of hardware. This is useful if you have a SQL Instance with multiple very heavily-utilized databases—one or more of these DBs could be pulled out to its own instance on separate hardware. This is not done without potentially serious tradeoffs, however, which will be covered in a bit.

The underlying reason for this is that Windows Clustering is only designed for HA purposes. With its “shared nothing” approach to clustering, any given resource within the cluster (say, the storage LUN that holds your accounting software’s MDF file) can only be controlled by one node of the cluster at a time. If the accounting DB’s home LUN is currently attached to the first node in a cluster, it’s going to be pretty hard for the second node to service requests for that database.

OK, what is Active/Active Clustering, then?

I like to think of it as two Active/Passive clusters that happen to live on the same set of hardware. To illustrate:

Active-Passive Cluster Example 1

This example cluster is a two-node Active/Passive cluster that contains a single SQL Server Instance, CLUS1\ACCOUNTING. This Instance has its own set of drive letters for its DB, Log, and TempDB storage (say, E:, F:, and G:). If both nodes are healthy, Node2 never has anything to do. It sits around twiddling its bits, just waiting for that moment when a CPU melts down in Node1, and it can take over running the ACCOUNTING instance to show how awesome it is.

Active-Passive Cluster Example 2

This is another two-node Active/Passive cluster, running a single instance, CLUS2\TIME (you know, because internal time tracking software is that important). Like CLUS1\ACCOUNTING above, it has its own set of drive letters for its storage needs (M:, N:, and O:). In this case, the SQL Server Instance is on Node2, leaving Node1 with nothing to do.

What do we get if we mashed these two clusters together?

Active-Active Cluster Example

We would get this, an “Active/Active” SQL cluster. It’s logically the same as the two previous clusters: it just so happens that instead of the second node doing nothing, it is running another independent instance. Each instance still has its own set of drive letters (E:, F:, and G: plus M:, N:, and O:). Just like in the “Accounting Cluster” above, if Node1 were to fail for some reason, the CLUS1\ACCOUNTING instance would be started up on Node2, this time, alongside CLUS2\TIME.
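
A handy aside: from inside a clustered instance, you can always tell which node you’re actually on, because the virtual server name and the physical host name are exposed separately:

    -- Virtual (clustered) name vs. the node currently hosting the instance
    SELECT @@SERVERNAME                                  AS VirtualName,
           SERVERPROPERTY('ComputerNamePhysicalNetBIOS') AS CurrentNode;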

So, is doing this a good idea?

Maybe. Above, I talked about how a single-instance Active/Passive cluster with more than one heavily-used DB could be re-architected into an Active/Active cluster, with some of the heavy DBs on one instance living on one node and the rest on a second instance living on the other node. This gives you both High Availability and load spread across two servers, right? It does, but what happens if one of the nodes fails and the two instances “clown-car” onto the remaining node? Are there enough CPU & RAM resources on the remaining server to support the combined load?

That illustrates one of the major things to consider when thinking about deploying an Active/Active cluster. Tim Ford (Blog | @sqlagentman) has an article over at MSSQLTips that goes into detail about making this decision.

In addition, Aaron Bertrand (Blog | @AaronBertrand) has a sweet little piece of automation to help manage resources when trying to squeeze the maximum number or size of clowns into a single car without someone’s foot sticking out in the breeze. Of course, it’s something that you hope never actually gets used, but it’s a good way to wring the most performance out of an Active/Active cluster.
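
I won’t reproduce Aaron’s solution here, but the core idea can be sketched in a few lines of T-SQL. Everything below is hypothetical (node name, memory numbers), and the real thing needs to run automatically at startup/failover, e.g., from a SQL Agent job scheduled to run when the Agent starts:

    -- Sketch: give memory back when this instance fails over onto the
    -- node its sibling instance normally occupies.
    DECLARE @node sysname =
        CONVERT(sysname, SERVERPROPERTY('ComputerNamePhysicalNetBIOS'));

    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;

    IF @node = N'NODE2'  -- away from home: assume we're sharing with CLUS2\TIME
        EXEC sp_configure 'max server memory (MB)', 12288;
    ELSE                 -- home alone: take (most of) the box
        EXEC sp_configure 'max server memory (MB)', 28672;
    RECONFIGURE;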

Conclusion

I don’t know if this is the best way to explain this concept, but I like it. The Active/Passive cluster concept is usually understood, so I believe it is a nice stepping stone to the truth about Active/Active clusters. As with everything, there are tradeoffs and gotchas to consider when designing server architectures like this, and the authors mentioned above have gone a long way toward covering those topics.

There we have it, my first T-SQL Tuesday post. Hopefully it is useful and not from too far out of left field 😉