T-SQL Tuesday #128: Let’s Talk About Your Incident Reports

TSQL2SDAY Logo

Hello T-SQL Tuesday Readers! I’m sorry for being really late in getting this post out this week.

So! A couple of weeks ago, for this month’s topic, I asked everyone to post about something that broke or went wrong, and what it took to fix it. Last week, fifteen of you responded with your stories of woe so we could all learn from your incidents and recoveries in a constructive way, like pilots do. Here’s the recap of those posts, in the order they came in.

What Everyone Had to Say

First up is Rob Farley with “That time the warehouse figures didn’t match”… heh, I feel like I’ve heard this one before. But it turns out, no! This one is new and awesome. Rob talks about a fundamental rule he has when loading data into a data warehouse: “protect the base table.” This is his first step in ensuring that data in the DW is correct, and as anyone who does DW or BI work knows, that is always the most important thing, because once trust in the data coming out of the reporting system is lost, it can be pretty hard to get it back.

John McCormack has “Optimising a slow stored procedure” next. John walks through his process for tuning a stored procedure he’d gotten an after-hours call about, because it was slow enough that things were breaking. He’s got a good tip in here if you use SentryOne’s Plan Explorer, too. AND, as an added bonus, the post includes something I find frustrating whenever it happens. Basically: “This web page is usually really slow and the users were frustrated about it, but nobody ever told me!” Y’all! Tell us (IT, support, whoever) when you’re not happy; we’re usually happy to fix things to make your life easier!

Richard Swinbank talks about that time he unintentionally set a trap for himself in “Default fault.” Changing the default database for your login in SQL Server when you’re the DBA is all fine until you decommission that database! Richard includes great steps for digging yourself out of this hole with sqlcmd if you’ve “locked yourself out” of the instance when using SSMS.
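
If you ever land in this spot yourself, the recovery looks roughly like the sketch below (this is my paraphrase, not Richard’s exact steps, and the server/login names are placeholders): use sqlcmd’s -d switch to connect to a database that still exists, then repoint the login.

```powershell
# Hypothetical sketch -- server and login names are made up.
# -d picks the database to connect to (sidestepping the missing default); -E uses Windows auth.
sqlcmd -S "MyServer" -d master -E -Q "ALTER LOGIN [CORP\MyDbaLogin] WITH DEFAULT_DATABASE = [tempdb];"
```

(Pointing the default at tempdb, or anything else guaranteed to always exist, keeps the trap from springing again.)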

Eitan Blumin’s post about a misbehaving set of Azure VMs has a good surprise twist in it. This post struck me as a little hilarious, both because of the 32-hour bit and because we had just been talking about fullscan statistics updates in our internal corpchat this week. I like fullscan stats in general, when they can be pulled off and are helpful (think non-uniform data distribution), but when they take down your AG, that’s, uh, bad. Don’t do that. Eitan also includes a nice set of takeaways at the end of the post.
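
For context (my own illustration, not anything from Eitan’s post), a fullscan stats update is just the WITH FULLSCAN flavor of UPDATE STATISTICS: it reads every row in the table instead of sampling, which on a big enough table is a lot of I/O and churn to be pushing through an AG.

```powershell
# Illustrative only -- server, database, and table names are placeholders.
# Requires the SqlServer PowerShell module for Invoke-Sqlcmd.
Invoke-Sqlcmd -ServerInstance "MyServer" -Database "MyBigDatabase" `
    -Query "UPDATE STATISTICS dbo.BigFactTable WITH FULLSCAN;"   # scans the whole table, no sampling
```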

SQL Cyclist/Kevin3NF wins the prize this month, I think, for the longest contribution. Turns out, he has an ongoing series of real-life stories along these lines, including seven posts about fun goings-on that fit this topic.

Next is Jason Brimhall with “Disappearing Data Files”… Jason shares a pet peeve of mine, which is having to fix the same thing over and over again. When that happens, it’s good that you at least know the fix, I guess… Anyway, Jason goes into using Extended Events as an audit tool to detect changes to tempdb files and help track down the root cause of a “recurring fix,” and reiterates that some root causes are hard to pin down, so the best thing you can do is be ready the next time it happens.
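
If you want something like that in place before you need it, a session along these lines is one starting point. To be clear, this is my own minimal sketch, not Jason’s actual session (his post has the real thing), and it only watches tempdb file size changes; it also assumes the SqlServer PowerShell module and a placeholder server name.

```powershell
# A minimal sketch, not Jason's session: record tempdb file size changes plus who/what caused them.
$xeSession = @"
CREATE EVENT SESSION [tempdb_file_changes] ON SERVER
ADD EVENT sqlserver.database_file_size_change(
    ACTION (sqlserver.sql_text, sqlserver.client_app_name, sqlserver.username)
    WHERE (database_id = 2))                               -- 2 = tempdb
ADD TARGET package0.event_file(SET filename = N'tempdb_file_changes')
WITH (STARTUP_STATE = ON);                                 -- keep collecting across restarts
ALTER EVENT SESSION [tempdb_file_changes] ON SERVER STATE = START;
"@
Invoke-Sqlcmd -ServerInstance "MyServer" -Query $xeSession   # "MyServer" is a placeholder
```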

Deborah Melkin says, “I feel like it’s been a while since I’ve joined the party.” Ohh, I’m pretty sure it’s not as long as it has been for me… Deborah doesn’t have a specific break/fix story, but talks about how experience and dealing with problems over time make it easier to address new problems as they come up. This is the core lesson (lesson? process?) I want everyone to get this month, so I appreciate this perspective!

Hugo Kornelis has my favorite post of the month. If you’re the type who doesn’t read all of the T-SQL Tuesday posts and waits for the recap to find the one or two that look the most interesting, make sure this one is on your list. It’s short and to the point, and contains a fantastic lesson. Two, really, although one isn’t explicitly called out as a lesson. The implicit one is that if there’s something you do on a regular or recurring basis, you should have that process written down. Think of it like a checklist (flying reference ahoy!). What Hugo makes clear is that even when you’re doing something a little different that you don’t have a process for, if it’s adjacent to a documented process, it’s still a good idea to go back and reference that process. Just read Hugo’s post; it’s easier than listening to me 🙂

I felt bad just reading Aaron Bertrand’s post. Just, go read it. Promise. Take his advice.

Lisa Bohm brings us what I think is a heartwarming story about what transparency and honesty can bring to even professional relationships. Lisa tells us about how this worked out for her while working for an ISV when things went bad on a Friday night (it’s always a Friday night). In addition to the honesty part, she had another takeaway that we can all do better at sometimes: Shut up and start listening.

Next up is STEVE JONES, the one who got me into this damn mess in the first place. He at least takes ownership of that in the post, heh. Steve’s speaking my language here with “And too few companies share their learnings publicly.” …This is a gripe of mine, as well, and I think a lot of IT shops would make fewer bone-headed mistakes if everyone were more willing to share what they’ve learned, NTSB-style. I understand why things are the way they are, but it doesn’t mean I have to like it. Anyway, Steve has many encouraging words about learning from others, and about how he has been fortunate enough to work with people who have helped him build better systems. I always appreciate Steve’s writing, and this post is no exception.

Glenn Berry wrote about a place where I think we’ve all been: it boils down to not RTFM 🙂 But also, it involves building PCs, which maybe most of us used to do, but hardly any of us do anymore. I still do, but only once every, like, eight years, so… This was funny, because as Glenn walked through how he got there, I saw right where this was going; I mean, with four disks to hook up, I would have gone right for that nice quad of stand-offs, too! Why split the cables up between two places when you could just use one! Yeah. Uh-huh. Yep. Glenn’s main takeaway is, basically, read the documentation, which is always a good idea. As someone who really likes writing it (I’m aware of how broken I am), I also know how little that happens in practice.

Todd Kleinhans talks about Letting it Fail. This is true: sometimes things have to break to get the right people’s attention, or to show just how bad a situation could get. Todd has a story about just one of these situations. I don’t think any of us like things getting to that point, but sometimes there aren’t other options. Todd also has a good final takeaway, about “doing nothing” being a valid option in a situation, and it really is. It may not lead to a good outcome, but it is an option!

Tracy Boggiano starts out with a line that is a big “been there, done that” for me: “One fateful night while I was not on call, I got a call around 3:30 AM.” Ahh yes. You’re not on-call, but you wind up on the horn with Ops, anyway. Tracy’s story has some good head-shaking items in it, which is about how I expect a story that starts like this to end. Tracy has a good line towards the end: “Everything from the network, to the server hardware, to the database creates the system and working as a team is the only way to make sure things are configured to perform and not fail.” Ain’t that the truth…

And finally, my man Andy Yun talks about Presentation Disasters. Andy comes through here with probably the most aviation-related lesson of all: Paranoia Pays Off. Yessssss! Always have a Plan B (plus C and D if possible) so when things go pear-shaped, you already have a plan. Andy was presenting at GroupBy back in May when his headset died. But he was ready! Good lesson for everyone and everything here, not just those of us slinging T-SQL or flying airplanes.

And that’s it! This was the first time I’ve hosted T-SQL Tuesday, and I want to thank everyone who shared their stories with us this week!

T-SQL Tuesday #128: Learn From Others

Pilots do something that a lot of non-pilots will find fairly weird if not outright horrifying: We read accident (“crash”) reports. Some of us spend a lot of time reading accident reports, actually. Officially “Accident Reports”, these are put out by the US National Transportation Safety Board (NTSB) after investigation into a crash or an “incident.” In addition to aviation-related reports, there are highway and railroad reports, and even hazardous materials incidents.

Reports come in two flavors: a “preliminary” report and, ultimately, a “final” report once the investigation has completed. The final report includes such items as conclusions and the probable cause of the accident or incident. To make life easier, it also includes a Recommendations section, which, well, includes recommendations for how to keep this type of accident from happening in the future. These tend to be regulatory in nature, as they are geared towards the FAA.

The search form for aviation reports is here–https://www.ntsb.gov/_layouts/ntsb.aviation/index.aspx–if you’re, uh, thinking you want to get into this sort of thing.

Why do pilots do this? The rationale is pretty simple: To learn from the mistakes of others. Or, to learn how a bad day was kept from becoming a worse day after something broke.

What Does This Have to Do With SQL Server?

Great question. Besides the fact that I think piloting airplanes and DBA-ing are the same job, just with different scenery, I wish we had this kind of transparency in the IT world when things go wrong. When a corporation has a big security incident, we’re not likely to hear a lot of details publicly about what went wrong and what was done to mitigate similar attacks in the future. That kind of information could help everyone. This is one of the things cloud providers do quite a bit better: when something breaks, we get good information on what happened, why, and what’s going to be done about it. Of course, that’s because public cloud providers basically have to–if things went down a lot and we never heard why, that provider probably wouldn’t have a lot of customers for very long.

This brings me to T-SQL Tuesday.

Tell me (us all, obviously) about something recently that broke or went wrong, and what it took to fix it. Of course, the intent here would be for this to be SQL Server-related, but it doesn’t have to be. We can all learn from something going wrong in infrastructure-land, or how there was a loophole in some business process that turned around and bit somebody’s arm. It doesn’t even have to be all that recent–maybe you’ve got a really good story about modem banks catching on fire and that’s how you found out the fire suppression system hadn’t been inspected in years. Just spitballin’ here. If you’ve got an incident whose resolution can help someone else avoid the same problem in the future or improve a policy as a preventative measure, let us hear about it.

The Rules

Here are the rules as set out for the T-SQL Tuesday blog party.

  1. Your post should be published on Tuesday, 14 July, 2020 between midnight and 11:59:59 UTC/GMT/ZULU
  2. Include the T-SQL Tuesday logo in your post
  3. Link back to this invitation (usually done through the logo)
    (this will get syndicated, so link back to the original on airbornegeek.com, please)
  4. Include a comment on the invitation post or a trackback link
  5. Enjoy the chance to be creative and share some knowledge.

Microsoft Kinda, Sorta Broke RDP: That CredSSP RDP Error

If you’re running into this, it probably started on May’s Patch Tuesday, which was last week (the 8th). You’re getting this error when you try to RDP to one or more servers:

An authentication error has occurred.

The function requested is not supported.

This could be due to CredSSP encryption oracle remediation.

Here’s what happened

Errors make me sad

In March, a vulnerability in CredSSP (Credential Security Support Provider), which RDP uses for authentication, was patched; this is outlined in advisory CVE-2018-0886. However, the fix was implemented in such a way that the behavior change didn’t have to be “honored” by either the server or the client involved in an RDP session.

The intent was that this would be controlled by GPO in enterprise environments, and a new GPO setting to activate or deactivate this behavior was released at the same time.

GPO settings have a default value, which is used when nothing has been explicitly set for a particular setting. In this case, the GPO has three possible values: Force Updated Clients (servers will only take connections from patched clients), Mitigated (applies to both servers and clients; on a workstation it means the client won’t fall back to the old/insecure behavior when attaching to unpatched servers), and Vulnerable (also applies to both, and means what it sounds like: anything goes!).

In March, the default behavior was set to “Vulnerable,” which meant everything kept working for everyone. But in the May security rollup, the default setting for that GPO was flipped to “Mitigated” if there was not an explicit setting for it… the end result being the core problem some are running into: clients that have received the May update are no longer able to connect to RDP servers that have not received the March vulnerability fix.

Welp.

(For a bit more background on all of this, see this Microsoft blog post: https://blogs.technet.microsoft.com/askpfeplat/2018/05/07/credssp-rdp-and-raven/)

Good News: Easy Workaround

Fortunately, there’s an easy workaround that can be applied to any Windows workstation running into this behavior, with a couple of caveats.

If you are getting the above error trying to RDP to a server, all you have to do is set the corresponding GPO on your local workstation to Vulnerable.

To set this, run “gpedit.msc” on your machine. When the Local Group Policy Editor launches, navigate to Computer Configuration\Administrative Templates\System\Credentials Delegation on the left side, and then find Encryption Oracle Remediation on the right. Open that up, flip it to Enabled, and then choose “Vulnerable” for Protection Level. Hit OK, close GPEdit & you’re done; the change will take effect immediately.
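
If gpedit.msc isn’t available on the machine (Home editions, for example), the same change can be made directly in the registry; as I read Microsoft’s CVE-2018-0886 guidance, this is the value the GPO setting writes. A sketch, run from an elevated PowerShell prompt:

```powershell
# Registry equivalent of the GPO setting, per Microsoft's CVE-2018-0886 guidance. Run elevated.
# AllowEncryptionOracle: 0 = Force Updated Clients, 1 = Mitigated, 2 = Vulnerable
$path = "HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System\CredSSP\Parameters"
New-Item -Path $path -Force | Out-Null                        # create the key if it isn't there yet
New-ItemProperty -Path $path -Name "AllowEncryptionOracle" -Value 2 -PropertyType DWord -Force | Out-Null
```

The caveats below apply to this route just the same as they do to the gpedit one.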

There are a couple of caveats. The first is that this means you’re choosing to operate in an unpatched situation, which I don’t recommend. The second is that you can only apply this GPO setting on your local workstation if you’re not in an AD environment where it’s been set at the domain level and is getting applied to your machine; if that’s the case, the AD-level GPO will stomp on your local setting if it’s different.

Again: This should only be a temporary measure. The real fix is to get the March updates on your servers so you can set your workstation back to at least Mitigated (really should be Force Updated Clients). It’s not going to be my fault if you leave things unpatched and in Vulnerable and then something bad happens!

Commentary

Some have been referring to this as a “bug,” and… this isn’t a bug; I mean, the “breaking RDP” part isn’t a bug (the original vuln obviously is). This is 100% “system functions as designed.” There’s a vulnerability in a widely used feature of Windows, and MS pulled the “better to be on the ground wishing you were in the air than in the air wishing you were on the ground” card here. Being a patch hard-liner (I saw too much shit in the early 2000s), I think this is fine. If you don’t like it, there’s a workaround. But my attitude is tempered by the fact that this is only listed as an “Important” update, and the exploitability seems a little bit out there. Maybe give us all a few more months to notice?

Regardless, I DO think there was a communication failure here. Since few people read patch notes on a regular basis (I don’t anymore, even), relying on those to get the message out isn’t going to work. Even that PFE blog post–which is great–is still a little bit of shouting into the void until someone runs into the problem and goes looking for a solution.

I don’t know what to do about this part, because there’s really just not a mechanism to deal with it. And really, do we need another thing to watch for alerts and stuff? Plus, breaking changes happen on a regular basis… where do you draw the line? And what, should they have made the RDP client throw a pop-up message about this? That seems like an awfully big hammer.

I guess I’m going to have to go back to reviewing KB articles for patches again :-/

Azure Infrastructure pre-con ahead of #SQLSatCleveland

Microsoft Azure logo

SQL Saturday in Cleveland, Ohio is next week, on February 3rd. If you’re in the area or can easily make it there, I hope that you can come out for a great day of free SQL Server training. I enjoy presenting at SQL Saturdays; they’re fun and educational days for speakers and attendees alike. Last time we were in Cleveland, it had snowed overnight when it was time to leave town on Sunday morning. I’ve lived even longer in the south now, so if that happens again, it’ll be even more fun this time.

In addition to my session on Saturday, where I will talk about using database projects in SSDT/Visual Studio, I’ll also be presenting an all-day session on Friday on Azure Infrastructure. Planning and designing your infrastructure is just as important in the cloud as it is when building new systems on-premises. As Azure continues to grow and expand around the world, more companies will be choosing to migrate (or deploy new) services to the public cloud. Understanding the underlying components is imperative for high-performing, successful Azure deployments and hybrid migrations. In this session, we’ll cover infrastructure fundamentals with a bit of a focus on deploying and running SQL Server in Azure; however, there will be plenty of general background discussion that applies to any workload.

Registration for this precon is available here, on EventBrite: https://www.eventbrite.com/e/azure-infrastructure-presented-by-kerry-tyler-tickets-41688096218, with information about the overall SQL Saturday event available here: www.sqlsaturday.com/708

Saturday is free, but tickets for the full-day precon are $150.

I hope to see you next weekend!

Let’s Talk About Group Managed Service Accounts

SQL Server service dialog, Log On tab

We all know the best practices for SQL Server service accounts–domain accounts (if you’re using Active Directory), not local admins, a different one for each service (and server/instance), etc., etc. These are, of course, good best practices, and they should be followed as closely as possible in Production and on servers/instances that house Production data.

A problem arises if you have more than just a couple-few servers or run some of the BI components, however. The number of service accounts involved in your SQL Server plant could be very large, necessitating an incredible amount of overhead when it comes to managing those accounts. This goes beyond simply creating and assigning them–chances are good that there are policies in place that require changing passwords. User accounts, service accounts, and other automation accounts likely all fall under this umbrella. If you’re lucky, maybe non-user accounts have a longer change interval, but it’s still something that is going to need to be done on a regular basis. In large environments, this could take an excruciatingly awful amount of time to do.

All of this is not to mention the human factor involved here. One of the recurring themes in a couple of my presentations is making an effort to automate as many things as possible to remove the human from the process. Not that we’re bad, but there are some things, especially tedious and repetitious tasks, where dumb things go wrong simply because of the nature of the work. Changing a bunch of service account passwords is definitely one of them. There are two types of sysadmins: those who have changed a service account’s password but forgot to update and restart the service itself, and those who will.

Enter Group Managed Service Accounts

Group Managed Service accounts (gMSAs) are a way to avoid most of the above work. They are special accounts that are created in Active Directory and can then be assigned as service accounts. They are completely managed by Active Directory, including their passwords. This means no more manual work to meet the password-changing policy–the machine takes care of that for you.

There are other security-related controls that can be gained with them, but this is the major feature.

I’ll also note that you–the DBA–are likely to need some help from your AD admin to get these set up. They’re going to need to actually create the accounts for you, and there may be some changes needed to their AD configuration in order to support them. They’ll also need at least one Windows Server 2012 (or newer) domain controller in the domain, but I’d hope that’s not much of a hurdle these days.
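
For reference, that conversation with the AD admin tends to look something like the sketch below (all names are placeholders, and it assumes the domain already has a KDS root key, which is one of those AD configuration changes I mentioned). The account creation runs on the AD side; the install/test bits run on the SQL Server host with the RSAT ActiveDirectory PowerShell module installed.

```powershell
# Hypothetical sketch -- account, host, and group names are all placeholders.

# One-time per domain, if it hasn't been done already (AD admin; there's a replication delay in real life):
# Add-KdsRootKey -EffectiveImmediately

# Create the gMSA and say which computers may retrieve its password (a group of SQL Server hosts here):
New-ADServiceAccount -Name "gmsaSQL01" `
    -DNSHostName "gmsaSQL01.corp.example.com" `
    -PrincipalsAllowedToRetrieveManagedPassword "SQL-Servers"

# Then, on the SQL Server host itself:
Install-ADServiceAccount -Identity "gmsaSQL01"
Test-ADServiceAccount -Identity "gmsaSQL01"    # should come back True
```

From there, you assign it as the service account in SQL Server Configuration Manager like any domain account, except the name gets a trailing $ (CORP\gmsaSQL01$) and the password field stays blank.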

Since I’m mostly here to talk about SQL Server, I’ll note a couple of different support situations. gMSAs are supported for everything you can do with SQL Server–standalone instances, Failover Cluster Instances, Availability Groups–starting with SQL Server 2014 running on Windows Server 2012 R2 or later. Plain Managed Service Accounts (MSAs) go back a little further, but they only support standalone instances of SQL Server.

From a non-SQL Server perspective, one of the major disadvantages of gMSAs is that you can’t just use them everywhere. Services have to be specifically designed to support these accounts, and not every service is.

Since this isn’t exactly a new feature, there are plenty of docs and blog posts out there covering it and the various requirements to implement it. There’s a great overview and setup post on MSDN here: https://blogs.msdn.microsoft.com/markweberblog/2016/05/25/group-managed-service-accounts-gmsa-and-sql-server-2016/

That post links to this old TechNet article, which still is a pretty good resource for understanding what these things are and a little more detail on what is going on in the back-end: https://technet.microsoft.com/en-us/library/hh831782.aspx

Finally, my coworker Joey has a slightly older writeup here, https://joeydantoni.com/2012/12/14/group-managed-service-accounts/, that walks through the process of setting this up. Note that some of the requirements have changed since that was written, but the general process remains the same.

gMSAs are a nice feature that isn’t too onerous to set up, but goes a long way toward making your life easier and your data far more secure.