The day of the blue screens of death


Earlier today, a broken update from CrowdStrike caused massive global outages on machines running Windows. A component named “Falcon”, which is responsible for endpoint protection and runs in kernel mode, caused machines to crash with a blue screen of death (BSOD) and enter a boot loop. The affected industries include airlines and airports, railways, healthcare, postal services, broadcasting, retail shops, banks, and stock exchanges, among others.

While I could talk about the reason behind this outage, people smarter than me have already analyzed it, so go read their reviews if you are interested. Instead, I want to talk about how we got here.

When you zoom out, and look at the situation from above, you realize two important facts:

  1. A lot of critical infrastructure is still running on Windows
  2. Windows is not secure enough

I haven’t used Windows in over a decade now. I first switched to Linux, and after getting a Mac from my former employer, I switched to macOS. Occasionally, I run Windows in an isolated VM when I have to test my work on Windows. Back in the days when I used to run Windows, I remember that the first thing I had to install on a fresh machine was an antivirus. Modern Windows still feels like malware. It’s sad that Microsoft decided to focus on putting ads in their system, rather than making it more secure by default. Instead, you have to turn to third-party providers to protect your systems.

Moreover, these third-party providers operate in a push manner, where updates are distributed automatically without any approval from the system administrator. On Linux-based servers and systems, updates are usually performed in a pull manner, where the DevOps engineer or system administrator goes and updates the system and its underlying components. The result of a push-based system is evident in today’s outage: a private company has access to most of the critical civilian infrastructure and can cause an economic disaster. Hell, forget the economy, I’ve read comments on Hacker News about hospitals going down during heart attack surgeries.
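To make the difference between the two models concrete, here is a minimal Python sketch. It is a toy model under my own assumptions, not how CrowdStrike or any Linux package manager actually works; `Machine`, `push_update`, and `pull_update` are hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    name: str
    version: str = "1.0"
    # Pull model only: versions an administrator has explicitly approved.
    approved: set = field(default_factory=set)


def push_update(fleet, new_version):
    """Push model: the vendor decides, and every machine updates at once."""
    for machine in fleet:
        machine.version = new_version


def pull_update(machine, available_version):
    """Pull model: the machine updates only if its admin approved the version."""
    if available_version in machine.approved:
        machine.version = available_version
        return True
    return False


if __name__ == "__main__":
    # Push: one bad release reaches the whole fleet with no human in the loop.
    fleet = [Machine("airport-kiosk"), Machine("hospital-ws")]
    push_update(fleet, "2.0")
    print([m.version for m in fleet])

    # Pull: nothing changes until someone approves the new version.
    server = Machine("linux-server")
    print(pull_update(server, "2.0"), server.version)
    server.approved.add("2.0")
    print(pull_update(server, "2.0"), server.version)
```

In the push case, the blast radius of a broken release is the entire fleet. In the pull case, it is limited to the machines whose administrators chose to take the update.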

I could rant about how irresponsible CrowdStrike was during the rollout of this update, how they should have better rollout mechanisms, or how we should find the ones responsible and prosecute them. But I hate it when, after every incident, people jump into the “we need to find someone to blame/take responsibility” mantra. The sad reality is that nothing will happen, and no one will take responsibility. Instead, I want to talk about something that hardly anyone has mentioned: the security theater that many companies practice.

In order to understand and prevent such cases in the future, we need to talk about why CrowdStrike is so popular. You see, as companies become bigger, they start to drown in bureaucracy. Some of it is self-inflicted, while the rest is imposed by governments or regulators. To put it bluntly: companies need to check boxes.

The intention of “box checking” is good: we want to prevent outages and security incidents. But at some point, this ritual becomes so ingrained in the minds of the developers and the management that nobody dares to challenge it anymore. “We need to run security scans.” Why? Nobody knows, we just need to in order to pass some audit. And sure, at the beginning you try your best to fix all vulnerabilities found by the scans. But then you meet a stubborn person, to whom you try to explain that this particular vulnerability cannot be exploited because the package responsible for it is used only during the testing phase and is removed from the final build. But someone needs to tick a box somewhere, and so you sweat and pull out your hair in a desperate attempt to fix this vulnerability, and eventually you end up with some hack, just to satisfy the scanner and your boss/auditor.

And sure, this example may be somewhat imaginary, but it is not far from the truth. Everyone who has worked at a medium or large corporation has encountered horror stories about security theater: a ritual or a workflow that exists for the sake of a checkbox on some audit paper, but serves no real purpose in strengthening the security of the company or the product. And so we end up outsourcing our security to third-party companies who scare you with how 62 minutes of downtime can kill your business, so you should use their software. A sales rep convinces a higher-up manager, who convinces everyone else that we must use it for our own security. And there is no sane person in this chain to question the absurdity of the decision and the potential dangers it might inflict on your business, because they tell you: “we can’t fail”.

But the truth is: everyone can fail. Companies are made of people, and people make mistakes. The question is how you recover from them. And in the case of this outage, the recovery is very painful. It’s a long weekend where system administrators will need to recover every single machine. I don’t envy them. “Somebody should take responsibility.”


Published by

Dmitry Kudryavtsev

Senior Software Engineer / Tech Entrepreneur

With more than 14 years of professional experience in tech, Dmitry is a generalist software engineer with a strong passion for writing code and writing about code.

