44
Crowdstrike Cockup (lemmy.world)
submitted 4 months ago by Tekkip20@lemmy.world to c/asklemmy@lemmy.ml

So as we all know from the news, the cybersecurity firm Crowdstrike Y2K'd its own end customers with a shoddy, untested update.

But how does this happen? Aren't there programming teams that check their code, or pass it to quality assurance staff to see whether it bricks their own machines?

8.5 million machines, too. Does that affect home users as well, or only Windows machines that have this endpoint agent installed?

Lastly, why would large firms and government institutions such as railway networks and hospitals put all their eggs in one basket? Surely chucking everything into "The Cloud (Literally just another man's tinbox)" would be disastrous?

TLDR - Confused how this titanic tits up could happen, and how 8.5 million Windows machines (POS, desktops and servers) just packed up.

Balinares@pawb.social 13 points 4 months ago

This is actually an excellent question.

And for all the discussions on the topic in the last 24h, the answer is: until a postmortem is published, we don't actually know.

There are a lot of possible explanations for the observed events. Of course, one simple and very easy to believe explanation would be that the software quality processes and reliability engineering at CrowdStrike are simply below industry standards -- if we're going to be speculating for entertainment purposes, you can in fact imagine them to be as comically bad as you please, no one can stop you.

But as a general rule of thumb, I'd be leery of simple and easy to believe explanations. Of all the (non-CrowdStrike!) headline-making Internet infrastructure outages I've been personally privy to, and that were speculated about on such places as Reddit or Lemmy, not one of the commenter speculations came close to the actual, and often fantastically complex, chain of events involved in the outage. (Which, for mysterious reasons, did not seem to keep the commenters from speaking with unwavering confidence.)

Regarding testing: testing buys you a certain necessary degree of confidence in the robustness of the software. But this degree of confidence will never be 100%, because in all sufficiently complex systems there will be unknown unknowns. Even if your test coverage is 100% -- every single instruction of the code is exercised by at least one test -- you can't be certain that every test accurately models the production environments that the software will be encountering. Furthermore, even exercising every single instruction is not sufficient protection on its own: the code might for instance fail in rare circumstances not covered by the test's inputs.
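To make the coverage point concrete, here is a minimal sketch in Python. The function `average_latency_ms` and its test are hypothetical and have nothing to do with any CrowdStrike code; they only illustrate how a test can execute every line of a function and still miss an input that production will eventually supply.

```python
def average_latency_ms(samples):
    """Return the mean of a list of latency samples (hypothetical example)."""
    total = sum(samples)
    return total / len(samples)


def test_average_latency_ms():
    # This single test executes every line of the function above,
    # so a line-coverage tool would report 100% for it.
    assert average_latency_ms([10, 20, 30]) == 20


test_average_latency_ms()

# An input the test never modeled: an endpoint that has no samples yet.
try:
    average_latency_ms([])
    crashed_on_empty_input = False
except ZeroDivisionError:
    crashed_on_empty_input = True
```

The test passes and coverage is complete, yet the empty list still crashes the function. Coverage measures which lines ran, not which inputs were tried.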

For these reasons, one common best practice is to assume that the software will sooner or later ship with an undetected fault, and to therefore only deploy updates -- both of software and of configuration data -- in a staggered manner. The process looks something like this: a small subset of endpoints are selected for the update, the update is left to run in these endpoints for a certain amount of time, and the selected endpoints' metrics are then assessed for unexpected behavior. Then you repeat this process for a larger subset of endpoints, and so on until the update has been deployed globally. The early subsets are sometimes called "canary", as in the expression "canary in a coal mine".
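The staggered process described above can be sketched roughly as follows. This is an illustration under assumptions, not how any particular vendor does it: `staged_rollout`, the stage sizes, and the `deploy` and `is_healthy` callbacks are all made-up names, and a real system would wait for a soak period and look at real metrics instead of a boolean.

```python
from typing import Callable, List, Sequence, Tuple


def staged_rollout(
    endpoints: Sequence[str],
    cumulative_stage_sizes: Sequence[int],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> Tuple[List[str], bool]:
    """Deploy to growing subsets of endpoints, halting at the first bad stage.

    Returns the endpoints that received the update and whether the rollout
    reached the whole fleet.
    """
    updated: List[str] = []
    for size in cumulative_stage_sizes:
        # Select the endpoints that bring the updated total up to `size`.
        batch = list(endpoints[len(updated):size])
        for endpoint in batch:
            deploy(endpoint)
        updated.extend(batch)
        # A real system would wait here (the soak period) before assessing.
        if not all(is_healthy(endpoint) for endpoint in batch):
            return updated, False  # stop: later stages never get the update
    return updated, True


# Usage: a hypothetical fleet of 1000 hosts and an update that crashes them.
fleet = [f"host-{i}" for i in range(1000)]
crashed = set()


def deploy_bad_update(endpoint: str) -> None:
    crashed.add(endpoint)


def is_healthy(endpoint: str) -> bool:
    return endpoint not in crashed


updated, completed = staged_rollout(
    fleet, [10, 100, 1000], deploy_bad_update, is_healthy
)
```

With the faulty update, only the first 10 canary hosts are affected and the other 990 never receive it. That containment is the whole point of the practice.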

Why such a staggered deployment did not appear to occur in the CrowdStrike outage is the unanswered question I'm most curious about. But, to give you an idea of the sort of stuff that may happen in general, here is a selection of plausible scenarios, some of which have been known to occur in the wild in some shape or form:

  • The update is considered low-risk (for instance, it's a minor configuration change without any code change) and there's a pressing reason to expedite the deployment, for instance if it addresses a zero-day vulnerability under active exploitation by adversaries.
  • The update activates a feature that an important customer wants now, the customer phoned a VP to express such, and the VP then asks the engineers, arbitrarily loudly, to expedite the deployment.
  • The staggered deployment did in fact occur, but the issue takes the form of what is colloquially called a time bomb, where it is only triggered later on by a change in the state of production environments, such as, typically, the passage of time. Time bomb issues are the nightmare of reliability engineers, and difficult to defend against. They are also, thankfully, fairly rare.
  • A chain of events resulting in a misconfiguration where all the endpoints, instead of only those selected as canaries, pull the update.
  • Reliability engineering not being up to industry standards.
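As an illustration of the time bomb scenario in the list above, here is a deliberately contrived Python sketch. The table `DAILY_QUOTA` and the function `quota_for` are hypothetical, and the dates are picked only for the example. The bug is an assumption that every year has 365 days, so canaries checked in July look perfectly healthy while the same code fails on the last day of a leap year.

```python
from datetime import date

# Hypothetical lookup table with one entry per day of the year.
# The hidden bug: it assumes a year never has more than 365 days.
DAILY_QUOTA = [100] * 365


def quota_for(day: date) -> int:
    """Look up the quota for a given calendar day (tm_yday is 1-based)."""
    return DAILY_QUOTA[day.timetuple().tm_yday - 1]


# During a canary phase in July, everything looks fine.
canary_result = quota_for(date(2024, 7, 19))

# Months later, with no new deployment at all, the same code fails:
# 2024 is a leap year, so 31 December is day 366.
try:
    quota_for(date(2024, 12, 31))
    failed_later = False
except IndexError:
    failed_later = True
```

No amount of watching the canaries in July would have revealed this, because the trigger is a change in the environment (the date) and not in the software.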

Of course, not all of the above fit the currently known (or, really, believed-known) details of the CrowdStrike outage. It is, in fact, unlikely that the chain of events that resulted in the CrowdStrike outage will be found in a random comment on Reddit or Lemmy. But hopefully this sheds a small amount of light on your excellent question.

this post was submitted on 20 Jul 2024
44 points (100.0% liked)
