Port-scanning the fleet and trying to put out fires (rachelbythebay.com)
116 points by rcarmo on March 24, 2024 | 28 comments


> For that matter, the whole port-scanning agent/aggr combination shouldn't have needed to exist in theory, but in practice, independent verification was needed.

This tickles me as one of the truths about operational work. I can't count how often I have said something like "Yes, technically we don't need all of that... but..." Systems tend to get weird in production once many people use them.

Like, we have a central component at work, and most things have good client-side load balancing and retries in place. So we decided against adding a server-side load balancer: more software and systems on the path means more latency and more pieces that can fail.

Except, as we found out during an outage, ~2-3 critical requests into this central component, made from a third-party component... aren't load balanced or retried, and just fail hard if they hit a bad node. And then everything escalates into a mess very quickly.

So now we have both kinds of load balancing. Isn't that wonderful?


> I can't count how often I have said something like "Yes, technically we don't need all of that.... but..." Systems tend to become weird in production if many people use them.

Minor note (of the literally/figuratively type): shouldn’t that be ‘yes, theoretically we don’t need all of that, but technically [or “practically”] we do’?

I think that ‘technically’ connotes actual correctness.


I just wrote a longer piece, but deleted it because I realized what I was thinking.

I think this is the right way around, because I lean towards the "technically correct" side here. "Technically correct" always carries a bit of an ivory tower snobby attitude to me.

After all, the tech I'm working with has proven and working client side load balancing and retries, so there is no server side load balancing necessary. That's technically correct if you observe the system in isolation. And it's also correct that the bad software interfacing with it should fix their implementation.

It's just not correct in practice if terrible software, implementation bugs, or all manner of things beyond a simple, correct implementation start interfacing with the system. Then even a technically or clinically or theoretically correct system will fall apart because there's mud in places where mud shouldn't be able to get to. After all, a patch in a month won't help me with a preventable outage tomorrow.


That's one of the top rules in programming: check your own errors, even when it seems like they can't happen, because eventually they will.


It's the old saying - "the difference between theory and practice is in theory there's no difference, in practice there is."

We refer to "unexpected critical things" as "load bearing flasks" - you never know they're load bearing until you get ye flask.


Treating your systems like a black box is a good way to make them more robust and discover issues that you otherwise wouldn't notice with internal monitoring that bypasses them via shortcuts. It's also why bug bounties are so effective at uncovering bugs - external observers bring a new perspective.


Rachel always writes great stuff. There's one point I want to dig in on.

> since they were so low in the stack, when they broke, lots of other stuff broke

It is interesting that while engineers are far too happy to complain about dependency hell in terms of libraries and languages, they don't complain enough about dependency hell when it comes to systems. Engineers should take a more proactive role in identifying and pushing whatever changes are required to prevent mass outages due to single points of failure. In other words: if a significant percentage of your business requires that system X be up and running, then your business has made a grave error. Engineers should be the first to point this out.

"That's the architect's job"

My thoughts on the usefulness of architecture aside, it's also the engineer's job. Not only should an engineer identify and be aware of system dependencies, they should also build software that allows for, prevents, and gracefully handles system dependency breakage. This is why the engineer should understand the business as much as the technical side. (This is especially true for leads.)

"But I inherited this system"

Yeah, it sucks. But now you've got work to do to mitigate risk. Risk mitigation outside the context of cybersecurity is often overlooked. This is not advocacy for deploying to every data center/cloud provider in the world in the name of high availability. You'll need to do the math to make sure it works for those critical pathways.

EDIT: clarity


> Engineers should take a more proactive role in identifying and pushing whatever changes are required to prevent mass outages due to single points of failure.

In my experience even if you've eliminated single points of failure you can still get failures.

Sure, my server's got dual NICs and dual power supplies. But if the guy sent to replace server 12 in rack 345 accidentally gets server 12 in rack 346 I'm going to lose both NICs and both power supplies at once.

Sure, the network is fully redundant. But we naturally need to keep the settings on the two sets of kit in sync. We've had outages due to them getting out of sync in the past, not going to make that mistake again. So of course if there's a change to the firewall rules it automatically gets rolled out to both of the firewalls.

And so on.


Yes, this is true. There's no such thing as 100% redundancy or uptime in the real world without significant cost. In my experience, the likelihood of a total failure goes down exponentially with each successive layer of redundancy or fallback.


The author does not mention the software in question. My assumption is that it's ZooKeeper, simply because in my previous role I had to build the exact same thing. We didn't have thousands of nodes, but we deployed (and managed) ensembles on prem at customers. Things would go tits up quite regularly, and would be difficult to recover (or even identify). Netcatting 4-letter magic words to ZooKeeper, then parsing and aggregating the results, was how we got a handle on things.

Eventually we built this into the application we were deploying, so that it could monitor its ZooKeepers, raise alarms, and give us insights into aggregated reliability data.
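Those four-letter-word probes don't need netcat specifically; anything that can open a socket works. A minimal Python sketch of the probe (host is a placeholder; 'ruok', 'stat', and 'mntr' are real ZooKeeper commands, and a healthy server answers 'ruok' with 'imok'):

```python
import socket

def four_letter_word(host, word, port=2181, timeout=5.0):
    """Send a ZooKeeper four-letter command (e.g. 'ruok', 'stat', 'mntr')
    and return the raw response text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(word.encode("ascii"))
        sock.shutdown(socket.SHUT_WR)  # signal end of request
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection when done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")

# e.g. four_letter_word("zk1.example.com", "ruok") should return "imok"
# on a healthy node; "mntr" returns aggregatable monitoring stats.
```

Run this per-node and aggregate the "mntr" output and you have roughly the agent/aggregator scheme described above.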


Pretty sure it's about Google's Chubby. It's my understanding Zookeeper was largely inspired by Chubby.


Concur. Zookeeper was absolutely inspired by Chubby, and this story fits in every detail with how things worked at Google back in the day. (Source: I was at Google from 2006 - 2010. Also, I believe the origins of Zookeeper are a matter of public record, though I'm too lazy to check right now.)


What does zookeeper even do? I know it's supposed to be a key-value store. But the only thing I ever hear about zookeeper is not people actually doing stuff with it, but rather people doing stuff for it.

It seems like the most useless piece of software. Like someone wrote a concept down but said it was software.


It basically offers a set of distributed primitives which lets you build distributed applications with far fewer headaches. To be fair, it's mostly pretty good at what it does, but it's not super ops-friendly, and when things do go wrong, recovery isn't a whole lot of fun.

Apache Curator builds on top of ZooKeeper and turns the primitives into usable recipes, which is what I recommend to most people these days.


Like, I'm a consul user, but I'd assume the use cases are similar.

One concrete example: Consul can provide you with locking and thus leader election with relatively little effort. Patroni, the postgres manager, can use a consul lock on a key to elect the postgres leader and consul ensures mutual exclusion.

Or in a similar way, I needed to run some data collection once a day, but the cluster consisted of 3 identical nodes and I didn't want to start treating one node specially.

I could use consul to hold a lock with "consul lock whatever/lock collect-data.sh". Once I had mutual exclusion, I could store the last time data was collected in a consul kv entry with "consul kv put whatever/last-run $(date +%s)". Then each script just grabbed the last collection time (or 0 if missing) and checked whether it was 23ish hours ago. And there we go: all systems identical, fault-tolerant once-a-day data collection, and no additional config to forget on top.
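The "was it 23ish hours ago?" decision is plain timestamp arithmetic once consul has handed over the lock and the kv value. A sketch of just that check in Python (the consul calls themselves are elided, and this is only safe while holding the lock):

```python
import time

DAY_ISH = 23 * 3600  # "23ish hours", so minor scheduling drift never skips a day

def should_collect(last_run, now=None):
    """Decide whether to run the once-a-day collection, given the last-run
    epoch timestamp read from the shared kv entry (0 if the key is missing).
    Safe across identical nodes because the caller holds the distributed lock."""
    if now is None:
        now = int(time.time())
    return now - last_run >= DAY_ISH
```

The lock guarantees at most one node evaluates and updates the timestamp at a time, which is what makes the identical-nodes setup work.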


I've thus far avoided using it, but my recollection is that it kinda snuck into mainstream stacks because it was part of the standard deployment model for the ~2015 era of "big data," with Hadoop, HBase, Solr, and the other usual suspects from the Apache Java ecosystem.

Nobody chose to have Zookeeper in their stack, but they needed it for these other services and so eventually they were stuck with it.


There are good use cases for it. If you have a cluster (or clusters) with many, many nodes running your application you need to be able to communicate changes/configurations to those nodes/applications. Zookeeper is really good at doing this because it itself is clustered (HA, redundant). Updating all of those nodes/applications manually is not a desirable thing, so you have the application call the zookeeper service to get those configuration changes.

The problem I see often is that it's wedged into places for which it's absolutely overkill. As in clusters with less than a dozen nodes, no service level determining such a need, etc. - it's often just a shiny object for architects.


> you have the application call the zookeeper service to get those configuration changes.

As a note, both ZK and etcd have key watches, so you don't need to poll the key-value store: it pushes change notifications out to subscribers, which makes for very quick reactions in distributed applications.
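Stripped of any real client library, the shape of that push model is just "register a callback, get called on every write". A toy in-memory version for illustration (not the ZK or etcd API; real ZK watches are one-shot and must be re-armed, while etcd watches stream, so this is closer to etcd's behavior):

```python
import threading

class WatchableKV:
    """Toy in-memory key-value store with etcd/ZK-style watches: writers push
    change notifications to subscribers instead of clients polling."""
    def __init__(self):
        self._data = {}
        self._watchers = {}  # key -> list of callbacks
        self._lock = threading.Lock()

    def watch(self, key, callback):
        with self._lock:
            self._watchers.setdefault(key, []).append(callback)

    def put(self, key, value):
        with self._lock:
            self._data[key] = value
            callbacks = list(self._watchers.get(key, ()))
        for cb in callbacks:  # notify outside the lock to avoid deadlocks
            cb(key, value)

    def get(self, key, default=None):
        with self._lock:
            return self._data.get(key, default)
```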


> This engagement taught me that a lot of so-called technical problems are in fact rooted in human issues, and those usually come from management.

Kinda weird how often that sort of conclusion pops up, in otherwise completely unrelated articles...


Directors don't like to be reminded that they hired incompetent people.


not that weird..

/real-mode-on

i am slowly (over a few decades) giving up and having to accept that "software IS about humans in the loop".. at least culturally. It wasn't back in the beginning, maybe.. but by now it has more or less become an institution, The_Software, with the usual Shirky's principle applying, etc..

that shuttle blew up because of (mis)culture.. and that wasn't (invisible) software there. i guess we are in for interesting times ahead..


So, in summary, they had:

(1) no accurate inventory of what should be running or what was actually running

(2) fundamental reliability issues with their cluster configuration, not least a failure to implement basic self-integrity checks on services or clusters to ensure they were actually running and in a valid state

(3) a wholly unimplemented mechanism for mapping high-availability clusters across physical infrastructure

(4) no network-layer automation to ensure the accurate provisioning and segmentation of services and access to service data

(5) unknown internal parties randomly accessing and altering cluster member configuration

Yes, this describes a management problem. It sounds like shooting the messenger would be an easy de facto choice for middle management, given how many years and how many dollars it must have taken to create that situation...

For rapid use next time, you can run lsof to list listening services on TCP and UDP ports, e.g. "lsof -nP -iTCP -sTCP:LISTEN" for TCP listeners and "lsof -nP -iUDP" for UDP sockets. For cluster health and service status checks, I have had good experience with https://clusterlabs.org/pacemaker/ + https://github.com/corosync/corosync


> [configs] x [checker] = are out-of-spec clusters due to the configs telling them to be in the wrong spot, or is something else going on? You don't really need to do this one, since if the first one checks out, then you know that everything is running exactly what it was told to run.

though it seems like a good step to run in a CI system, to try to preclude invalid configurations from ever showing up
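That [configs] x [checker] cross-check boils down to set differences between "where the configs say clusters should be" and "where the scanner found them". A hypothetical sketch of the comparison (the post doesn't describe the real data model, so the shapes here are made up):

```python
def diff_placements(expected, observed):
    """Compare config-declared cluster -> set-of-hosts placements against
    what a port-scanning checker actually observed. Returns, per cluster
    with a discrepancy, which hosts are missing and which are unexpected."""
    report = {}
    for cluster in expected.keys() | observed.keys():
        want = expected.get(cluster, set())
        have = observed.get(cluster, set())
        missing, unexpected = want - have, have - want
        if missing or unexpected:
            report[cluster] = {"missing": missing, "unexpected": unexpected}
    return report
```

An empty report is the "everything is running exactly what it was told to run" case; anything else is worth an alert, whether in CI or against the live fleet.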


Shouldn’t half of this be handled by the orchestration tech? Feels like reinventing K8S with band-aid code, frankly.

I’m sure Rachel had her reasons, but yikes.


How do you orchestrate the stuff the orchestration tech depends on?

That service (Chubby) is a dependency of Borg in the same way that etcd is a dependency of Kubernetes. Except the development of etcd got to skip making all the same mistakes that were made when Chubby was developed a decade earlier.


The reason is likely that this work predates Kubernetes entirely, which came out in 2014.


That's a modern lens; back then this was called cutting-edge tech. These lessons are why we have K8S today.


"This engagement taught me that a lot of so-called technical problems are in fact rooted in human issues, and those usually come from management."

Oh hey, I didn't see you there. Welcome to earth.



