
I certainly agree in spirit that the alerts are important, and should be actionable. But I wouldn't start at just "looking at the service" and then trying to define the first set of alerts.

Instead I would move up a level and start with an SLO for the various "business level" metrics you might care about. Things like "request latency", "successful requests", etc.

Then use the longer lookahead "error budget" burndowns to see where your error budget is being spent, and from there decide 1.) if the SLO needs adjusting, and/or 2.) if an alert is appropriate.

To cleanly answer those questions and iterate you'll need metrics, dashboards, traces, and logs. So then you're not just making dashboards because "it's best practice", you're creating them to specifically help you measure whether you're meeting your stated service objectives.
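To make the "error budget" idea concrete, here's a minimal sketch (the numbers and the `burnRate` helper are made up for illustration): the burn rate is the observed error rate divided by the error rate your SLO allows, so anything above 1.0 means you're spending budget faster than the objective permits.

```java
// Sketch: error-budget burn rate for a request-success SLO.
// burn rate = observed error rate / allowed error rate (1 - SLO target).
public class BurnRate {
    static double burnRate(long errors, long total, double sloTarget) {
        double observed = (double) errors / total;
        return observed / (1.0 - sloTarget);
    }

    public static void main(String[] args) {
        // 50 failed out of 10,000 requests against a 99.9% target:
        // 0.5% observed vs 0.1% allowed, i.e. burning budget at ~5x.
        System.out.println(burnRate(50, 10_000, 0.999));
    }
}
```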

https://sre.google/sre-book/service-level-objectives/


SLO timelines are usually over 7d, 30d, etc., no? And in my experience they often don't work that great for backend services ... they can't give you the level of reactivity that defining alerts about the things you care about gives you. In those cases I'd argue for moving from that direction upwards, figuring out which alerts to aggregate and define SLOs around, rather than the other way around.

Would love to hear more, since I largely used SLOs on backend services (which in turn called other services that also had their own SLOs).

As far as timespans for the error budget consumption, I’ve seen 1 hour -> 1 day -> 1 week. The 1 hour error budget rate would be a page and the others would be low priority.

So you could either keep that as the alerting and/or use the error budget “look ahead” to see if there are more specific alerts you need.
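As a sketch of that tiered setup (the thresholds here are illustrative, loosely in the spirit of multi-window burn-rate alerting, not anything specific from this thread): a fast burn over the 1-hour window pages, slower burns over the longer windows file low-priority tickets.

```java
// Sketch: map error-budget burn rates over several windows to a severity.
public class BurnRatePolicy {
    enum Severity { PAGE, TICKET, NONE }

    static Severity classify(double burn1h, double burn1d, double burn1w) {
        if (burn1h >= 14.4) return Severity.PAGE;   // ~2% of a 30d budget gone in 1h
        if (burn1d >= 3.0)  return Severity.TICKET; // ~10% of a 30d budget gone in 1d
        if (burn1w >= 1.0)  return Severity.TICKET; // on pace to exhaust the budget
        return Severity.NONE;
    }

    public static void main(String[] args) {
        System.out.println(classify(20.0, 2.0, 0.5)); // fast burn -> page
        System.out.println(classify(1.0, 4.0, 0.5));  // slow burn -> ticket
        System.out.println(classify(0.5, 0.5, 0.5));  // within budget -> none
    }
}
```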


Interesting! Reading the headline before the article, my brain immediately thought of "jitter".

I wonder if you could extend the `In-process synchronization` example so that the `CompletableFuture.supplyAsync()` thunk first does a random sleep (with the sleep time bounded by an informed value based on the expensive query's execution time), then checks the cache again, and only proceeds with the rest of the example code if the cache is still empty.

That way you (stochastically) get some of the benefits of distributed locking w/o actually having to do distributed locking.

Of course that only works if you're OK adding a bit of extra latency (which should be fine; you're already on the non-hot path) and accepting that more than one query may still be issued to fill the cache.
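A rough sketch of what I mean (names like `runExpensiveQuery` and the cache shape are invented for illustration; the local `Map` stands in for whatever shared cache the real code would hit):

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;

// Sketch: jittered cache fill. Concurrent misses each sleep a random
// amount (bounded by the typical query time), then re-check the cache,
// so most of them find the value a faster sibling already filled in.
class JitteredCacheFill {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final long maxJitterMillis;

    JitteredCacheFill(long maxJitterMillis) {
        this.maxJitterMillis = maxJitterMillis;
    }

    CompletableFuture<String> get(String key) {
        String hit = cache.get(key);
        if (hit != null) {
            return CompletableFuture.completedFuture(hit);
        }
        return CompletableFuture.supplyAsync(() -> {
            try {
                // Spread concurrent fillers out in time...
                Thread.sleep(ThreadLocalRandom.current().nextLong(maxJitterMillis + 1));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            // ...then re-check; only run the query if still empty.
            String recheck = cache.get(key);
            if (recheck != null) {
                return recheck;
            }
            String value = runExpensiveQuery(key);
            cache.put(key, value);
            return value;
        });
    }

    // Stand-in for the expensive backing query.
    String runExpensiveQuery(String key) {
        return "value-for-" + key;
    }
}
```

With a single process a `computeIfAbsent` would already serialize the fillers per key; the jitter mainly buys you something when the cache is shared across processes, where each node's random sleep gives some other node time to populate it first.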


Northguard doesn’t look like it’s been open sourced? I’d be curious to know how it compares to Apache Pulsar [0]. I feel like I see some similarities reading the LI blog post.

0: https://pulsar.apache.org/


That is tough, I’m sorry for your loss.


Thank you for the condolences.


In my experience microservices are easier to manage and understand when organized in a monorepo.


That indicates a strong coupling between those microservices.


Even loose coupling is still coupling. For the things that have to be coupled, having the code organized in the same place, being able to easily read the source for “the other side”, making a change and verifying that dependees' tests still pass, etc., is immensely powerful.


I like using the term “distributed monolith” for those systems with very tightly coupled microservices.


Monorepo is one of few things I’ve drunk the koolaid on. I joke that the only thing worse than being in a monorepo, is not being in one.


Thanks, I'll steal that one! :-)


Bartosz links to it in the Further Reading section, but I wanted to highlight the Wristwatch Revival YouTube channel[0] as well. Really great content and very understandable after reading the article!

0: https://www.youtube.com/c/WristwatchRevival/videos


Love this channel, perfect ASMR for nerds. Marshall is of course also one of the foremost Magic: the Gathering podcasters and commentators.


> perfect ASMR for nerds.

Haha yeah, I like to have this on in the background when I’m doing other things.



> LSP optimizes writing the code.

I would actually phrase that as “LSP optimizes for understanding” (which is of course important for writing code).

For example, when doing code reviews I routinely pull the branch down and look at the diff in context of the rest of the code: “this function changed, who calls it?”, “what other tests are in this file?”, etc. An IDE/LSP is a powerful tool for understanding what is happening in a codebase regardless of author.


Makes me think of this project: https://www.vesta.earth/

