My comment as someone who used to do this professionally:
4 golden signals:
Latency (how long the transaction / query / function takes)
Traffic (queries per second)
Errors (errors per second)
Saturation / Fullness (how close to maximum traffic are you. This can be I/O, how close are you to maximum bandwidth, memory: how close are you to running out of RAM, threads: how many serving threads are you using out of the total thread pool, CPU: how much free CPU capacity do you have)
Don't veer too far from measuring those key things. You might be able to get many other rates and values, but often they're derived from the key signals, and you'd be better off monitoring one of the golden signals instead.
3 types of alerts:
Pages, as in "Paging Doctor Frankenstein". High priority that should interrupt someone and get them to check it out immediately.
Tickets / Bugs. These should be filed in some kind of bug reporting / ticketing system automatically. That system should be one where someone can look up the day's or the week's bugs and see what needs investigation, track it, and then ultimately resolve it. This level of alert is for things that are serious enough that someone should be periodically checking the bug / ticket system to see if anything needs investigation, but not important enough that someone should drop what they're doing and look right away.
Logs. Write info to storage somewhere and keep it around in case it's useful for someone when they're debugging. But, don't page anyone or create a ticket, just keep the info in case someone looks later. Graphs are basically a form of logs, just visual.
Tempting as it may be, never email. Emails just get ignored. If it's high priority you should page. If it's not that high priority, file a bug / ticket.
For latency, use distributions: 50th percentile latency, 90th percentile latency, 99th percentile latency, etc. Meaning for 50th percentile, half the users have this much latency or better. For 99th percentile 99% of users have this much latency or better, 1% of users have this much latency or worse. The reason for this is that an average latency is not very useful. What matters are the outliers. If 99% of operations complete in 500 ms but 1% take 50s, the average latency will still be approx 500 ms, but that one operation that takes nearly a minute can be a sign of something, either breakage or abuse.
Black and white box monitoring are both important.
White box monitoring is monitoring as someone who knows the internals of the system. Say monitoring the latency of the GetFriendsGraph() call. As someone who knows the code you know that that's key to performance, it has a DB as a backend, but there's a memory cache, and so-on.
Black box monitoring is monitoring viewing the system as a black box whose internals you pretend you don't understand.. So, instead of monitoring GetFriendsGraph() you monitor how long it takes to respond to loading http://friends.example.org/list/get_buddies.jsp or whatever. That will include time doing the DNS lookup, time going through the load balancer, querying the frontend, querying the backend(s) backends, and so on. When this kind of monitor experiences errors you don't know what the cause is. It could be broken DNS, it could be broken load balancers, it could be a DB crash. What it does tell you is that this is a user-visible error. With white-box monitoring you may know that the latency on a certain call is through the roof, but the black box monitor can say that it isn't an issue that is actually affecting most users.
In terms of graphing (say GraphViz or whatever), start by graphing the 4 golden signals for whatever seems to be important. Then, treat the graphs like logs. Don't stare at them all the time. Refer back to them when something higher priority (bugs / tickets or alerts) indicates that something needs an investigation. If additional graphs would have helped in the investigation, add more graphs. But, don't do it just to have things to look at. Too many graphs just becomes visual noise.
This is pretty accurate to what I do professionally.
The point made here about the Average user experience is super super important. It’s good to know what that is for several reasons. Mainly performance tuning. But when it comes to trying to prevent disasters the middle isn’t useful.
Another thing to add. This came to me recently. There are two kinds of graphs and dashboards, those for technical folks and those for managers and non-technical folks. You want to develop both or one with variables to then simplify the graphs/dashboard. Annotations and good titles IMHO are good. Some folks prefer to have technical graph titles. I get the draw but I have to deal with multiple leads, C levels, project managers, and managers that don’t care about the technical stat just where it is compared to where it should be
I haven’t looked really. There is a DevOps community I think. Haven’t seen any SRE (site reliability engineer) or monitoring communities, One will probably pop up sooner than later.
My comment as someone who used to do this professionally:
4 golden signals:
Don't veer too far from measuring those key things. You might be able to get many other rates and values, but often they're derived from the key signals, and you'd be better off monitoring one of the golden signals instead.
3 types of alerts:
Tempting as it may be, never email. Emails just get ignored. If it's high priority you should page. If it's not that high priority, file a bug / ticket.
For latency, use distributions: 50th percentile latency, 90th percentile latency, 99th percentile latency, etc. Meaning for 50th percentile, half the users have this much latency or better. For 99th percentile 99% of users have this much latency or better, 1% of users have this much latency or worse. The reason for this is that an average latency is not very useful. What matters are the outliers. If 99% of operations complete in 500 ms but 1% take 50s, the average latency will still be approx 500 ms, but that one operation that takes nearly a minute can be a sign of something, either breakage or abuse.
Black and white box monitoring are both important.
White box monitoring is monitoring as someone who knows the internals of the system. Say monitoring the latency of the GetFriendsGraph() call. As someone who knows the code you know that that's key to performance, it has a DB as a backend, but there's a memory cache, and so-on.
Black box monitoring is monitoring viewing the system as a black box whose internals you pretend you don't understand.. So, instead of monitoring GetFriendsGraph() you monitor how long it takes to respond to loading http://friends.example.org/list/get_buddies.jsp or whatever. That will include time doing the DNS lookup, time going through the load balancer, querying the frontend, querying the backend(s) backends, and so on. When this kind of monitor experiences errors you don't know what the cause is. It could be broken DNS, it could be broken load balancers, it could be a DB crash. What it does tell you is that this is a user-visible error. With white-box monitoring you may know that the latency on a certain call is through the roof, but the black box monitor can say that it isn't an issue that is actually affecting most users.
In terms of graphing (say GraphViz or whatever), start by graphing the 4 golden signals for whatever seems to be important. Then, treat the graphs like logs. Don't stare at them all the time. Refer back to them when something higher priority (bugs / tickets or alerts) indicates that something needs an investigation. If additional graphs would have helped in the investigation, add more graphs. But, don't do it just to have things to look at. Too many graphs just becomes visual noise.
This is pretty accurate to what I do professionally.
The point made here about the Average user experience is super super important. It’s good to know what that is for several reasons. Mainly performance tuning. But when it comes to trying to prevent disasters the middle isn’t useful.
Another thing to add. This came to me recently. There are two kinds of graphs and dashboards, those for technical folks and those for managers and non-technical folks. You want to develop both or one with variables to then simplify the graphs/dashboard. Annotations and good titles IMHO are good. Some folks prefer to have technical graph titles. I get the draw but I have to deal with multiple leads, C levels, project managers, and managers that don’t care about the technical stat just where it is compared to where it should be
Thanks for the reply. Do you know of any Fediverse community for people into things like monitoring, paging, alerts and occasional sleepless nights?
NP.
I haven’t looked really. There is a DevOps community I think. Haven’t seen any SRE (site reliability engineer) or monitoring communities, One will probably pop up sooner than later.