I’m a big fan of netdata; it’s part of my standard deployment. I put in some custom configs depending on what services are running on what servers. If there’s an issue it sends me an email and posts into a slack channel.
Next step is an influxdb backend to keep more history.
I also use monit to restart certain services in certain situations.
My worst feeling is “I tried that two years ago but couldn’t get enough people interested, so I dropped it…”