Alerting
Alerting should be very specific. It is easy to set a threshold on every monitored metric and attach an alarm to it, but that leads to alert fatigue, distraction, and eventually to alerts being ignored, see for more details.
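One common way to keep alerts specific is to fire only when a threshold has been breached for several consecutive samples, rather than on a single spike. The following is a minimal sketch of that idea; the class name `SustainedThreshold`, the disk-usage metric, and the numbers are illustrative assumptions, not taken from any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class SustainedThreshold:
    """Fires only after `required` consecutive samples exceed `limit`."""

    limit: float
    required: int
    _streak: int = 0  # number of consecutive breaching samples seen so far

    def observe(self, value: float) -> bool:
        # One sample back within bounds resets the streak.
        self._streak = self._streak + 1 if value > self.limit else 0
        return self._streak >= self.required


# Hypothetical rule: disk usage above 90% for 3 samples in a row.
disk_alert = SustainedThreshold(limit=90.0, required=3)
samples = [95.0, 85.0, 91.0, 92.0, 93.0, 94.0]
fired = [disk_alert.observe(s) for s in samples]
```

The single spike at the start does not fire; only the sustained run at the end does. Most alerting systems offer an equivalent "for a duration" setting, which is usually preferable to hand-rolled logic.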
Alerts should never be ignored, even if you think you have an idea what caused them.
For good tips on alerting in general, see .
SaaS:
PagerDuty
VictorOps
Every cloud service has a practice called “being on-call”: at any given moment, one person is responsible for reacting to alerts, regardless of when they happen.
That means being ready to act in the middle of the night, on weekends, and so on. It is a tedious and tiring position to be in, so rotate people through it often.
An example of the on-call policy could be found in this .
What to look for first?
Is the node up and running? Is the validator client up and running? Are CPU, RAM, and disk space okay?
Read the logs. Are there enough peers? Does the validator client find the number of validators you expect?
Is your node in sync, or is it still syncing? If so, is it on the right fork? Check your node's head against any public block explorer or in a community channel.
Is the network finalizing? The finalized epoch should advance every 6.4 minutes (one epoch is 32 slots of 12 seconds each on Ethereum mainnet).
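The checks above can be sketched in code. The example below assumes the node exposes the standard Beacon API and that the JSON responses of `/eth/v1/node/syncing`, `/eth/v1/node/peer_count`, and `/eth/v1/beacon/states/head/finality_checkpoints` have already been fetched and parsed. The function name `first_checks` and the thresholds `min_peers` and `max_finality_lag` are illustrative assumptions; verify the response fields against your own client before relying on them.

```python
SECONDS_PER_SLOT = 12  # Ethereum mainnet value
SLOTS_PER_EPOCH = 32   # Ethereum mainnet value


def epoch_minutes() -> float:
    """Length of one epoch in minutes (32 slots of 12 seconds)."""
    return SECONDS_PER_SLOT * SLOTS_PER_EPOCH / 60


def first_checks(
    syncing: dict,
    peer_count: dict,
    finality: dict,
    min_peers: int = 10,        # assumed threshold, tune for your setup
    max_finality_lag: int = 4,  # assumed threshold, in epochs
) -> list[str]:
    """Return a human-readable list of problems; empty means all checks passed."""
    problems = []
    sync = syncing["data"]

    if sync["is_syncing"]:
        problems.append(f"node is syncing, {sync['sync_distance']} slots behind")

    # The Beacon API encodes integers as strings.
    connected = int(peer_count["data"]["connected"])
    if connected < min_peers:
        problems.append(f"only {connected} peers connected")

    # Compare the node's own head epoch with the last finalized epoch.
    head_epoch = int(sync["head_slot"]) // SLOTS_PER_EPOCH
    finalized_epoch = int(finality["data"]["finalized"]["epoch"])
    lag = head_epoch - finalized_epoch
    if lag > max_finality_lag:
        problems.append(f"finality is {lag} epochs behind the head")

    return problems


# Hand-written sample responses, not real network data.
sample_syncing = {"data": {"head_slot": "6400", "sync_distance": "0", "is_syncing": False}}
sample_peers = {"data": {"connected": "3"}}
sample_finality = {"data": {"finalized": {"epoch": "190", "root": "0x00"}}}
problems = first_checks(sample_syncing, sample_peers, sample_finality)
```

With the sample data, the node is in sync but reports too few peers and a finality lag of 10 epochs, so two problems are returned. Keeping the checks as a pure function over parsed responses makes them easy to test without a running node.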
An inactivity leak, as the term is used here, means your validator was assigned a duty (attesting to the chain head or proposing a block) and did not perform it in time. Strictly speaking, the protocol reserves the term for the larger penalties applied while the chain is not finalizing; what is described here are ordinary missed-duty penalties.
These penalties are relatively small. They reduce the validator's yearly yield, but it takes a long time for them to add up.
That gives you some choice in how to handle them:
1. React ASAP: use this if you have a proper DevOps team and want to optimize node performance for the best possible APY. Ensure regular rotations in the on-call team.
2. React only during “business hours”, including weekends: only notify, say, from 9:00 to 21:00 every day. That greatly reduces the strain on the on-call personnel.
3. React only during business hours on weekdays: same as (2), but don't notify on weekends.
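The three options above amount to a notification-window policy, which could be sketched as follows. The names `Policy` and `should_notify` are hypothetical, the 9:00 to 21:00 window is the example from the text, and time zone handling is left to the caller (the timestamp is assumed to be in the on-call team's local time).

```python
from datetime import datetime
from enum import Enum


class Policy(Enum):
    ASAP = 1               # notify at any time
    DAYTIME_EVERY_DAY = 2  # notify 9:00 to 21:00, weekends included
    DAYTIME_WEEKDAYS = 3   # notify 9:00 to 21:00, Monday to Friday only


def should_notify(
    policy: Policy,
    now: datetime,
    start_hour: int = 9,
    end_hour: int = 21,
) -> bool:
    """Decide whether an alert raised at `now` should page someone."""
    if policy is Policy.ASAP:
        return True
    in_window = start_hour <= now.hour < end_hour
    if policy is Policy.DAYTIME_EVERY_DAY:
        return in_window
    # datetime.weekday(): Monday is 0, Sunday is 6.
    return in_window and now.weekday() < 5


wednesday_night = datetime(2024, 1, 3, 3, 0)
saturday_morning = datetime(2024, 1, 6, 10, 0)
```

For example, `should_notify(Policy.DAYTIME_EVERY_DAY, saturday_morning)` is true while `should_notify(Policy.DAYTIME_WEEKDAYS, saturday_morning)` is false. Alerts suppressed outside the window should still be recorded so they can be reviewed when the window opens.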
Possible Network Attacks
Which channels should you reach out to if you think it's a network attack?
Possible Vulnerability
Which channels should you reach out to if it's a possible vulnerability?
Explanation of the basic alerts to incorporate (delayed slots, low participation rate, block processing times, etc.)
How long to store data for?
Generic recommendations on alerting/metrics best practices
What to look for first? When do you raise an alarm? First steps required during incident response
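As a rough illustration of the basic alerts named in this outline (delayed slots, low participation rate, block processing times), each one can be expressed as a named rule over a metric. Everything specific below is an assumption for the sake of the sketch: the metric names are hypothetical and do not match any particular client's exported metrics, and the thresholds are placeholders to be tuned.

```python
from typing import Callable, NamedTuple


class Rule(NamedTuple):
    name: str
    metric: str                        # hypothetical metric name
    breached: Callable[[float], bool]  # True when the value should alert


# Placeholder thresholds; tune them for your own setup.
RULES = [
    Rule("delayed slots", "seconds_since_last_slot", lambda v: v > 36),
    Rule("low participation rate", "participation_rate", lambda v: v < 0.80),
    Rule("slow block processing", "block_processing_seconds", lambda v: v > 2.0),
]


def evaluate(metrics: dict[str, float]) -> list[str]:
    """Return the names of all rules whose metric is present and breached."""
    return [
        rule.name
        for rule in RULES
        if rule.metric in metrics and rule.breached(metrics[rule.metric])
    ]


# Hand-written sample values, not real measurements.
sample = {
    "seconds_since_last_slot": 12,
    "participation_rate": 0.71,
    "block_processing_seconds": 3.5,
}
alerts = evaluate(sample)
```

With the sample values, the participation and block-processing rules fire while the slot-delay rule does not. Keeping rules as data makes it easy to review which alerts exist and to retire ones that only add noise.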