Nodes Health
How can you know that a node is healthy? Most importantly, how can you check it programmatically? This is one of the criteria for monitoring a validator service at scale.
Basic health check from the node
Most node clients implement a health check API endpoint. For example, Lighthouse has an endpoint called "/lighthouse/health". It should return "HTTP 200 OK" when everything is set up correctly; any other response means the node software is having an issue.
This is usually the most basic check, and it only tells you that the software is up and running, so you shouldn't rely on it alone.
Different health symptoms
Here are a few common symptoms and their causes.
Some of the following symptoms are urgent, which means they have to be fixed immediately. Others only show that the system is degrading.
To serve as a validator, both the CL and the EL need to be up to date with the network. There are a couple of techniques for checking that.
All of this health check data should feed into a monitoring tool of your choice.
Internal health checks
Most nodes expose health check APIs that return HTTP 5xx if the node is not syncing properly and HTTP 200 OK if everything is okay. That is the simplest and most basic version of the health check.
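The status-code rule above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the example URL assumes a Lighthouse node whose HTTP API listens on localhost port 5052, so adjust both to your client. Only 200 is treated as healthy, following the rule that any other response indicates a problem.

```python
import urllib.error
import urllib.request


def classify_status(status_code: int) -> str:
    """Map the HTTP status of a health endpoint to a coarse state."""
    # Only 200 counts as healthy; every other status is treated as a problem.
    return "healthy" if status_code == 200 else "unhealthy"


def check_health(url: str, timeout: float = 5.0) -> str:
    """Query a health endpoint and return 'healthy', 'unhealthy' or 'unreachable'."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return classify_status(response.status)
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx, but the status code is still available.
        return classify_status(err.code)
    except OSError:
        # Connection refused, DNS failure or timeout (URLError is an OSError).
        return "unreachable"


if __name__ == "__main__":
    # Assumed URL; replace with the health endpoint of your own node.
    print(check_health("http://localhost:5052/lighthouse/health"))
```

Keeping "unreachable" separate from "unhealthy" lets the monitoring tool distinguish a node that is down from one that is up but reporting a problem.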
Timestamp Checks
Another strategy is based on the timestamp of the latest block. For the EL, that is the response to eth_getBlockByNumber("latest", false). It has a field called timestamp. Knowing the timestamp of the block and the block production rate (1 block per 12 seconds), you can see how old the node's current block is.
Since block proposals are sometimes missed, it doesn't make sense to keep this threshold too tight, but if the block is more than 5 minutes old, it makes sense to mark the node as "unhealthy" and notify your monitoring system.
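As a rough sketch of the timestamp check, the following Python code fetches the latest block over JSON-RPC and compares its age to the 5-minute threshold. The function names and the RPC URL are assumptions for illustration; the one protocol detail relied on is that the timestamp field is a hex-encoded number of seconds since the Unix epoch.

```python
import json
import time
import urllib.request
from typing import Optional

MAX_BLOCK_AGE_SECONDS = 5 * 60  # threshold suggested in the text


def fetch_latest_block(rpc_url: str, timeout: float = 5.0) -> dict:
    """Call eth_getBlockByNumber("latest", false) on an EL JSON-RPC endpoint."""
    payload = json.dumps({
        "jsonrpc": "2.0",
        "id": 1,
        "method": "eth_getBlockByNumber",
        "params": ["latest", False],
    }).encode()
    request = urllib.request.Request(
        rpc_url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["result"]


def block_age_seconds(block: dict, now: Optional[float] = None) -> float:
    """Return how many seconds ago the block was produced."""
    now = time.time() if now is None else now
    # The timestamp is hex-encoded, e.g. "0x6553f100".
    return now - int(block["timestamp"], 16)


def is_stale(block: dict, now: Optional[float] = None,
             max_age: float = MAX_BLOCK_AGE_SECONDS) -> bool:
    """True if the block is older than the allowed threshold."""
    return block_age_seconds(block, now) > max_age


if __name__ == "__main__":
    # Assumed local EL endpoint; replace with your node's RPC URL.
    latest = fetch_latest_block("http://localhost:8545")
    print("unhealthy" if is_stale(latest) else "healthy")
```

Passing `now` explicitly keeps the age calculation deterministic, which makes the threshold logic easy to unit test without a running node.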
Source Of Truth Checks (Forks)
Finally, there is the case where blocks are being synced, but the node is on the wrong fork. This is easy to detect on the EL by using the block hash returned from eth_getBlockByNumber("latest", false).
You can compare these hashes across your nodes and also against external sources of truth.
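The comparison can be sketched as a majority vote over block hashes. This is a minimal illustration with a hypothetical function name and made-up source names. It assumes all hashes were collected for the same block number; since "latest" can differ by a block or two between nodes at any instant, a real check would likely request a specific block number from every source.

```python
from collections import Counter
from typing import Dict, List


def find_divergent_sources(hashes_by_source: Dict[str, str]) -> List[str]:
    """Return the sources whose block hash differs from the majority hash.

    Assumes every hash refers to the same block number.
    """
    if not hashes_by_source:
        return []
    # Hex strings may differ in case between clients, so normalize first.
    normalized = {name: h.lower() for name, h in hashes_by_source.items()}
    # On a tie, the hash seen first wins, so list trusted sources first.
    majority_hash, _ = Counter(normalized.values()).most_common(1)[0]
    return sorted(name for name, h in normalized.items() if h != majority_hash)


if __name__ == "__main__":
    # Hypothetical, shortened hashes for illustration only.
    observed = {
        "external-source": "0xabc",
        "node-a": "0xabc",
        "node-b": "0xABC",
        "node-c": "0xdef",
    }
    print(find_divergent_sources(observed))
```

A non-empty result is a signal to investigate rather than proof of a wrong fork, because the majority itself can be wrong if most of the sources share the same fault.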