Process Restart Alert
Why do you receive this message?
The restart alert is sent when the system detects that the node process was restarted. You receive it when:
- The `process_start_time_seconds` metric changed (difference > 1 second)
- The node process was restarted
- The system detected the change during metric collection
ℹ️ Information: A restart can be intentional (manual restart) or unintentional (crash, system reboot). It's worth checking the cause.
What does the message contain?
- Alert image (IMG_0499.png) - Sent as the first message
- Status - ℹ️ Node Restart Detected
- Restart time - Exact timestamp of restart (from Prometheus metric)
- Source - `process_start_time_seconds` (process start timestamp)
- Summary - Explanation of possible causes
- Recommended checks - Commands to diagnose the cause
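
To illustrate how the restart time could be derived from the metric, here is a minimal Python sketch. It assumes the node exposes metrics in the Prometheus text exposition format. The function names `parse_process_start_time` and `format_restart_time` are hypothetical and not part of the monitoring tool, and the parser is deliberately simplified (for example, it does not handle a `}` inside a quoted label value).

```python
from datetime import datetime, timezone
from typing import Optional

METRIC = "process_start_time_seconds"


def parse_process_start_time(exposition_text: str) -> Optional[float]:
    """Return the metric value (Unix timestamp), or None if it is absent."""
    for raw in exposition_text.splitlines():
        line = raw.strip()
        # Comment lines (# HELP / # TYPE) start with '#', so they are skipped here.
        if not line.startswith(METRIC):
            continue
        rest = line[len(METRIC):]
        if rest.startswith("{"):
            # Skip an optional label set such as {job="node"}.
            closing = rest.find("}")
            if closing == -1:
                continue
            rest = rest[closing + 1:]
        elif not rest[:1].isspace():
            # A different metric that merely shares the prefix.
            continue
        fields = rest.split()
        if not fields:
            continue
        try:
            return float(fields[0])
        except ValueError:
            continue
    return None


def format_restart_time(start_time_seconds: float) -> str:
    """Render the Unix timestamp as a human-readable UTC time."""
    moment = datetime.fromtimestamp(start_time_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d %H:%M:%S UTC")
```

Formatting in UTC is a choice made for this sketch; the real alert may use a different time zone or layout.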
How should you react?
- Check if the system restarted: `last -x reboot | head`
  If the system restarted, the node restart was probably caused by the system reboot.
- Check system boot logs: `journalctl -b`
  Look for any problems during boot.
- Check node service logs: `journalctl -u redbelly.service -b`
  Review the service logs since the most recent boot.
- Check synchronization status:
  - Open the monitoring panel
  - Check if the node is synchronized
  - Check peer count
- If the restart was unintentional:
  - Check logs for errors
  - Check if there was an OOM kill (`dmesg | grep -i oom`)
  - Check if there were disk problems
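
Two of the checks above can also be gathered in one pass. The following Python sketch is only an illustration, not part of the monitoring tool: the helper names (`run`, `head`, `filter_oom_lines`, `collect_restart_evidence`) are hypothetical, and it assumes `last` and `dmesg` are installed and readable by the current user (`dmesg` may require elevated privileges on some systems).

```python
import shutil
import subprocess
from typing import Dict, List


def run(argv: List[str], timeout: int = 30) -> str:
    """Run a command and return its stdout, or '' if it is missing or fails."""
    if shutil.which(argv[0]) is None:
        return ""
    try:
        result = subprocess.run(
            argv,
            capture_output=True,
            text=True,
            errors="replace",
            timeout=timeout,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return ""
    return result.stdout


def head(text: str, limit: int = 10) -> List[str]:
    """Keep the first `limit` lines, like piping through `head`."""
    return text.splitlines()[:limit]


def filter_oom_lines(text: str) -> List[str]:
    """Keep lines containing 'oom' in any letter case, like `grep -i oom`."""
    return [line for line in text.splitlines() if "oom" in line.lower()]


def collect_restart_evidence() -> Dict[str, List[str]]:
    """Gather the reboot history and OOM traces from the checklist."""
    return {
        "recent_reboots": head(run(["last", "-x", "reboot"])),
        "oom_events": filter_oom_lines(run(["dmesg"])),
    }
```

Calling `collect_restart_evidence()` on the node returns a dictionary with two lists of lines. An empty `oom_events` list only means no matching kernel message was readable, not that an OOM kill can be ruled out.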
Sending Logic
| Element | Details |
|---|---|
| Trigger | collect_metrics_for_endpoint() function during metric collection |
| Check frequency | Every 15 seconds (each metric collection cycle) |
| Metric | process_start_time_seconds from Prometheus (Unix timestamp) |
| Conditions | • abs(current_start_time - last_process_start_time) > 1.0 • Telegram alerts enabled • Metric available in both measurements |
| Tolerance | 1 second (to avoid false alarms on small differences) |
| Format | First image (IMG_0499.png), then text message in Markdown |
| Duplicate prevention | Alert sent only on change of process_start_time_seconds |
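
The conditions in the table can be sketched in Python as follows. This is an assumption-laden illustration, not the actual body of `collect_metrics_for_endpoint()`, which is not shown here: the names `restart_detected`, `RestartTracker`, and `observe` are hypothetical, and sending the Telegram message itself is left out.

```python
from typing import Optional

RESTART_TOLERANCE_SECONDS = 1.0  # tolerance from the table above


def restart_detected(
    last_start_time: Optional[float],
    current_start_time: Optional[float],
    alerts_enabled: bool,
    tolerance: float = RESTART_TOLERANCE_SECONDS,
) -> bool:
    """Return True only when all three conditions from the table hold."""
    if not alerts_enabled:
        return False
    # The metric must be available in both measurements.
    if last_start_time is None or current_start_time is None:
        return False
    return abs(current_start_time - last_start_time) > tolerance


class RestartTracker:
    """Remembers the last seen start time so each restart alerts once."""

    def __init__(self, alerts_enabled: bool = True) -> None:
        self.alerts_enabled = alerts_enabled
        self.last_start_time: Optional[float] = None

    def observe(self, current_start_time: Optional[float]) -> bool:
        """Call once per collection cycle; True means an alert is due."""
        fire = restart_detected(
            self.last_start_time, current_start_time, self.alerts_enabled
        )
        # Storing the new value means later cycles see no change,
        # which is what prevents duplicate alerts in this sketch.
        if current_start_time is not None:
            self.last_start_time = current_start_time
        return fire
```

In this sketch, a missing measurement keeps the previously stored value, so a restart that happens while the metric is briefly unavailable would still be reported on the next successful collection. Whether the real implementation behaves the same way is not stated in the table.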