Google Docs - Monitoring-server-logs
The objective is to monitor the logs of the server and trigger an alert if an error occurs in the logs.
Shell
Although Python is more familiar, a shell script might be better suited for this task: it is simple and gives direct access to the tools the task relies on, such as `ssh` and `grep`. A Python script could achieve the same outcome.
- Maintain a list of servers as an array.
- Iterate through the servers, logging in via SSH using PEM files.
- Check the logs for errors within the last 15 minutes using the `grep` command.
- If errors are found, send email and Slack alerts. Additionally, gather system performance metrics at that time using the `top`, `free`, and `df -h` commands.
- Run this script every 10 minutes using a cron job (`crontab -e`). Cron expression: `*/10 * * * *`.
- Extra Feature: Extend the script to trigger an alert if CPU, memory, or disk usage exceeds 85%, even in the absence of errors in the logs.
- False Alarm Handling: Implement a counter flag to mitigate excessive alerts. Upon the first alert, suppress further alerts for the next 30 minutes (`SUPPRESS_ALERTS`).
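The checks in the list above could be sketched roughly as follows. This is not the contents of the base script; it is a minimal illustration under several assumptions: log lines start with a "YYYY-MM-DD HH:MM:SS" timestamp, GNU date is available, and the suppression flag is kept as a timestamp file so that it survives between cron runs. The names recent_errors, over_threshold, should_alert, check_once, and STATE_DIR are hypothetical. CPU usage is left out because parsing it from top output varies between versions.

```shell
#!/usr/bin/env bash
# Sketch only: the log format, paths, and helper names are assumptions.

THRESHOLD=85                                # percent, as stated in the list
SUPPRESS_SECONDS=$((30 * 60))               # 30-minute quiet period
STATE_DIR="${STATE_DIR:-/tmp/log_monitor}"  # hypothetical state directory

# Read log lines on stdin and print those that contain "error" and whose
# leading "YYYY-MM-DD HH:MM:SS" timestamp is at or after the cutoff ($1).
recent_errors() {
  grep -i 'error' | awk -v cutoff="$1" '($1 " " $2) >= cutoff'
}

# Succeed when a usage value such as "91%" or "91.4" exceeds THRESHOLD.
over_threshold() {
  awk -v v="${1%\%}" -v t="$THRESHOLD" 'BEGIN { exit !(v + 0 > t + 0) }'
}

# Succeed when an alert for this host may be sent, and record the time.
# Fail while the host is still inside the suppression window.
should_alert() {
  local host="$1" now="${2:-$(date +%s)}"
  local stamp="$STATE_DIR/$host.last_alert" last
  mkdir -p "$STATE_DIR"
  if [ -f "$stamp" ]; then
    last=$(cat "$stamp")
    if [ $((now - last)) -lt "$SUPPRESS_SECONDS" ]; then
      return 1
    fi
  fi
  echo "$now" > "$stamp"
}

# One monitoring pass on the local machine. It is defined here but not
# called; the alert text goes to stdout for the caller to email or post.
check_once() {
  local log_file="$1" cutoff errors disk mem
  cutoff=$(date -d '15 minutes ago' '+%Y-%m-%d %H:%M:%S')  # GNU date syntax
  errors=$(recent_errors "$cutoff" < "$log_file")
  disk=$(df -P / | awk 'NR == 2 { print $5 }')
  mem=$(free | awk '/^Mem:/ { printf "%.0f", $3 / $2 * 100 }')
  if [ -n "$errors" ] || over_threshold "$disk" || over_threshold "$mem"; then
    if should_alert "$(hostname)"; then
      printf 'ALERT on %s\n%s\n' "$(hostname)" "$errors"
      top -bn1 | head -n 5; free -m; df -h   # metrics snapshot for the alert
    fi
  fi
}
```

A crontab entry for the schedule in the list would look like `*/10 * * * * /path/to/base_script.sh`, where the path is a placeholder. Because each run looks back 15 minutes but runs every 10, the same error can be seen twice; the suppression window is what keeps it from alerting twice.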
Consider implementing a temporary fix, such as restarting the affected service, once the root cause of the error is identified. This can be integrated into the existing script.
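One way such a temporary fix might be wired in is sketched below. Everything specific here is an assumption for illustration: the error signatures, the service names, the SSH user, the PEM path, the use of systemd, and passwordless sudo on the servers. The helpers service_for_error and restart_service are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch only: signatures, service names, and systemd are assumptions.

PEM_FILE="${PEM_FILE:-$HOME/.ssh/monitor.pem}"  # hypothetical key path
SSH_USER="${SSH_USER:-ec2-user}"                # hypothetical login user

# Map a log line with a known, already-diagnosed error to the service
# that should be restarted. Fails for errors with no known fix.
service_for_error() {
  case "$1" in
    *"Connection pool exhausted"*) echo "myapp" ;;
    *"upstream timed out"*)        echo "nginx" ;;
    *) return 1 ;;
  esac
}

# Restart a service on a remote host. With DRY_RUN=1 the command is only
# printed, which is useful while the mapping is still being tuned.
restart_service() {
  local host="$1" service="$2"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "would run on $host: sudo systemctl restart $service"
    return 0
  fi
  ssh -n -i "$PEM_FILE" -o BatchMode=yes "$SSH_USER@$host" \
    "sudo systemctl restart $service"
}
```

Restricting restarts to an explicit list of known signatures keeps the script from restarting services for errors whose root cause has not been identified yet.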
- Created a base script (`base_script.sh`).
- Utilized ChatGPT to enhance the base script by providing input.
- ChatGPT suggested going with a text file that will have the list of servers instead of having it as an array.
- Utilized ChatGPT to identify limitations of the script, aiding in understanding edge conditions.
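The text-file approach to the server list could look something like the sketch below. The file name servers.txt, the SSH user, the PEM path, and the helpers list_servers and for_each_server are all assumptions, not taken from the base script.

```shell
#!/usr/bin/env bash
# Sketch only: file name, key path, and user are hypothetical.

SERVERS_FILE="${SERVERS_FILE:-servers.txt}"     # one host per line
PEM_FILE="${PEM_FILE:-$HOME/.ssh/monitor.pem}"
SSH_USER="${SSH_USER:-ec2-user}"

# Print one host per line, trimming whitespace and skipping blank lines
# and lines that start with "#".
list_servers() {
  awk '{ sub(/^[ \t]+/, ""); sub(/[ \t\r]+$/, "") }
       $0 != "" && $0 !~ /^#/' "$1"
}

# Run a command on every listed host. ssh -n stops ssh from reading
# stdin, which would otherwise swallow the rest of the server list.
for_each_server() {
  local file="$1" remote_cmd="$2" host
  list_servers "$file" | while read -r host; do
    echo "== $host =="
    ssh -n -i "$PEM_FILE" -o BatchMode=yes -o ConnectTimeout=10 \
      "$SSH_USER@$host" "$remote_cmd" \
      || echo "WARN: could not check $host" >&2
  done
}
```

For example, `for_each_server "$SERVERS_FILE" 'df -h'` would print the disk usage of each server in turn. Keeping the hosts in a file means servers can be added or removed without editing the script, and an unreachable host produces a warning instead of stopping the loop.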