
Senior Engineer - Infra Monitoring
- Abu Dhabi
- Permanent
- Full-time
- To ensure the continuous, proactive, and intelligent monitoring of IT infrastructure through the integration and operation of modern observability tools.
- To develop and operationalise machine learning-based anomaly detection mechanisms for early detection of issues across compute, network, storage, and application layers.
- To support incident prevention and reduction of MTTR (Mean Time to Resolution) through predictive insights, automated alerts, and root cause correlation.
- To enhance operational visibility, reliability, and resilience of critical infrastructure components by applying modern data-driven monitoring strategies.
- Design, implement, and fine-tune infrastructure monitoring solutions across on-prem and cloud platforms.
- Develop ML-driven anomaly detection pipelines using telemetry data (logs, metrics, traces).
- Integrate observability data into a unified dashboard and alerting platform with meaningful visualisations and thresholds.
- Continuously train and evaluate ML models to reduce false positives and increase signal accuracy.
- Collaborate with incident management teams to define actionable alerts and automated remediation triggers.
- Ensure compliance with enterprise standards, regulatory controls, and audit requirements related to monitoring and data collection.
- Maintain documentation of monitoring architecture, detection rules, ML models, and escalation paths.
- Work closely with infrastructure, application, and security teams to improve data ingestion and correlation.
- Contribute to the continuous improvement roadmap for observability maturity (e.g., from reactive to predictive monitoring).
- Mentor junior team members on observability tools, ML practices, and operational excellence.
- Provide out-of-hours support for major incidents when required, as part of a rota.
- Strong knowledge of infrastructure monitoring tools (e.g., Prometheus, Grafana, Dynatrace, Datadog, Splunk, New Relic).
- Deep understanding of telemetry data (metrics, logs, traces) and how they relate to system performance and health.
- Experience with ML models for anomaly detection (supervised/unsupervised learning, clustering, time-series forecasting).
- Understanding of AIOps frameworks and concepts.
- Good grasp of core infrastructure (Linux/Windows servers, VMs, containers, cloud instances).
- Familiarity with networking, databases, storage systems, and cloud-native environments (AWS, Azure).
- Analytical mindset with a bias for root cause analysis.
- Effective communicator able to bridge engineering and operations teams.
- Proactive problem-solver with an ownership mentality.