Our client is looking for a Site Reliability Engineering service.
Service description:
We are looking for a Site Reliability Engineering service for our Engineering chapter team. The goal is to ensure the reliability, scalability, monitoring, and performance of our on-premises services in the ERA product organization. Responsibilities include designing and managing our infrastructure and implementing best practices. The role involves working within cross-functional teams to improve systems and processes and to ensure uptime and efficiency.
Responsibilities:
- Design and maintain monitoring infrastructure
- Create custom dashboards, alerts, and visualization solutions
- Implement distributed tracing and log aggregation systems
- Establish monitoring best practices and SLI/SLO frameworks
- Maintain security compliance for on-premises monitoring tools
- Automate deployment and configuration management
- Collaborate with development teams on application instrumentation
- Participate in on-duty rotations
Requirements:
- Core Technologies: Grafana, Prometheus (PromQL), OpenTelemetry, Elasticsearch
- Infrastructure: Linux administration, networking, on-premises security
- Programming: Python, Bash, or Go for automation
- Experience: 3+ years in monitoring/observability; 2+ years running Grafana/Prometheus in production; strong Linux system administration experience; proven track record with on-premises infrastructure solutions
- Security: Knowledge of enterprise security practices and compliance requirements; ability to balance technical trade-offs with business needs and to prioritize effectively
- On-call: Participation in on-duty rotations (24/7 incident support)
Key Deliverables:
- Reduced MTTD/MTTR through effective monitoring
- Comprehensive observability across all systems
- Automated monitoring, deployment, and management
- Security-compliant monitoring practices
Languages: English (C1). Additional languages: German, French, Dutch.