Our client is looking for a Platform & DevOps Engineer
Service Description
The Platform Engineer in the Consumer Centricity Platform Operations team is responsible for the reliable, secure, and stable operation of the organization’s high-availability cloud platform, built on Kubernetes and composed of multiple in-house platform components.
The role focuses on platform lifecycle management, day-2 operations, incident response, and operational excellence, ensuring that customer-facing Web UIs and APIs remain available, performant, and secure 24/7.
The Platform Engineer acts as a technical custodian of the platform, providing a stable foundation on which service teams can safely deploy and operate their workloads.
Primary Objectives
- Maintain platform availability and reliability in accordance with SLOs/SLAs
- Ensure operational readiness of all environments (DEV / TEST / ACC / PROD)
- Provide 24/7 operational coverage for critical platform services (via on-call)
- Ensure the platform is observable, secure, well controlled, and documented
- Execute platform changes, upgrades, and maintenance in a predictable and low-risk manner
Key Responsibilities
Kubernetes & Runtime Operations
- Operate Kubernetes primitives and platform add-ons:
  - Ingress controllers
  - Service discovery
  - Workload identity
- Troubleshoot Kubernetes-related failures:
  - Pod lifecycle issues
  - Networking problems
  - Resource starvation
- Perform controlled rollouts with rollback plans
Reliability & 24/7 Incident Response
- Participate in the 24/7 on-call rotation for critical services (incident responder)
- Lead or contribute to:
  - Incident triage and mitigation
  - Root Cause Analysis (RCA)
  - Post-incident action tracking and follow-up
- Maintain and improve runbooks and operational procedures
Observability & Monitoring
- Operate (and use) the open-source observability platform
- Ensure effective observability across the platform:
  - Metrics, logs, and distributed traces
  - Actionable alerts
  - Reduced false positives
- Support incident analysis through correlation and telemetry inspection
Change, Release & Maintenance Management
- Plan and execute platform changes
- Follow structured change management practices
- Communicate changes to stakeholders
- Ensure platform changes are documented and auditable
Security & Compliance (Operational Focus)
- Operate platform security controls:
  - RBAC
  - Network boundaries
  - Secret management
- Apply security updates and patches to platform components
- Support vulnerability remediation efforts
- Provide operational evidence for audits and security reviews
Automation & Operational Improvement
- Automate repetitive operational tasks where appropriate
- Reduce operational risk through standardization and documented procedures
- Apply a Platform as Code approach (GitOps)
Requirements
Technical Skills
Kubernetes (Deep Production Expertise)
- Multi-cluster architecture & lifecycle management
- RBAC & least-privilege design
- Network policies & traffic segmentation
- Stateful workloads & storage strategy (CSI, PV/PVC)
- Autoscaling (HPA/VPA) & resource tuning
- Pod Security Standards
- Admission controllers
- Performance & reliability troubleshooting
- Cluster-level debugging (networking, DNS, scheduling, OOM, crash loops)
GitOps & Continuous Delivery
- ArgoCD (advanced usage)
  - App-of-Apps pattern
  - Sync waves & hooks
  - Drift detection & reconciliation
  - Multi-environment promotion workflows
- Git-based deployment strategy with version management
- Declarative platform design with PR-driven changes
- YAML-based CI/CD pipelines with Harness.io
- Secure secret handling in CI/CD (with HashiCorp Vault)
Packaging & Configuration
- Helm (advanced chart authoring)
  - Reusable library charts
  - OCI-based registries
  - Values layering strategy
- Kustomize overlays for multi-environment isolation and strategic patches
Container & Artifact Management
- Docker (secure multi-stage builds, optimization)
- Harbor (RBAC, replication, vulnerability scanning)
- JFrog Artifactory (Docker & Helm registry management)
- Artifact versioning & promotion strategy
Secrets & Security
- HashiCorp Vault for dynamic secrets with CSI integration
- Image vulnerability scanning integration
- Supply chain security awareness
- TLS & certificate lifecycle management
- RBAC governance
Observability & Reliability
- OpenTelemetry (metrics, logs, traces)
- Prometheus or VictoriaMetrics (recording rules, HA setup)
- Loki (log aggregation & LogQL)
- Tempo (distributed tracing)
- Grafana (advanced dashboards & alerting)
- SLI/SLO design & error budget thinking
- Alert noise reduction strategy
Networking (Advanced)
- TCP/IP & DNS fundamentals
- TLS & mTLS concepts
- Kubernetes Services, Ingress & Reverse Proxy concepts
- East-west vs north-south traffic
- API routing & traffic management
- Network Policies implementation
Automation
- Advanced Bash scripting
- Infrastructure automation mindset
Nice to Have
- Kong API Gateway (API routing, plugins, authentication, rate limiting)
- Redis (operational knowledge: deployment, persistence, clustering, backups)
- PostgreSQL (migrations, backups, HA basics, Kubernetes deployment patterns)
- MongoDB (replica sets, backups, Kubernetes deployment patterns)
- Kargo on top of ArgoCD for release orchestration
Operational Skills
- Proven experience in production operations or platform support roles
- Ability to work calmly and methodically under pressure
- Strong troubleshooting skills across distributed systems
- Clear written and verbal communication during incidents and changes
- Flexibility to balance daily operations with long-term changes
Ways of Working
- Structured, risk-aware, and detail-oriented
- Comfortable with operational responsibility and accountability
- Strong collaboration with Development, Security, and Product teams
- Documentation-first mindset for operational knowledge
Positioning vs Other Roles
- Not a pure SRE role: focus is stability and operations, not reliability engineering
- Not a pure DevOps engineer embedded in product teams
- The role is the operational owner of the platform in all environments, ensuring it runs safely and predictably
Additional Information
- Location: Brussels
- On-site presence: by default, physical presence on site is required 2 days per week
- Work regime: Full-time