MLOps Engineer (SRE) at PEOPLE

Closed job - No longer accepting applications

We are looking for an MLOps Engineer with a focus on Site Reliability Engineering (SRE) to join a critical artificial intelligence project. The project aims to ensure the reliability, traceability, and high availability of machine learning models in production, with low latency and active monitoring of both business and ML metrics.

Applications are accepted only at getonbrd.com.

Main Responsibilities

Design and operate observability solutions for ML models in production (monitoring, alerts, traceability).
Develop dashboards and metrics to evaluate model performance, cost, and stability.
Implement tools for structured logging, drift monitoring, data quality, and inference error tracking.
Automate scaling, fault recovery, and self-healing of inference services.
Establish SLAs/SLIs/SLOs for ML pipelines and intelligent services.
Collaborate with data science and product teams to detect and mitigate incidents related to models in production.
Set up rollback policies and blue/green deployments for model versions.
Apply SRE practices such as chaos engineering, stress testing, staging tests, and continuous integration.

Profile Requirements

Minimum of 4 years of experience as an SRE, DevOps, or Platform Engineer in machine learning projects.
Knowledge of model monitoring frameworks such as Evidently, Arize AI, WhyLabs, or similar.
Proficiency with tools like Prometheus, Grafana, ELK/EFK, OpenTelemetry, or Datadog.
Experience with orchestrators such as Airflow or Kubeflow, or with experiment tracking tools (MLflow, Weights & Biases).
Strong knowledge of Kubernetes, Docker, Helm, and infrastructure-as-code tools (Terraform, Pulumi).
Experience with CI/CD for ML pipelines (testing, validation, rollback).
Ability to automate processes, monitor systems in real time, and respond to critical incidents.
Strong collaborative skills to work closely with data scientists and product teams.
Attention to detail, resilience in high-pressure environments, and a mindset focused on continuous improvement.

Nice-to-have (non-mandatory)

Experience operating models on Alibaba Cloud and configuring observability in that environment.
Familiarity with strategies such as canary deployment, shadow testing, and controlled experimentation.
Knowledge of explainable AI frameworks and model auditing.
Previous experience in high-transaction environments such as banking, accounting, payroll, or logistics.

Benefits

Work modality: Remote.

Project duration: 1 year, with the possibility of extension.

GETONBRD Job ID: 55813

Computer provided: PEOPLE provides a computer for your work.

Remote work policy

Locally remote only

The position is 100% remote, but candidates must reside in Mexico, Canada, or the United States.