Senior DevOps Engineer (Kubernetes, MLOps, LLMOps)

Austin, TX
DevOps / Full-time / Remote
We are seeking a highly skilled Senior DevOps Engineer with deep expertise in Kubernetes, complemented by significant experience in MLOps (Machine Learning Operations) and LLMOps (Large Language Model Operations). This role is ideal for someone who has a strong background in managing and architecting SaaS applications in Kubernetes and is passionate about building and optimizing infrastructure to support machine learning and AI-driven applications.

Role

    • The Senior DevOps Engineer will play a critical role in ensuring that our systems are highly available, reliable, and scalable.
    • You will architect, build, and monitor cloud-native architectures with Kubernetes and related technologies, particularly in the context of machine learning and AI workloads.
    • You should have a deep understanding of the Software Development Life Cycle, including Continuous Integration and Continuous Deployment (CI/CD) pipeline architecture, particularly as it relates to deploying ML models and AI services in Kubernetes environments.
    • You will assist in the design and operation of critical cloud infrastructure on AWS, with a focus on supporting the unique requirements of machine learning and AI-driven applications. Examples include model training, deployment, and scaling, all of which leverage AWS SageMaker.
    • Collaborate closely with data scientists and ML engineers to create a streamlined, automated build and deployment process for ML models and LLMs in Kubernetes.
    • Implement and manage the infrastructure necessary for the continuous integration, delivery, and monitoring of ML models and AI services, ensuring they are seamlessly integrated into our SaaS applications.
    • Ensure the availability and performance of production systems that run ML-driven services, proactively identifying and resolving issues that may impact model performance or availability.
    • Optimize infrastructure for the efficient training, deployment, and scaling of ML models and LLMs, leveraging Kubernetes, GPU clusters, and cloud-native tools, including AWS SageMaker.
    • Develop and maintain monitoring and alerting solutions tailored to ML and AI workloads, ensuring that both the infrastructure and deployed models are performing as expected.
    • Troubleshoot and resolve production incidents, ensuring minimal downtime and quick recovery.
    • Participate in on-call rotation as necessary.
    • Ensure the security and compliance of our production systems and data, with a particular focus on protecting sensitive AI and ML data.
    • Mentor and coach junior DevOps engineers. 

What we value

    • Bachelor's degree in Computer Science, Engineering, or a related field.
    • A minimum of 7 years of experience in maintaining optimal performance of online production environments, utilizing bare metal, cloud, and container technologies.
    • At least 4 years of experience managing production Kubernetes infrastructure, with exposure to cloud vendor Kubernetes solutions such as EKS, AKS, and GKE.
    • Strong experience with Docker for containerization, including creating and managing Docker images and containers.
    • Strong experience in architecting and managing SaaS applications in Kubernetes, with specific experience in MLOps and LLMOps.
    • Deep understanding of the machine learning lifecycle, including model training, deployment, monitoring, and scaling, particularly using AWS SageMaker.
    • Experience with MLOps tools and frameworks, such as Kubeflow, MLflow or similar, and their integration into Kubernetes environments.
    • Familiarity with LLMOps, including the deployment and management of LLMs in production environments.
    • Solid experience in scripting languages such as Python.
    • Experience with infrastructure deployment and automation tools such as Terraform and CloudFormation.
    • Working knowledge of industry-standard build tooling and CI/CD using GitHub and GitHub Actions.
    • Expertise in monitoring and logging solutions such as Prometheus and Grafana.
    • Good understanding of networking and security concepts.
    • Strong knowledge of Linux systems and shell scripting.
    • Strong communication and collaboration skills, with experience working closely with data scientists and ML engineers.
    • Experience working in an agile environment and understanding of agile methodologies.
    • Certifications such as CKA (Certified Kubernetes Administrator) or CKAD (Certified Kubernetes Application Developer) are a plus.

Nice to haves

    • Experience with workflow orchestration tools like Apache Airflow, particularly for managing complex data pipelines and ML workflows.
    • Experience with GitOps tools such as ArgoCD, for managing Kubernetes deployments through version-controlled repositories.
    • Familiarity with GPU acceleration technologies and their integration with Kubernetes for optimizing ML model training and inference.
    • Knowledge of data versioning tools and frameworks like DVC (Data Version Control) in the context of MLOps.
    • Experience with cloud cost optimization strategies, particularly in environments running intensive ML and AI workloads. 

Technologies we use

    • We are hosted on AWS, use numerous AWS services, and are expanding into Azure.
    • AWS SageMaker is central to our machine learning model training, deployment, and management processes.
    • Terraform, CloudFormation, Ansible, and Kubernetes are used for our infrastructure deployment and automation.
    • Industry-standard build tooling and CI/CD using GitHub and ArgoCD.
    • A mix of open-source and proprietary technologies that are tailored to the problems at hand.

What you can expect

    • Enjoy great team camaraderie whether at our Irvine office or working remotely.
    • Work with talented and collaborative co-workers.
    • Thrive on the fast pace and challenging problems to solve.
    • Modern technologies and tools.
    • Continuous learning environment.
    • Opportunity to communicate and work with people of all technical levels in a team environment.
    • Grow as you are given feedback and incorporate it into your work.
    • Be part of a self-managing team that enjoys support and direction when required.
    • Three weeks of paid vacation, right out of the gate.
    • Competitive salary.
    • Generous medical, dental, and vision plans.
    • Sick leave and paid holidays are offered.
    • Sit/stand workstations at our amenity-rich office in Irvine, CA.
    • Casual environment.
    • Kitchen stocked with snacks and drinks on site.
    • Hybrid and remote options available.

$130,000 - $175,000 a year
Please note that the national salary range listed in this job posting reflects the new-hire salary range across levels and U.S. locations applicable to the position. The final salary will be commensurate with the candidate's accepted hiring level and work location. This range represents base salary only and does not include benefits, if applicable.
LeoTech is committed to a diverse and inclusive workforce. We are an equal opportunity employer and do not discriminate on the basis of race, ethnicity, gender, gender identity, sexual orientation, protected veteran status, disability, age, or any other legally protected status.