Site Reliability Engineer (SRE) - LLM and Machine Learning

London/Remote
Roles we're searching for now: Software Engineering / DevOps / SRE
Hybrid
We are a pioneering technology company specialising in cutting-edge Large Language Models (LLMs) and Machine Learning solutions. We are seeking a highly skilled Site Reliability Engineer (SRE) to join our team and ensure the reliability, scalability, and performance of our LLM and Machine Learning infrastructure.

As an SRE, you will play a critical role in maintaining the stability and efficiency of our LLM and Machine Learning platforms. You will work closely with cross-functional teams to design, implement, and optimise infrastructure, monitor system health, and respond to incidents, enabling our researchers and engineers to focus on innovation.

Responsibilities

    • Infrastructure Design and Automation: Collaborate with engineering and research teams to design, implement, and automate infrastructure for LLM and Machine Learning workloads, ensuring scalability and reliability.
    • Deployment and Configuration: Manage deployment pipelines, configuration management, and orchestration tools to streamline the deployment of models and services.
    • Monitoring and Alerting: Implement and maintain robust monitoring, alerting, and logging systems to identify and resolve issues proactively and sustain system performance.
    • Incident Response: Lead incident response efforts, investigate root causes of outages, and implement preventive measures to reduce the likelihood of recurrence.
    • Capacity Planning: Perform capacity planning and scaling to accommodate growing workloads and ensure resource efficiency.
    • Security and Compliance: Collaborate with security teams to implement security best practices, carry out vulnerability assessments, and meet compliance requirements for LLM and Machine Learning systems.
    • Continuous Improvement: Continuously evaluate and improve system reliability, performance, and efficiency through automation and optimisation.
    • Documentation: Maintain comprehensive documentation for infrastructure configurations, procedures, and incident reports.

Requirements

    • Bachelor's or Master's degree in Computer Science, Information Technology, or a related field.
    • Proven experience as a Site Reliability Engineer or a related role with a focus on LLM and Machine Learning infrastructure.
    • Strong proficiency in cloud platforms (e.g., AWS, Azure, GCP) and containerisation technologies (e.g., Docker, Kubernetes).
    • Experience with configuration management and infrastructure-as-code tools (e.g., Ansible, Terraform) and CI/CD pipelines.
    • Knowledge of monitoring and observability tools (e.g., Prometheus, Grafana, ELK Stack).
    • Scripting and automation skills (e.g., Python, Bash).
    • Excellent problem-solving and troubleshooting skills.
    • Strong communication and collaboration skills.