Site Reliability Engineer

PT Porto, Portugal
Farfetch - Technology – Engineering / Full-time / Hybrid
FARFETCH exists for the love of fashion. Our mission is to be the global platform for luxury fashion, connecting creators, curators and consumers.
We're a positive platform for good, bringing together an incredible creative community made up of our people, our partners and our customers. This community is at the heart of our business success. We welcome differences, empower individuality and celebrate diverse skills and perspectives, creating an inclusive environment for everyone. We are FARFETCH for All.

TECHNOLOGY
We're on a mission to build the technology that powers the global platform for luxury fashion. We operate a modular end-to-end technology platform purpose-built to connect the luxury fashion ecosystem worldwide, addressing complex challenges and enjoying it. We're empowered to break traditions and revolutionise, with the freedom and autonomy to make a difference for our customers all over the world.

PORTO
Our Porto office is located in Portugal's vibrant second city, known for its history and its creative yet cosy environment. From Account Management to Technology and Product, whatever your skills are, you'll find your fit here. You can have an informal meeting in the treehouse or play the piano in your lunch break!

THE ROLE
At Farfetch, the Site Reliability Engineer (SRE) is responsible for ensuring the reliability, availability, and performance of the company's website and applications. This role involves close collaboration with both the development and operations teams to build and maintain a scalable and robust infrastructure that supports Farfetch's business objectives. As a Site Reliability Engineer, you will be part of a team that serves as a bridge between our department and the Infrastructure department. SREs in this position have the autonomy to explore and promote reliability best practices across the organization, acting as consulting partners for all tech-related areas.

WHAT YOU’LL DO

    • Design and implement highly available and scalable systems, ensuring the reliability and performance of the company's website and applications
    • Collaborate with cross-functional teams to define and establish service level objectives (SLOs) and service level agreements (SLAs) for critical systems
    • Monitor systems and applications, proactively identifying and resolving any performance bottlenecks or availability issues
    • Develop and maintain monitoring tools, alerts, and dashboards to provide visibility into system health and performance
    • Conduct post-incident analyses to identify root causes and implement preventive measures to avoid future incidents
    • Automate repetitive tasks and processes to improve efficiency and reduce manual intervention
    • Create and maintain documentation for system architecture, configuration, and troubleshooting procedures
    • Perform capacity planning and resource allocation to ensure optimal system performance and scalability
    • Collaborate with development teams to implement and deploy new features and enhancements, ensuring they meet reliability and performance standards
    • Stay up to date with industry best practices, new technologies, and emerging trends in site reliability engineering

WHO YOU ARE

    • General knowledge of operating systems (Linux and Windows)
    • Experienced in designing, analyzing, and troubleshooting large-scale distributed systems
    • Experience programming in at least one of the following languages: C#, Java, or Python. Knowledge of scripting languages is also a plus
    • Experience with configuration management tools like Ansible, Puppet, Chef or Salt (preferably Salt)
    • Familiarity with cloud platforms like AWS, Azure, or Google Cloud (preferably Azure)
    • Understanding of basic networking principles and protocols (TCP/IP, HTTP, DNS, etc.)
    • Knowledge of containerization and orchestration technologies (Docker, Kubernetes)
    • Expertise in monitoring and logging tools such as Prometheus, Grafana, ELK stack, or Splunk
    • Strong problem-solving and troubleshooting skills, with the ability to analyze and resolve complex technical issues
    • Excellent communication and collaboration skills to work effectively with cross-functional teams. English is required
    • Strong attention to detail and ability to work in a fast-paced, dynamic environment
    • Solid understanding of software development methodologies and DevOps principles
    • Certification in relevant technologies or frameworks is a plus (e.g., AWS Certified DevOps Engineer, Certified Kubernetes Administrator)
    • Familiarity with continuous integration/continuous deployment (CI/CD) pipelines
    • Experience with source control systems such as Git or SVN
    • Experienced in identifying and addressing toil