Site Reliability Engineer

PT Porto, Portugal

Farfetch - Technology – Engineering

Full-time

Hybrid

FARFETCH exists for the love of fashion. Our mission is to be the global platform for luxury fashion, connecting creators, curators and consumers.

We're a positive platform for good, bringing together an incredible creative community made up of our people, our partners and our customers. This community is at the heart of our business success. We welcome differences, empower individuality and celebrate diverse skills and perspectives, creating an inclusive environment for everyone. We are FARFETCH for All.

TECHNOLOGY

We're on a mission to build the technology that powers the global platform for luxury fashion. We operate a modular end-to-end technology platform purpose-built to connect the luxury fashion ecosystem worldwide, addressing complex challenges and enjoying it. We're empowered to break traditions and revolutionise, with the freedom and autonomy to make a difference for our customers all over the world.

PORTO

Our Porto office is located in Portugal's vibrant second city, known for its history and its creative yet cosy environment. From Account Management to Technology and Product, whatever your skills are, you'll find your fit here. You can have an informal meeting in the treehouse or play the piano in your lunch break!

THE ROLE

At Farfetch, the Site Reliability Engineer (SRE) is responsible for ensuring the reliability, availability, and performance of the company's website and applications. This role involves close collaboration with both the development and operations teams to build and maintain a scalable and robust infrastructure that supports Farfetch's business objectives. As a Site Reliability Engineer, you will be part of a team that serves as a bridge between our department and the Infrastructure department. SREs in this position have the autonomy to explore and promote reliability best practices across the organization, acting as consulting partners for all tech-related areas.

WHAT YOU’LL DO

Design and implement highly available and scalable systems, ensuring the reliability and performance of the company's website and applications
Collaborate with cross-functional teams to define and establish service level objectives (SLOs) and service level agreements (SLAs) for critical systems
Monitor systems and applications, proactively identifying and resolving any performance bottlenecks or availability issues
Develop and maintain monitoring tools, alerts, and dashboards to provide visibility into system health and performance
Conduct post-incident analyses to identify root causes and implement preventive measures to avoid future incidents
Automate repetitive tasks and processes to improve efficiency and reduce manual intervention
Create and maintain documentation for system architecture, configuration, and troubleshooting procedures
Perform capacity planning and resource allocation to ensure optimal system performance and scalability
Collaborate with development teams to implement and deploy new features and enhancements, ensuring they meet reliability and performance standards
Stay up to date with industry best practices, new technologies, and emerging trends in site reliability engineering

WHO YOU ARE

General knowledge of operating systems (Linux and Windows)
Experienced in designing, analyzing, and troubleshooting large-scale distributed systems
Experience programming in at least one of the following languages: C#, Java, or Python; other scripting languages are a plus
Experience with configuration management tools like Ansible, Puppet, Chef or Salt (preferably Salt)
Familiarity with cloud platforms like AWS, Azure, or Google Cloud (preferably Azure)
Understanding of basic networking principles and protocols (TCP/IP, HTTP, DNS, etc.)
Knowledge of containerization technologies (Docker, Kubernetes) and orchestration tools
Expertise in monitoring and logging tools such as Prometheus, Grafana, ELK stack, or Splunk
Strong problem-solving and troubleshooting skills, with the ability to analyze and resolve complex technical issues
Excellent communication and collaboration skills to work effectively with cross-functional teams; spoken English is required
Strong attention to detail and ability to work in a fast-paced, dynamic environment
Solid understanding of software development methodologies and DevOps principles
Certification in relevant technologies or frameworks is a plus (e.g., AWS Certified DevOps Engineer, Certified Kubernetes Administrator)
Familiarity with continuous integration/continuous deployment (CI/CD) pipelines
Experience with source control systems such as Git or SVN
Experienced in identifying and addressing toil
