Site Reliability Engineer
San Francisco, California, USA
Engineering
ABOUT THE ROLE:
Our SRE team has a wide range of responsibilities, from managing systems to incident management. We focus on empowering the engineering org to ship and operate their own services in production. We do this through testing, automating and monitoring everything from the infrastructure to the application. We see failure as a problem with our processes, and an opportunity to learn and improve.
Automation is your thing. You don’t want to just maintain systems; you want to write the software that maintains them. You’re into Linux and are not afraid to dig deep when troubleshooting application, system and network problems. You care about distributed systems and like making them scale. You usually have ideas about how to improve things and want to work in an environment where you can make that happen.
We don't really care about your level of formal education, mathematical skill and so on. We want to see that you have relevant experience, that you like automating away repetitive work, that you have good attention to detail, an aptitude for learning new skills and that you have empathy for your fellow engineers.
Examples of typical things our SRE team has worked on:
- Implemented infrastructure as code, migrating from hand-provisioned systems on bare metal to fully automated systems on AWS and VMware.
- Rolled out our disaster recovery site, and ran successful production game days against it.
- Built load-testing infrastructure that lets engineers stress-test components against production, or against a staging environment we also built for systems not yet in production.
- Drove capacity planning for the whole org, identified risk areas, and got hands-on helping teams implement improvements. We consult with all engineering teams to prepare for our seasonal campaign traffic, where we see about 1.5x our previous peak.
The SRE team:
- Our team is responsible for most of the shared systems, such as load balancers, orchestration platforms, service discovery, etc. We’re also responsible for capacity planning and seasonal campaign preparation for the entire organization.
- Most of our application code and automation is written in Python, with Ansible building images and managing live hosts, and Terraform managing infrastructure. We use automated testing and already continuously deliver many of our infrastructure changes, but we have a lot more to do.
- Empathy is one of our most valued skills and we are always trying to maximize developer happiness and productivity, while developing automation and processes to minimize risk to site reliability.
- Our team is distributed between Dublin, Ireland and San Francisco. This position is in San Francisco, and we'd like you to be based there. Some flexibility is required to facilitate cross-timezone work. Once you've settled in, you'll have the opportunity to work from home regularly.
We believe anyone can build the life they imagine through online learning. Today, more than 40 million students around the world are advancing their careers and passions by exploring and mastering new skills on Udemy, and expert instructors are able to share their knowledge with the world. Through our global marketplace and our solutions for businesses and governments, we connect people everywhere with the skills they need for success in work and life. We’re a close-knit bunch that enjoys problem-solving and collaboration, and we share a serious belief in the power of learning and teaching to change lives. Udemy’s culture encourages innovation, creativity, passion, and teamwork. We also celebrate our milestones and support each other every day.
Founded in 2010, Udemy is privately owned and headquartered in San Francisco’s SOMA neighborhood with offices in Denver (Colorado), Dublin (Ireland), Ankara (Turkey), and São Paulo (Brazil).