Senior Site Reliability Engineer
Engineering – Site Reliability
Klaviyo is a profitable startup located right in the heart of downtown Boston. We craft software thousands of ecommerce companies use to grow their business faster. We love tackling tough engineering problems, and we push each other to move out of our comfort zones, learn new technologies, and work hard to ensure each day is better than the last. We seek out engineers who specialize in certain areas but are passionate about building, owning & scaling new features end to end. We relish breaking through obstacles & technical challenges and moving continually faster.
About the Role
Site Reliability Engineering (SRE) is essentially what you get when you treat system operations as a software problem. The mission of the Site Reliability Engineering team is to ensure uninterrupted service for Klaviyo customers and to act as a force multiplier for Klaviyo product teams so they can deliver better software faster.
Klaviyo is a high-growth, technology-driven company that is passionate about the user experience of its application and the well-orchestrated operation of its service infrastructure. The SRE team works on its own initiatives to build foundational backend services, and it also builds tooling and automation that allow product teams to release and scale their software predictably.
SREs are team players and embed themselves within product teams to advance the architecture and performance of software systems and to train their peers in topics such as debugging distributed systems, building self-healing capabilities or eking out every drop of performance possible.
As a Site Reliability Engineer, you will have ownership of foundational Klaviyo services and a big impact on our product teams. Klaviyo’s infrastructure, event processing, and team have grown 300% year over year, so there are always new skills to learn and technical challenges to solve the right way.
This position is full-time and based in Boston.
Responsibilities
- Design, write and deliver software to improve the availability, scalability, latency, and efficiency of Klaviyo’s services.
- Perform quantitative analysis to understand high-impact events that break Klaviyo functionality, and manage the cross-functional effort to resolve those events.
- Solve problems relating to mission-critical services and build automation to prevent problem recurrence, with the goal of automating the response to all non-exceptional service conditions.
- Engage in service capacity planning and demand forecasting, software performance analysis and system tuning.
- Uncover and advocate for preventative, upstream solutions with internal stakeholders and external vendors and dependencies
- Confidently make informed, data-driven decisions in a fast-paced environment with competing priorities
- Identify and drive opportunities to improve operational workflows
- Perform periodic on-call duties
- Educate other Klaviyo engineers on the best practices for building and operating highly reliable systems
Requirements
- BA or BS Degree in Computer Science, related field, or equivalent experience
- Technical, Engineering or Quantitative background
- Proven experience with Linux (we run Ubuntu) and all layers of the networking stack. You should be confident administering and debugging production Linux systems
- Experience working on team software projects
- Experience in one or more of: Python, Ruby, Go.
- Familiarity with running and scaling distributed software systems (load balancing, high availability, systems monitoring, etc.)
- Expertise in designing, analyzing and troubleshooting high-traffic, large-scale distributed systems.
- Understanding of Unix/Linux systems from kernel to shell and beyond, taking in system libraries, file systems, and client-server protocols along the way.
- Experience with Amazon Web Services (AWS) or similar cloud compute offerings
- Networking: knowledge and understanding of network theory, including protocols (TCP/IP, UDP, ICMP, etc.), MAC addresses, IP packets, DNS, OSI layers, and load balancing.
- Experience with building and scaling highly-reliable distributed Python systems (we use Django extensively)
- Experience with instrumenting and monitoring production systems (Nagios, Statsd/Graphite, APM, etc.)
- Systematic problem solving approach, coupled with a strong sense of ownership and drive