Site Reliability Engineer

Distributed (US or Canada) / Engineering / Full-time
Honeycomb is built to help engineering teams deeply explore and understand their own production systems — in real time. It's a service for the near and present future, where distributed systems are the new default, every service is a platform, and empowered generalist software engineers are the new ops. We are passionate about consumer-quality developer tools and excited to build a product that raises our industry's expectations of what our tools can do for us.

As a Site Reliability Engineer (SRE), you will focus on our expectations around availability, correctness, and performance while building tools and sharing expertise with the team to ensure our service continues to meet those expectations as it scales. The work will cover a wide area, from directly improving our core services to on-call and incident analysis, education around resilience and scaling, and feedback into the product itself.

Honeycomb believes strongly in code ownership: engineers are responsible for the code they ship. In this role you would help the rest of engineering understand the reliability contexts in which they own their services, not take care of services for them. You will be on the Platform Engineering team and split your time between SRE-focused work and other projects and tasks from the Platform team, ranging from infrastructure through our API and storage layer.

Many startups struggle with availability and scaling, and we're proud to have done well on these axes so far; we’ve invested in production excellence & ownership since the beginning. We want your help keeping it that way. We are growing the company and introducing chaos in the form of new traffic, customers, features, and engineers. It’s time to ensure this rapid expansion keeps that strong focus on production engineering!

This role would be a good fit for someone who:

    • Can debug both automated and human processes. The systems that let our service run are both technical and social. Building processes that help the engineers running these services respond effectively to failures gives us room to automate those responses.
    • Can work in both software engineering and operations. Deeply understanding how internet services run and scale is critical to ensuring their reliability. Building systems to help them scale is critical to helping the team move quickly. SRE must live in harmony with both worlds to succeed, pinch-hitting across our stack: the engine itself, our automation around it, and the configuration of our infrastructure. The ideal candidate would probably enjoy a bit of variety in their work and not find it too distracting to help out with other projects from time to time.
    • Can find balance in all things. Distributed systems are complicated creatures and sometimes need complicated tools to support them. Sometimes the added complication is not worth the benefit. This role can help cast a critical eye on balancing complication and simplicity, building and buying technology, spending innovation tokens where they are needed and choosing boring technology where they are not.
    • Enjoys teaching and practice. Computers break and it’s up to us to keep things running when they do. Running game days, chaos engineering, and trainings around incident response is part of building a resilient team and service with as little toil as possible. We want you to help us do it better. We want EVERYONE sleeping soundly through the night, including you.
    • Has some experience with stateful services. Keeping data is hard - keeping data in custom-built data storage layers is doubly hard. We need to apply the principles of SRE to help scale, tune, and maintain our main storage and query engine.

SRE is a title that means a lot of different things at different companies. Honeycomb is focused on understanding production systems in order to reduce toil and delight engineers. SRE at Honeycomb is focused on bringing together all aspects of running a reliable service, from infrastructure, through instrumentation and tooling, into the product itself and the team of people that build and run it. SRE is a force multiplier, enabling engineers to own their products easily and effectively.


Let’s do this

We're building a diverse and inclusive workplace where we learn from each other. We hire adults. We value transparency, autonomy, experimentation, and kind, direct feedback. We welcome nontraditional candidates, and people of all backgrounds, experiences, abilities and perspectives. We're an equal opportunity employer and our hiring process is designed to put you at ease and help you show your best work; if we are doing a poor job of this at any time, please let us know. Come build and maintain great things with us.