Sr. Site Reliability Engineer
Hi, we're Buoyant.
We build Linkerd—the lightest, fastest, simplest service mesh for Kubernetes. Linkerd is an open source project. It has a thriving community of adopters and it powers the production systems of companies around the world. Companies like Microsoft, Nordstrom, Timescale, Expedia, and many, many more use Linkerd to add critical security, reliability, and observability features to their Kubernetes applications.
We also make Buoyant Cloud, a cloud-based management system for Linkerd deployments that monitors health, takes on the toil of upgrades, data plane management, and more. Buoyant Cloud makes any Linkerd deployment zero stress to operate.
We’re a small company with an incredibly outsized impact on the world. We’re remote-first and fully distributed with team members all over the world. Our competition is fierce: every day, we take on some of the largest companies in the world with essentially infinite dollars and infinite engineers to devote to their competing projects. Our secret weapon? A laser-like focus on solving actual problems for our users coupled with a deep sense of empathy for what it takes to operate a service mesh in production.
Working here is not for the faint of heart.
Imagine taking ultralight, ultrafast L7 "micro-proxies" written in Rust, sticking them next to other people's applications running on other people's clusters, and having them mediate all TCP (and in the future, UDP) communication to and from these apps—no matter what it is. Imagine these proxies upgrading connections from HTTP/1 to HTTP/2, initiating and terminating mutual TLS, retrying requests, emitting fine-grained metrics, issuing CSRs to the local CA every n hours, and so on. And imagine the end user blissfully operating 10,000 of these proxies, as a whole, with an easy-to-use CLI and API. That's a service mesh.
(The best way to learn more is to try it yourself. If you can run Kubernetes, you can get Linkerd up and running in 5 minutes.)
Under the hood Linkerd is extremely sophisticated and deeply technical, but the key to the entire project is simplicity. It's extremely difficult to make a simple service mesh, and very easy to make a complex one. (Heck, look at pretty much any of the other options.) And simplicity in this case means operational simplicity: it doesn't mean that you get a one-click install wizard, it means that whenever Linkerd is running, you can understand what it's doing and predict its behavior. It means no leaky abstractions, no complex tuning, no hidden gotchas. We spend a lot of time and energy in ensuring Linkerd is simple—it's our central promise to our users.
Linkerd is a graduated-tier project of the Cloud Native Computing Foundation, just like Kubernetes, Prometheus, and other defining projects of the cloud native space. It's written in Rust and Go; Go powers (most of) Linkerd's control plane. Rust powers Linkerd2-proxy, the source of Linkerd's true power. We're the only service mesh that uses something like Linkerd2-proxy, and it's a tremendous differentiator.
What's Buoyant Cloud?
Buoyant Cloud is our cloud-based management system for Linkerd. Linkerd runs on your clusters; Buoyant Cloud makes that a zero stress situation. You connect your existing open source Linkerd deployments to Buoyant Cloud and it takes over all the hard parts of actually running software: it monitors their health and alerts you if anything looks weird; it handles upgrades and data plane management; it lets you know if there's something you need to pay attention to, and lots more. In short: Buoyant Cloud turns Linkerd into a managed service, even though it's running on your own clusters.
Linkerd is open source, but Buoyant Cloud is a commercial product. You have to pay the big bucks to get it.
What are you looking for?
In this role, we're looking for a site reliability engineer who wants to take ownership of Buoyant Cloud's availability and operability.
Buoyant Cloud is a SaaS application. It runs on Kubernetes, in the cloud, and is built on Linkerd (of course!) and other bits of cool cloud native tech. We run multiple Kubernetes clusters; we need to span zones and regions; we deploy new features at an incredibly fast cadence; and we're onboarding new customers like crazy. We need someone to automate, well, almost everything about Buoyant Cloud, and make sure that it's always there for our ever-growing list of customers.
Several of us have SRE experience, but you'll be our first full-time SRE hire. We need someone who has a well-developed SRE playbook to lay the foundation for the future of how we deliver Buoyant Cloud to our customers. That said, we're a small company and everyone wears a lot of hats. You may need to write some Go. You may need to write some bash scripts. You will probably need to learn a lot about Linkerd and you'll get a fascinating glimpse into all the wacky and amazing things our customers do with it.
We’re a team, not a family, but we have families and are the kind of place where work doesn’t get in the way of that. There will be crunch times, but work-life balance is important and our p95 work times are pretty darn good for a startup.
This is an extremely technical company that is pushing the boundaries of modern cloud software. As measured by sheer impact per capita we are, in our humble opinion, at the very top of the industry, and you should be up for that challenge.
If that sounds fun, and maybe just a little scary—this might be the job for you.
- Four or more years of professional site reliability engineering experience
- Expert-level understanding of Kubernetes from the operator's perspective
- A vision for the Right Way to do things, tempered with an acceptance that the Right Way is rarely 100% achievable
- The ability to navigate complex software engineering tradeoffs between scalability, maintainability, future-proofing, end-user experience, and just shipping the darn thing already
- Excellent written communication skills
- The ability to work within a distributed team with members across different timezones
- A weird but adorable love of software infrastructure
In this role, you will:
- Ensure availability of Buoyant Cloud to our rapidly expanding customer base
- Automate the end-to-end code-to-production pipelines for Buoyant Cloud
- Give and receive critical feedback while maintaining a supportive and friendly environment
- Develop high-scale reliable test tooling and environments, drawing from experience working on production systems
- Solve problems (aka fight with CI/CD tools, Kubernetes, Linux, networking, and the rest of the complex ecology in which Linkerd lives)
- Low-ego, collaborative, and results-oriented
- Able to give and receive constructive feedback
- Willing to learn from and mentor teammates
- Passionate, empathetic, and kind
Buoyant is a fully distributed, remote-first company. Applicants from anywhere in the world can apply. However, working hours must overlap substantially with US timezones.