Site Reliability Engineer
San Francisco, CA
About the role:
Site Reliability Engineers (SREs) are responsible for keeping all user-facing services and other Anyscale production systems running smoothly. SREs are a blend of pragmatic operators and software craftspeople who apply sound engineering principles, operational discipline, and mature automation to our environments and the Anyscale codebase.
Anyscale provides an application development platform for developers to build distributed applications. We’re commercializing a popular open source project called Ray, a framework for distributed computing and an ecosystem of libraries for scalable machine learning. Our goal is to build a standardized platform for distributed computing. Ray was developed at UC Berkeley by Robert Nishihara and Philipp Moritz, under the guidance of Ion Stoica and Michael Jordan, and the four of them co-founded Anyscale. The company raised $20.6M in Series A funding led by Andreessen Horowitz (a16z), with participation from NEA, Intel Capital, Ant Financial, Amplify Partners, 11.2 Capital, and The House Fund.
With Ray, we're making it easy to program at any scale (from your laptop to the datacenter) by providing easy-to-use, general-purpose, and high-performance tools. In addition, we are building a rich ecosystem of libraries (for reinforcement learning, hyperparameter search, experiment management, machine learning training, prediction serving, and more) on top of the core distributed system so that users can rapidly build sophisticated applications. Help us build the future of software development.
We are looking for passionate, motivated people who are excited to build tools to power the next generation of cloud applications.
As part of your role you will:
- Run the infrastructure for Anyscale
- Design, build, and maintain core infrastructure pieces that allow Anyscale to scale to hundreds of thousands of concurrent users
- Improve the deployment process to make it as boring and reliable as possible
- Debug production issues across services and levels of the stack
- Plan the growth of Anyscale's infrastructure
- Join the on-call rotation to respond to Anyscale availability incidents and support service engineers during customer incidents
Requirements:
- At least 2 years of relevant work experience
Must be willing to work onsite in our office.
We are excited to build a diverse team and encourage all to apply!