Software Engineer (Infrastructure Reliability)
San Francisco, CA
About the role:
As a Site Reliability Engineer, you will be the go-to person for keeping all user-facing services and other Anyscale production systems running smoothly. Anyscale runs on top of certain cloud components, and you will be responsible for developing a unified perspective on how these cloud components are used across the company. This includes processes for provisioning, negotiating prices, managing costs, and identifying applications across the company where teams can reduce waste. You will apply sound engineering principles, operational discipline, and mature automation to our environments and the Anyscale codebase as we scale.
As part of this role, you will:
- Develop a unified perspective on how these cloud components are used across the company
- Ensure that the deployment methodologies support the company's reliability goals
- Build systems that allow the team to understand what’s happening in production so that when there is an issue we can identify it quickly. This involves helping to build common observability infrastructure for metrics, logging, and tracing
- Build systems that monitor at different levels, and make it easy for teams to add their own monitoring and alerting
- Build the testing infrastructure so that the rest of the team can write tests and focus on getting the tests right
- Build tools to measure whether the team is meeting service level objectives (SLOs), and define what the SLOs should be across the organization
- Set up on-call systems and best practices for on-call for the rest of the organization
- Be on call and make sure that the right thing happens if other people on call do not respond
- Coordinate how the team creates services that run on the cloud, and track operational details such as who deployed a service, when it was deployed, and who to call when something goes wrong
We'd love to hear from you if you have:
- At least 3 years of relevant work experience
Anyscale provides an application development platform for developers to build distributed applications. We’re commercializing a popular open source project called Ray, which is a framework for distributed computing as well as an ecosystem of libraries for scalable machine learning. Our goal is to build a standardized platform for distributed computing. Ray was developed at UC Berkeley by Robert Nishihara and Philipp Moritz, under the guidance of Ion Stoica and Michael Jordan, and the four of them co-founded Anyscale. The company raised a $20.6M Series A and a $40M Series B from Andreessen Horowitz (a16z), NEA, Foundation Capital, Intel Capital, Ant Financial, Amplify Partners, 11.2 Capital, and The House Fund.
With Ray, we're making it easy to program at any scale (from your laptop to the datacenter) by providing easy-to-use, general-purpose, and high-performance tools. In addition, we are building a rich ecosystem of libraries (for reinforcement learning, hyperparameter search, experiment management, machine learning training, prediction serving, and more) on top of the core distributed system so that users can rapidly build sophisticated applications. Help us build the future of software development.
Anyscale Inc. is an Equal Opportunity Employer. Candidates are evaluated without regard to age, race, color, religion, sex, disability, national origin, sexual orientation, veteran status, or any other characteristic protected by federal or state law.