Lead Site Reliability Engineer
Remote or San Francisco
Together with our backend engineering team, you will be a driving force behind the infrastructure that powers our GraphQL Cloud platform, which currently runs on GKE. You will own the roadmap for scaling that infrastructure for fault tolerance, robustness, performance, and security. You will also make sure that monitoring, pager rotation, backup strategies, and recovery plans are in place to meet our enterprise SLA.
GraphQL is taking off in the industry. You'll help us build the scalable, reliable system that supports the use of GraphQL across many microservices and applications while handling the growing data ingestion load from our customers as their GraphQL usage increases.
What You'll Do
- Design, implement, and simulate a complete disaster recovery plan.
- Make sure all of our production systems have proper monitoring and alerting in place.
- Design and coordinate a backup strategy that covers all of our critical data.
- Own our pager rotation and on-call scheduling for production alerts and critical support tickets.
- Contribute to new feature designs so that every technical plan accounts for performance and reliability.
- Edit our Kotlin backend code to improve areas such as logging, monitoring, and performance. For example, help us embrace structured events.
- Maintain our Terraform configurations, Kubernetes files, and other deployment configuration tools, and extend them to support new use cases, features, and production environments. Maintain and extend their integration through Gradle.
- Help us establish and track SLOs and SLAs for Engine and its components, make sure our systems are built in line with them, and work with our product team to prioritize related work in our backlog.
- Write and maintain docs that answer the questions we regularly get from customers about architecture, data retention, internal policies, and similar topics, so that we can share them with customers and sales prospects. Also work with our Security Architect to maintain a list of Q&As about security and PII for sales prospects.
About You
- You've operated production systems at scale in the past and know what a world-class ops process and culture look like.
- You love setting up infrastructure as code. You love automation and you love learning new tools, languages, and technologies (e.g. Kotlin, Gradle, GraphQL, Kubernetes). You like not only finding problems, but fixing them too.
- You're pragmatic: you know how to make tradeoffs between design options to optimize for overall business goals, not just a technical result.
- You'd be excited to teach the rest of the backend team how to do SRE work themselves.
- You elevate the team around you.
You can do this work from our San Francisco headquarters, or anywhere else in the world.
Apollo is proud to be an equal opportunity workplace dedicated to pursuing and hiring a talented and diverse workforce.