DevOps/Site Reliability Engineer
San Francisco, CA
R&D – Engineering
About the Role:
PlanGrid is looking for a Site Reliability Engineer to join our rapidly growing Data Center Operations team.
The Data Center Operations team ensures infrastructure uptime and handles the provisioning and management of AWS cloud resources for PlanGrid engineers. A big part of our job is giving developers visibility into their services’ performance through metrics, traces, and logging. Our growing team handles complex architectural projects that allow us to grow internationally, such as datacenter-level regional disaster recovery and, eventually, self-healing infrastructure across many AWS regions.
In the last year, we’ve transitioned the company from a legacy Heroku architecture to self-hosted Kubernetes, built high availability and resiliency around database clusters through teardown infrastructure testing, written a Fluentd DaemonSet to log all activity in our customer-facing Kubernetes pods, implemented Go-based autoscalers for our EC2 instances, and contributed upstream to Spinnaker’s codebase.
We adhere to a DevOps methodology (as opposed to old-school operations) where developers, not operations people, are responsible for their code’s reliable operation and are empowered and trusted to make the changes necessary for reliability. Our work touches every layer of infrastructure, so we are looking for engineers with a broad range of operations and development experience, especially people who define success in terms of SLOs, SLIs, and SLAs, who care deeply about observability in distributed systems, and who have experience scaling out cloud systems to multiple regions worldwide.
DevOps and systems experience is highly valued. If you’ve gotten your hands dirty with package and configuration management, infrastructure-as-code principles, Kubernetes, AWS, Linux and security, and PostgreSQL replication, and you know your way around Docker, Bash, and Python, we’d love to talk with you.
You should be passionate about getting in front of problems instead of waiting until things are on fire. If you dream of stability, love metrics, communicate well, document your code, and love building reliable systems that hum along and take care of themselves, we want you on the team.
Our responsibilities include:
- Maintain and upgrade our Spinnaker + Kubernetes CI/CD pipeline, and the tooling that makes it all work, in a sane and reproducible way
- Automate infrastructure deployments with CloudFormation and SaltStack to help us expand to multiple AWS regions
- Build observability into every aspect of our production infrastructure
- Participate in on-call rotations and be a model of how to manage incidents
- Reduce RPO/RTO for our S3, RDS, Redis, MongoDB, etcd, and PostgreSQL instances
In your first 6 months on the SRE team, you will:
- Help us implement and automate a multi-region datacenter failover with as little customer downtime as possible
- Move us closer to a world of rigorously tested immutable infrastructure, where all infrastructure is tested before it ever gets deployed
- Programmatically make secrets management painless and easy across distributed services
- Improve observability with distributed tracing for all requests from client to CDN to load balancer to cluster and back again
- Help developers smoke-test better by bringing canary analysis and automated scale testing into their world
Perks and Benefits:
- Located in San Francisco’s Mission District just one block from BART, among local shops, bars, and restaurants
- Flexible vacation
- Dog-friendly office
- Clipper Cards (for public transportation) funded by PlanGrid
- Construction site tours of the biggest projects in San Francisco using PlanGrid
- Volunteer time off: We encourage employees to give back to our local communities. We organize volunteer days and have worked with organizations such as Glide, SF/Marin Food Bank, Muttville, Family Dog Rescue, and Bryant Elementary School (as part of PlanGrid’s commitment with Circle the Schools).
- Catered lunches
- Premium medical, dental, and vision coverage for full-time employees and their dependents
- Office is wheelchair accessible
- We provide parental leave for both parents
About PlanGrid:
PlanGrid is the leader in construction productivity software. Used on more than 1 million projects around the world, PlanGrid delivers value across numerous phases of construction. Everyday use builds a massive and accurate history of every jobsite, producing a data-rich record set at turnover that is essential to long-term operations.
PlanGrid is the first construction productivity software that allows contractors and owners in commercial, heavy civil, and other industries to collaborate easily from their mobile devices and desktops. PlanGrid is used in more than 79 countries by thousands of customers including DPR, Granite, NVIDIA, Target Corporation, and Tutor Perini. PlanGrid was a member of Y Combinator’s 2012 Winter Class and has secured over $69 million in funding from Sequoia, Tenaya Capital, Founders Fund, GV, 500 Startups, Box, Northgate, and Spectrum 28.
For more information, please visit: https://www.plangrid.com/.
PlanGrid is an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, gender expression, national origin, age, protected veteran or disabled status, or genetic information.
As part of GDPR compliance procedures, we have posted our Recruiting Privacy Notice on our website.