Site Reliability Engineer (Remote)
Waterloo ON, Mountain View CA, or Remote
Mattermost, one of Y Combinator's top 100 companies, provides an open source, enterprise-grade messaging platform that allows teams at the world’s leading organizations to collaborate securely and privately anywhere. With over 10,000 server downloads per month, our customers include Intel, Samsung, Affirm, the US Department of Defense and more. Our private cloud solutions offer secure, configurable, highly scalable messaging across web, phone and PC with archiving, search, and deep integrations with hundreds of SaaS and on-premises technologies. Headquartered in Palo Alto, California, our company serves customers around the world with a distributed organization spanning the globe.
We value high-impact work, ownership, self-awareness and a focus on customer success. If these values match who you are, we hope you'll learn more about working at Mattermost and come talk to us!
About the Role
Working in open source means your work is publicly visible. Your code will receive both credit and constructive critique from the community. With the right mindset and support, these can lead you to a highly positive working environment and to making the best engineering decisions of your career. Core committers include highly skilled volunteer developers from the community, staff employed by enterprises deploying and investing in Mattermost, as well as staff employed by Mattermost, Inc.
Read about our end-to-end recruiting process for core committers at: https://docs.mattermost.com/process/developer.html
We are looking for an engineer with demonstrated experience in software development and infrastructure. The role focuses on keeping Mattermost’s SaaS offering highly reliable and scalable by building tools, deploying infrastructure and automating operations.
Responsibilities
- Develop tooling and infrastructure to support thousands of customers on Mattermost’s SaaS offering
- Write thoughtful and high-quality code in Go
- Define infrastructure in code with Terraform
- Implement, maintain and tune monitoring and alerting systems
- Build custom tools and services to automatically recover from incidents
- Respond to incidents while on call and resolve them quickly and effectively
- Deploy applications to and manage Kubernetes clusters
- Write clear and concise incident response playbooks
Qualifications
- Bachelor's degree in Computer Science or a related field, or significant professional software development or DevOps experience
- Strong, demonstrable experience building and maintaining highly reliable services
- Strong experience with SRE and DevOps methodologies
- Experience with Go, or the ability to quickly become proficient in it
- Familiarity with containers and orchestration systems, such as Docker and Kubernetes
- Comfortable working with infrastructure as code tools, such as Terraform
- Ability to be on call
- Experience with distributed application systems using HTTP, WebSockets, RPC and pub/sub at scale
- Practical AWS experience
- Knowledge of Grafana and Prometheus
- Comfortable with GitHub, Jira, Jenkins and CircleCI
- Experience working in open source communities
We're looking for someone who wants to help us build the future of Mattermost and improve the way the world communicates. The right person in this role has the opportunity to have a huge impact not only on Mattermost the product and its many users worldwide, but also on the open source community that has been key to Mattermost's success. If this sounds like you, please apply!