Sr. Site Reliability Engineer
This position is either remote or based out of our Sevilla office. Want more insight into what it's like to be part of a distributed team at Bitnami? Take a look at this article, written by our own Director of Engineering, Victor Tuson Palau.
Bitnami is at the forefront of innovation that scales up to the largest production clouds as well as down to laptop development environments. Millions of applications are launched every month with Bitnami technologies.
Our Site Reliability Engineering (SRE) team deploys microservices to clouds leveraging modern practices such as containers, Kubernetes and immutable infrastructure. The SRE team is responsible for the availability and performance of the production infrastructure as well as partnering with the other engineering teams to successfully build, deploy and manage Bitnami’s services. We are all about tools and automation, not toil and firefighting. If you enjoy working with the cloud, containers, automation and instrumentation, you should join our mission to bring awesome software to everyone.
You must bring an understanding of the IT business (typically gained by having built or worked extensively with a private or public cloud); a broad perspective of the cloud industry and where it is headed; and experience in building solutions that scale. You will be collaborating with engineers around the world to bring cutting-edge solutions to market. Working with all of the significant cloud providers and container infrastructures will provide you with challenges and opportunities rarely found elsewhere.
Responsibilities
- Design and execute our Kubernetes cluster strategy to help our development teams deliver faster and more reliably
- Drive adoption of Kubernetes and Kubernetes best practices across the company and industry
- Create and/or provision reliable tools and infrastructure that enable rapid iteration amongst the product, research and development teams
- Automate our infrastructure following the Infrastructure as Code pattern
- Monitor, measure and troubleshoot infrastructure and services
- Optimize business continuity capabilities and drive down incident recovery times
- Capacity planning and management
- Provide support during office hours
- Mentor other members of the team (both inside and outside the SRE team)
Requirements
- At least 5 years of experience deploying, monitoring and troubleshooting multi-tier SOA applications and distributed systems at scale
- Experience with instrumentation for status and trend monitoring (CloudWatch, Prometheus, Graphite, etc.)
- Experience with modern application system log management (Syslog, SumoLogic, Fluentd, Loggly, Splunk, etc.)
- Container or cloud orchestration experience with at least one scheduler (Kubernetes, Docker Swarm, Mesos, etc.)
- Highly developed cloud literacy with strong knowledge of AWS, GCE and Azure
- Broad experience with Linux kernel and shell, TCP/IP and HTTP
- Designing networks and systems for security, encryption, performance and agility
- Backup and restoration automation, business continuity planning and testing
Nice to Haves
- Database administration experience with MySQL replication and high availability
- Knowledge of networking and security best practices with software defined networks
- Experience with big data, streaming and search systems like Cassandra, Hadoop, Spark, Kafka and ElasticSearch
Benefits
- Competitive salary and stock options
- Both vacation and sick time
- Your choice of machine and hardware
- Benefits vary based on location
Bitnami is a globally distributed software company, headquartered in San Francisco. With team members in over 10 countries, we've created an incredibly enjoyable and productive distributed environment. We are a bootstrapped, profitable, high-growth company with a high-energy and passionate team. Bitnami was also part of Y Combinator's Winter 2013 batch.
Learn more about our team and what it's like to work at Bitnami by visiting the About Us and Careers pages on our website.
Bitnami is an equal opportunity employer.