Senior Site Reliability Engineer

Toronto / Engineering, Product & Design / Full-time, Remote / Hybrid
About the Position: 

Job Title: Senior Site Reliability Engineer
Location: Candidate must be located in Canada or the USA. Our office is located in Toronto, ON, Canada, but the role is remote/hybrid/flexible.
Reports to: VP, Technology

Position Overview:
We seek an experienced and dedicated Senior Site Reliability Engineer (SRE) to join our growing Engineering, Product, Design and Growth team. As a Senior SRE, you will be responsible for designing, implementing, and maintaining scalable and reliable infrastructure and systems, ensuring the highest levels of availability and performance for our platform. The ideal candidate has a deep understanding of cloud technologies and automation, and a proven track record of building and managing complex distributed systems.

You will also collaborate with the broader Product, Design, and Engineering team to understand scalability challenges, development lifecycle bottlenecks, and pain points; make informed decisions about our technology; and deliver solutions that make continuous development and release as frictionless as possible.

If you are a results-oriented Senior SRE who takes pride in your work, is obsessed with reliability, performance, security, and quality, and thrives in a fast-paced, collaborative environment, we want to hear from you. We are on a mission to build an industry-leading product on a strong foundation, built by a world-class engineering, product, and design team!

We’re counting on you to:

    • Infrastructure Design and Automation: Design, build, and maintain scalable and reliable infrastructure using automation tools and best practices, focusing on infrastructure as code (IaC) principles. 
    • CI/CD: Build and improve robust CI/CD pipelines for engineers to release with minimal friction and high confidence.
    • System Reliability: Monitor, analyze, and optimize system performance, availability, reliability, and security, proactively identifying and mitigating potential issues before they impact users.
    • Incident Response and Resolution: Lead incident response efforts and training, ensuring timely resolution of incidents and conducting post-mortem analysis to identify root causes and prevent recurrence.
    • Operational Efficiency: Lead best practices and patterns for operating our systems by defining and enforcing rigorous SLAs for availability, performance, and security.
    • Capacity Planning: Collaborate with cross-functional teams to forecast capacity requirements and scale infrastructure to meet growing demand, optimizing resource utilization and cost efficiency.
    • Security and Compliance: Implement security best practices and compliance standards, ensuring the integrity and confidentiality of data and systems.
    • Continuous Improvement: Drive a culture of continuous improvement, implementing new technologies and processes to enhance system reliability, scalability, and efficiency.

Requirements:

    • Bachelor’s degree in Computer Science or a related technical field.
    • 5-8+ years of experience in site reliability engineering or a similar role, with a strong focus on building and maintaining highly available, scalable, and secure systems.
    • Proficiency in cloud platforms such as AWS and Azure, with hands-on experience in infrastructure provisioning and management tools like Terraform.
    • Expertise in containerization and orchestration technologies such as Docker, Kubernetes, or similar.
    • Strong scripting and programming skills.
    • Experience with monitoring and observability tools such as Datadog, Prometheus, Grafana, ELK stack, or similar.

Who you are:

    • Proactive and Curious - Excellent problem-solving skills and a proactive approach to troubleshooting and debugging complex distributed systems.
    • Continuous Learner - A growth mindset and a strong commitment to learning and development.
    • Accountable and Autonomous - Self-motivated to identify problems and solve them independently, taking solutions with some level of ambiguity from conception to release.
    • Team Player - Strong communication and collaboration skills, with the ability to work effectively in a cross-functional team environment.

Our Tech Stack:

    • .NET, AngularJS/jQuery, Angular, and TypeScript
    • [Mobile] Cordova, Java, Objective-C
    • MongoDB, S3, SQS, Lambda - AWS
    • CoffeeScript, Puma, Ruby on Rails, Postgres - Heroku
    • Bitbucket, GitHub, Trello, Jira, Slack

Our Perks and Benefits:

    • Unlimited Vacation: We believe you can be highly productive and still have plenty of time for life outside of work.
    • Generous health benefits plan: Coverage starts from Day 1 and includes vision & dental.
    • Choose your device: Are you team Windows or team Apple? You shouldn’t have to compromise, especially if you work more efficiently on a specific operating system. When you join us, you get to pick!
    • Home Office Allowance: $500/year to ensure your home office is set up for optimal comfort and productivity.
    • Health & Wellness Allowance: $750/year to support your health and wellness goals and hobbies.
    • Learning & Development Allowance: $1,000/year to explore a new skill, attend a conference, read some new books, etc.
    • Fully Remote: Work from the comfort of your own home with the choice to access our downtown Toronto office for a change of scenery. 
    • Events & Free Lunches: We prioritize weekly team bonding and monthly company-wide social events with a lunch stipend. We pride ourselves on maintaining a culture where everyone feels engaged, inspired, and excited to come to work every day.