Technical Lead, Site Reliability

Denver, CO / Product – Engineering / Full-time (Remote)

Why You Matter

Operational systems and applications that provide security to customer environments require 24/7 availability. The organizations that provide the most reliable operational systems are driven by teams that are passionate about keeping production environments up and available beyond customer expectations. These teams are hyper-focused on monitoring production systems, identifying potential and realized issues as early as possible, attacking and resolving these issues in real time, and improving tools and processes to prevent recurrences.

Delivering that level of 24/7 operational availability is paramount to the Red Canary charter of being our customers’ top security ally. As the Production Engineering Tech Lead, you will use your technical expertise to coach and supervise team members’ troubleshooting and resolution techniques, as well as enhance monitoring capabilities across our customer environments. You will run point on Red Canary’s response to, and resolution of, production incidents. You will ensure that clear and consistent status updates go out to both internal and external stakeholders during incidents. You will also manage and prioritize tasking for the Production Engineering team.

The team you will lead is responsible for monitoring and maintaining our production systems and applications so that we meet established Red Canary service level agreements regarding availability, uptime, and application response thresholds. Production Engineering is our front-line defense against system issues and outages that impact Red Canary’s ability to deliver the security outcomes our customers require.

Why Red Canary

Red Canary was founded to make security for every business better by protecting organizations around the world from cyber threats. Our combination of market-defining technology, processes, and expertise, delivered using an innovative SaaS model, is preventing breaches every day.

The Red Canary Engineering team builds and operates the platform to deliver unmatched threat detection and response. We process billions of events per day from hundreds of thousands of systems worldwide. We are on the front lines of cybersecurity with unique opportunities to utilize new technologies and solve the hardest problems in the field.

Who You Are

You are passionate about working in 24/7 operational environments that leverage cloud and container technologies and services. You use your systems administration skills and experience to address production impacts, either directly or through your team members. You use your software development and scripting skills and knowledge of Infrastructure-as-Code to improve reliability, observability, and availability of production systems and applications. You understand configuration management best practices and how they enable reliable production systems.

You are diligent about capturing and sharing best practices for troubleshooting and resolving issues. You take charge in managing operational incidents, and also enable team members to perform under pressure. You love to use your skills and experience to coach other team members and help them grow. You understand the urgency of resolving production issues in real time, but also establish actions and plans to implement strategic fixes that address root problems.

You understand the importance of clearly communicating (verbally and in writing) the state of production systems to both internal and external stakeholders. You set expectations for your own and your team’s work, and are accountable for commitments and deadlines. You have experience performing and documenting root cause analysis. You have experience accepting responsibility for maintaining applications and systems that were externally developed.

The ideal candidate has demonstrated success managing 24/7 operational systems and teams, and has experience defining acceptance criteria for systems and applications entering production. A strong technical skill set with infrastructure, cloud, container, and monitoring tools and technologies is required.

What You'll Do

    • Manage Red Canary’s 24/7 production systems and the team responsible for maintaining them.
    • Lead by example in addressing production issues, and provide technical mentoring and coaching to direct and indirect reports.
    • Develop playbooks to troubleshoot and address recurring issues.
    • Develop root cause analyses for internal and external stakeholders.
    • Develop and report metrics describing production system availability, uptime, and responsiveness.
    • Develop and implement tools and processes to increase monitoring of production systems and applications.
    • Manage resources to ensure that high-value tasks are prioritized and completed.
    • Collaborate and coordinate with other team leaders to deploy and accept systems into production environments.

Additional benefits of working at Red Canary include:

    • Exceptional healthcare and dental coverage, including fully paid premiums
    • Flexible time off and leave benefits
    • 401k and flex-spending accounts
    • Discretionary fitness and phone stipends

Individuals seeking employment at Red Canary are considered without regard to race, color, religion, national origin, age, sex, marital status, ancestry, physical or mental disability, veteran status, gender identity, or sexual orientation.