Site Reliability Engineer (Vancouver)

Vancouver / Tech – SWAP / Full-time / Remote
Gauss Labs is seeking a highly skilled Site Reliability Engineer to join our team in Vancouver. As an SRE at Gauss Labs, you will play a critical role in ensuring our industrial AI platform's reliability, performance, and scalability. You will be responsible for building and maintaining a robust solution that supports our growing business at customer sites. This role requires a high level of technical expertise, a collaborative mindset, and a strong desire to continuously improve systems and processes.

Responsibilities

    • Monitoring and Alerting: Creating and maintaining robust monitoring systems to proactively identify and resolve issues before they impact customers. Implementing effective alerting mechanisms to ensure timely response to critical events.
    • Incident Response: Participating in on-call rotations and leading incident response efforts to minimize downtime and restore service quickly.
    • Automation: Developing and implementing automation tools and scripts to streamline operations, reduce manual effort, and improve efficiency.
    • Capacity Planning: Forecasting resource needs, optimizing resource utilization, and ensuring customers' infrastructure can handle increasing workloads.
    • Performance Optimization: Identifying and resolving performance bottlenecks, optimizing system performance, and improving response times.
    • Collaboration: Partnering with software engineers, data scientists, and other teams to ensure alignment and efficient operations.
    • Customer Focus: Working closely with the AI Program Manager and Technical Account Manager to understand customer issues, provide technical support, and improve customer satisfaction.
    • Continuous Improvement: Driving a culture of continuous improvement by identifying opportunities to enhance system reliability, performance, and efficiency.

Basic Qualifications

    • Bachelor's degree in computer science, engineering, or a related discipline
    • 5+ years of industry experience as a Site Reliability Engineer
    • Experience with cloud platforms (AWS, GCP, Azure), containerization technologies (Docker, Kubernetes), and observability and alerting tools (Prometheus, Grafana, Elasticsearch, Jaeger)
    • Experience with scripting languages (Python, Bash)
    • Working knowledge of GitHub, GitHub Actions, and CI/CD concepts
    • Experience in ticket management, issue resolution, and troubleshooting
    • Strong problem-solving and troubleshooting skills
    • Excellent customer communication and interpersonal skills, with fluency in spoken and written English

Preferred Qualifications

    • Knowledge of AI/ML infrastructure and workloads
    • Knowledge of big data technologies (Kafka, Flink)
    • Knowledge of database technologies (MongoDB, PostgreSQL)

Hiring Process

Application review → Phone interview → Virtual onsite interview → VP interview / Core Value interview