Site Reliability Engineer

Raleigh-Durham, NC
Engineering – Engineering
Full-time
SignalFx is built on a highly distributed, microservices-based architecture that ingests, analyzes and stores billions of time-series metrics and events, all in real time. If you have an interest in or passion for monitoring and large-scale time-series data processing, we would like to hear from you. Our goal is to continually improve the performance and reliability of this system as it evolves to handle orders of magnitude more data sent by our rapidly growing customer base.

Site Reliability Engineers at SignalFx are hybrid software/systems engineers whose overarching goal is to ensure that Production Services are always up and running reliably. They are also responsible for improving Operational Efficiency, Optimal Utilization and System Resiliency of the SignalFx Platform. They own some of the Open Source Software that our platform relies on, and are core participants in every significant engineering effort underway in the company.

Responsibilities

    • Automate and operationalize engineering tasks on Backend Services: data migrations, performance tuning, capacity changes, etc.
    • Monitor Capacity & Utilization and work closely with the Infrastructure team to orchestrate scale-up/down of Backend Services.
    • Own & operate critical back-end Open Source Services such as Cassandra, Kafka, Zookeeper and Elasticsearch.
    • Build tools and design processes that help improve observability and system resiliency of the SignalFx Platform.
    • Triage Site Availability Incidents and proactively work towards reducing MTTR for customer-impacting incidents.
    • Partner with Service owners to implement Service Level Metrics & Service Level Objectives that act as service level health indicators.
    • Establish design patterns for monitoring, benchmarking and deploying new features for the backend services.

Requirements

    • BS degree in Computer Science or a related technical field, or equivalent practical experience.
    • 5+ years of experience as a Site Reliability Engineer, Production Engineer or Backend Software Engineer for web-scale or similar platforms. 
    • Coding experience in one or more of Python, Bash, Go or Java.
    • Experience building or operating high-performance distributed systems.
    • Experience with one or more OSS technologies like Kafka, Cassandra, Zookeeper or Elasticsearch.
    • Understanding of Unix/Linux systems from kernel to shell and beyond, taking in system libraries, file systems, and client-server protocols along the way.