Senior Site Reliability Engineer

New York City / Engineering / Full-time / Hybrid
Kontakt.io is building the platform that care operations run on.

We reduce waste, cut costs, and increase revenue by improving throughput, asset utilization, and staff productivity. Our platform uses AI, RTLS, and EHR data to enable self-learning agents that automate workflows, adapt in real time, and orchestrate all care delivery operations.

Easy to deploy and scale, the platform gives a clear picture of spaces, equipment, and people, eliminating inefficiencies and enhancing the patient experience. With measurable 10X ROI and more than 20 use cases, Kontakt.io is the go-to platform for better and faster care delivery operations.

We are looking for a Senior Site Reliability Engineer with strong software engineering expertise to help evolve and support the reliability of our cloud platform. This role is part of our Infrastructure Engineering function and plays a critical role in ensuring the performance, scalability, and availability of our systems and infrastructure.

The ideal candidate brings a deep understanding of software engineering principles applied to infrastructure. Rather than maintaining systems, you will design and build them—developing automation, tooling, and resilient architecture that enable high availability and fault tolerance across our entire AWS-based platform.

As a key technical contributor, you will work closely with engineering teams to define service-level objectives, improve observability, and architect scalable solutions that meet the demanding requirements of real-time healthcare operations. This role requires strategic thinking, a strong sense of ownership, and a commitment to engineering excellence in support of our mission-critical systems.

Responsibilities:

    • Build internal tools and platform-level services to improve the reliability, observability, and scalability of our AWS infrastructure and Kubernetes environments.
    • Design and implement fault-tolerant, self-healing systems that proactively detect and recover from failures.
    • Develop and maintain Infrastructure-as-Code (IaC) using Terraform to automate infrastructure lifecycle management.
    • Lead architecture reviews and performance optimization efforts for high-throughput, real-time systems.
    • Define and own service-level indicators (SLIs), objectives (SLOs), and agreements (SLAs); instrument systems using Prometheus, OpenTelemetry, and Grafana.
    • Automate key reliability processes, including disaster recovery, chaos engineering experiments, deployment pipelines, and auto-scaling mechanisms.
    • Collaborate with engineering and product teams to embed reliability into the software development lifecycle from the outset.
    • Participate in post-incident reviews; implement and track improvements that enhance long-term system resilience.
    • Join an on-call rotation—focused on systems you’ve helped build and harden.

Our requirements:

    • 5+ years of experience in Site Reliability Engineering, Platform Engineering, or Cloud Infrastructure roles in modern, cloud-native environments.
    • Strong software engineering skills with experience writing production-grade code (preferably in Java, Python, Go, or similar).
    • Deep knowledge of AWS infrastructure and services, including VPC design, IAM, ECS/EKS, and networking fundamentals.
    • Proven experience designing and scaling distributed, event-driven systems and understanding their operational failure modes.
    • Hands-on expertise with Kubernetes in production, including multi-region deployments, service meshes, and runtime observability.
    • Experience with CI/CD pipelines, GitOps practices, and infrastructure automation using tools like ArgoCD, Terraform, or Helm.
    • A solid grasp of system performance, reliability engineering, and the tradeoffs between availability, latency, and scalability.

Why You'll Love It Here

    • Own Mission-Critical Reliability – Ensure hospitals and care facilities always stay online with a 99.99% uptime healthcare platform.
    • Scale AI-Powered Infrastructure – Work on real-time automation and self-healing cloud systems that orchestrate care delivery.
    • Drive Big Impact in Healthcare – Help reduce waste, optimize resources, and improve patient care with technology that delivers 10X ROI.
    • Automation-First Culture – Minimize manual ops with cutting-edge automation, observability, and incident response strategies.
    • Join a High-Performing Team – Work with top engineers, AI experts, and healthcare innovators solving real-world challenges.

Ready to Build the Future of Healthcare?
Apply now and help scale the platform that care operations run on. 🚀