Site Reliability Engineering (SRE) Engineer

Cape Town (Remote) / IT – Platform & Infrastructure / Full Time / Remote

Why should you join dLocal?

dLocal enables the biggest companies in the world to collect payments in 37 countries in emerging markets. Global brands rely on us to increase conversion rates and simplify payment expansion. As both a payments processor and a merchant of record where we operate, we make it possible for our merchants to make inroads into the world’s fastest-growing emerging markets.

By joining us, you will be part of an amazing global team that makes it all happen, in a flexible, dynamic culture with travel, health, and learning benefits, among others. Being part of dLocal means working with 600+ teammates from 25+ different nationalities and developing an international career that impacts millions of people’s daily lives. We are builders, we never run from a challenge, and we are customer-centric. If this sounds like you, we know you will thrive in our team.

What's the opportunity?

We are looking for a Site Reliability Engineering (SRE) Engineer to join our team! As our Site Reliability Engineering (SRE) Engineer, you will be focused on the design and implementation of systems that are highly resilient, scalable and reliable. You will be part of a talented team that works on mission-critical applications with big customers like Netflix, Amazon, Nike, Facebook & more!

An SRE Engineer asks the necessary questions:
    • What data do we need in order to understand how our systems are performing?
    • How do we collect this data?
    • What patterns are we looking for in the data and what do they mean?
    • Who should be notified when a certain system is not working properly?
    • Do we have any systems that we need more data for?

An SRE Engineer designs systems and processes to answer the questions above and to provide automated support and response where possible.

What will you do?

    • Develop quality gates based on production-level service level objectives (SLOs) to detect issues earlier in the development cycle.
    • Automate build testing and validation using service-level indicators (SLIs) and SLOs.
    • Influence architectural decisions during initial design stages to ensure resiliency and scale at the outset of software development.
    • Design processes, playbooks and checklists for other engineers to follow during and after incidents.
    • Write postmortems and perform technical after-action reviews to understand root causes and propose system improvements that reduce overall fault rates.
    • Interact with members from almost all teams across the business to understand their monitoring, alerting and SLO / SLA requirements and design systems and processes that ensure we meet or exceed these requirements.
    • Automate the provisioning of monitoring tools and rules with tools like Terraform and Ansible / Chef.
    • Design baseline requirements for new and existing services to ensure that all dLocal infrastructure and code are monitored consistently and accurately at a basic level.
    • Monitor both the technical health and the security health of dLocal infrastructure and systems.
    • Optimize signal-to-noise ratio for alerting to ensure we receive only the alerts that are actionable and make sense.

Which skills do you need?

    • Over 3 years of experience as an SRE Engineer or in a very similar role
    • Experience with monitoring tools such as New Relic, Datadog, Nagios
    • Experience working with tools such as Jira, PagerDuty and Confluence and integrating these tools with automated processing techniques (API integrations)
    • Experience with CI/CD tools such as GitHub Actions, Jenkins, Spinnaker, ArgoCD or similar
    • Knowledge of security best practices and infosec tooling. (You will be writing systems to monitor for breaches and vulnerabilities.)
    • Strong communication skills
    • Problem-solving skills
    • Detail-oriented person
    • Highly analytical person
    • Ability to collaborate across multi-functional teams
    • Cloud experience (AWS) is highly advantageous (as most systems will integrate with AWS at some level).
    • IaC experience with a tool like Terraform is highly advantageous.
    • CaC experience with a tool like Ansible, Chef or Salt is highly advantageous.
    • Database knowledge is highly advantageous (both in terms of how they perform and SQL syntax).

What happens after you apply?

Our Talent Acquisition team is invested in creating the best candidate experience possible, so don’t worry, you will definitely hear from us. We will review your CV and keep you posted by email at every step of the process!

Also, you can check out our webpage, LinkedIn, Instagram, and YouTube for more about dLocal!