Manager - Site Reliability Engineering

São Paulo
Engineering – Backend Engineering
Full-time
Who we are 

TFG is the largest mobile game company in Latin America, and one of the largest in the world. In 8 years, we have released over 70 games, including hits such as Sniper 3D, the leading FPS game on App Store and Google Play Store, and Colorfy, the world's most popular coloring app. Our games have been downloaded 1 billion times in 125 countries. The team started with two brothers, and now there are around 320 of us – and counting. To build the very best mobile games, we gather exceptional talent in software engineering, art and animation, product design and management, marketing, and data science.

About the Team

An engineer on our team works with global-scale infrastructure and has a high impact on millions of players. To guarantee the best experience possible, we rely on several Kubernetes clusters spread around the world and connected to each other. We are at the cutting edge of open-source infrastructure technology: we adopted Kubernetes in production shortly after the project was launched, and today we use technologies such as eBPF and Cilium in our network stack. We handle billions of logs daily and run hundreds of nodes and thousands of containers to serve more than 1 million requests per minute. We know this number will only grow, and we're looking for engineers who can help with the challenges of provisioning and operating infrastructure at large scale.

About the Role 

TFG Co is searching for infrastructure/site reliability managers to join our team. On the technical side, you'll need to find bottlenecks and solve performance problems in distributed systems with ease. You'll also make decisions about broader cloud infrastructure themes, such as when to upgrade components, when to replace them, or even when to conceive and lead the development of new ones. You'll be responsible for guiding and developing highly skilled engineers and will work closely with other managers and leaders to improve our processes and ensure we have a world-class engineering organization.

More about you

    • Team focused. You believe in influencing through developing relationships rather than exercising authority. You always look for ways to improve your team's health, for example, by balancing operational responsibilities with development.
    • Technically skilled. You have in-depth knowledge about the Linux kernel and have built a career focused on studying how to deploy, operate, and monitor highly scalable systems. You are happy when you contribute to planning and design discussions.
    • Having the best engineers is key to scaling. We look for leaders who thrive when hiring the best engineers and promoting their growth. You believe in providing continuous, thoughtful feedback and in recognising the individual strengths and contributions of your team members.

What you’ll do

    • On a daily basis:
    • Participate in and guide the development of automation and infrastructure-as-code tools;
    • Guide troubleshooting and incident management in production. Our stack includes Kubernetes, Kafka, Elasticsearch, PostgreSQL, and Cassandra;
    • Ensure the team's engineers are happy and productive;
    • Attract and retain talent.

    • On a weekly basis:
    • Conduct 1:1s with the infrastructure team's engineers;
    • Plan team activities and engineers' allocation;
    • Report status to the director (what was delivered in the previous week and what will be delivered next).

    • On a quarterly basis:
    • Set the team's goals aligned with the company's OKRs.

Current challenges:

    • Short-term (6 months)
    • Grow the SRE/Cloud Infrastructure team;
    • Improve all on-call and monitoring processes.

    • Medium-term (1 year)
    • Create a team for cloud cost control;
    • Create a team for cloud security.

What you'll need

    • Bachelor's degree in Computer Science, Computer Engineering, or equivalent experience;
    • Linux knowledge. You should be able to discuss in detail what happens under the hood (OS, kernel, network);
    • 5+ years of professional software development experience;
    • Experience managing teams of at least 5 people.

Plus

    • Experience with large-scale production systems and technologies;
    • Experience with Kubernetes and containers in general;
    • 8+ years of professional software development experience;
    • Experience managing teams with 10+ engineers and being responsible for creating a team from scratch;
    • Project management.
We welcome people from all backgrounds who seek the opportunity to help build the best gaming company, where everyone thrives.