Site Reliability Engineer

Melbourne, AU
Stax – Stax
Full Time
Stax is a home-grown start-up revolutionising the AWS cloud experience. Our aim is to provide a platform that automates and streamlines critical tasks to provide a secure and efficient developer experience. Founded by the cloud experts at Versent, Stax has grown to a team of 40 cloud artisans passionate about the quality outcomes we deliver. We have over 200 active customers across 3 continents, ranging from fast food, to luxury, to ASX top 20 companies.

We've cultivated an environment based on creativity. When people who care about their craft are given the freedom to explore possibilities without restriction, amazing things can happen. There are no cool cliques, just a hard-working team generating ideas and devising solutions in a creative, collaborative workspace. 

We're going through a period of massive growth, and we're not slowing down. We're looking for Site Reliability Engineers (SREs) to join our ranks and work alongside a team made up of engineers, developers, IMs and POs. You will have the opportunity to learn from some of the brightest minds in the industry!

The SRE practice at Stax.io

“SRE is what happens when a Software Engineer is tasked with what used to be called Operations.” The SRE practice has been embraced since the early stages of Stax.

The Stax SRE team is a multidisciplinary group of Software and Systems engineers with a focus on reliability, driven by SLOs. 

We’re building a team with engineers from a wide variety of backgrounds, experiences, and perspectives. Diversity, intellectual curiosity, problem-solving, and openness are key to our success. We encourage collaboration, thinking big, and taking risks in a blame-free environment.

We promote self-direction to work on meaningful projects. At the same time, we also strive to create an environment that provides the support and mentorship needed to learn and grow.

What do SREs do?

·       Observability: build and combine monitoring and data analytics tools (metrics, logging, tracing, big data) to measure our customers’ experience and reflect that back into the product development cycles.
·       Incident Management: provide technical support and troubleshoot live production systems, facilitate the response to emergencies, and bring the systems back online as quickly as possible. Take part in the on-call roster with our development teams to support the services after regular office hours.
·       Postmortems: bring a blameless culture, learn from failure, provide and implement recommendations to improve reliability, ensure high SLO attainment.
·       Testing & Release Process: ensure the services delivered to customers are reliable, scalable, maintainable, and consequently will meet the SLOs. 
·       Capacity Planning: measure and report on systems capacity to guarantee service reliability over time as load increases or decreases.
·       Development: system architecture and software development projects for SRE needs, or as part of a development team. Provide development standards, promote reusability, and reduce complexity to improve reliability. Automate, automate, automate.
·       Product: Consult in service architecture and design conversations. Liaise directly with the customer success team. Facilitate the feedback loop between customers and product teams.

How do we work as SREs?

At Stax.io, we adopted the SRE “kitchen sink” implementation. While the SRE practice can spread across all topics and services, SREs also manage their workload based on SLOs and, ultimately, business needs.

·       SRE Projects: architecture and development work for the needs of SRE foundations. Build SRE services consumed by the Product team.
·       Tooling: Strong focus on implementing tools to assist with the reliability of product development.
·       Embedded: working directly as a developer as part of a product team with a focus on implementing SRE practices and tools.
·       Consultation: act as a consultant across several projects to assist with the end-to-end lifecycle of Stax services.

The SRE Profile

The SRE profile can cover a wide range of personal and technical skills. Whether you can cover them all, or you are willing to learn them, the key is to know your strengths and weaknesses. We encourage diversity, and there is room for everyone to learn. We believe that there is no ideal background, but there is a mindset and a passion for your craft.

Relevant Backgrounds:

    • Site Reliability Engineer
    • DevOps Engineer
    • Cloud Operations
    • Security Operations
    • Software Developer
    • Yak Shaver

Behaviours and Personal Skills:

    • Understand the difference between SRE and a traditional Operations team.
    • Be driven by SLOs and error budgets.
    • Take documentation seriously, very seriously.
    • Passionate about continuous learning: the unknown is a daily occurrence, and routine does not exist.
    • Ability to switch context frequently and work well in high-pressure situations.
    • Collaboration and knowledge sharing are critical.
    • Process-driven and methodical, with a knack for continuous improvement.
    • Technical at heart, ready to roll up your sleeves and get things done.
    • Ability to “fix the hotel while it’s open”.
    • Do not fear complexity while promoting simplicity. 
    • Self-driven, with the ability to manage your own workload.

Valued Technical Skills:

    • Experience with AWS infrastructure/services.
    • Familiarity with serverless technologies and approaches on AWS.
    • A working knowledge of Web Applications.
    • Experience with automation concepts and tools.
    • Experience with one or more programming languages and development tools.
    • Experience with CI/CD pipelines.
    • Understanding of Observability and experience with monitoring tools for distributed systems/application architecture.
    • Knowledge of System Authentication / Authorisation principles, with an appreciation for federation protocols such as SAML and OAuth.
    • Understanding how to scale a distributed application, as well as monitor, maintain and improve it safely.
    • Understanding of how to build for failure.
    • Understanding of, or experience with, building scalable systems.
    • Excellent troubleshooting ability – ability to diagnose and resolve problems in high throughput web apps and network services.

Technologies you will be working with:

    • AWS (amongst other things: VPC, Step Functions, ECS, Lambda and IAM)
    • Buildkite
    • Datadog
    • Honeycomb
    • Docker
    • Keycloak
    • Gatsby
    • MySQL
    • Python Serverless functions and automation
    • Golang Serverless functions and automation
    • JavaScript and TypeScript Serverless functions, front end and automation
    • Ruby Serverless functions and automation

This role is ideally based at our offices in Melbourne or Sydney. However, we are open to remote work, so you can be based anywhere in Australia!

Our values reflect the way we work. We’re a casual, inclusive bunch, with team members from a variety of backgrounds collaborating as a team to overcome challenges. Everyone is given space to learn and develop their skills and knowledge. We support each other in all ventures, whether attaining a new AWS certification or trying our hand at baking sourdough or brewing beer. We create remarkable experiences for our customers and we treat others the way we would like to be treated.