Platform Incident Manager
Technology – Infrastructure
Our Mission 🚀
Trainline is the leading independent rail and coach travel platform, selling rail and coach tickets to millions of travellers worldwide. Via our highly rated website and mobile app, people can seamlessly search, book and manage their journeys all in one place. We bring together millions of routes, fares and journey times from 270 rail and coach carriers across 45 countries. We offer our customers the best price for their journey and smart, real-time travel information on the go. Our aim is to make rail and coach travel easier and more accessible, encouraging people to make more environmentally sustainable travel choices.
Introducing the Platform Delivery organisation 👋
The Platform Delivery team cover all areas of infrastructure, reliability, platform and operations engineering across public cloud and data centres: Windows & Linux builds, deployment and management, CDN configuration, load balancing, PKI and a variety of other technologies that combine to provide the Platform for all other teams to use.
We are a fast-growing company that loves leveraging new technology to build world-class products for our customers.
We run a diverse platform hosted on AWS, utilising the best of what it has to offer. Coupled with our own tooling, this allows us to embrace Continuous Delivery, DevOps/SRE and cloud-native microservices to their full potential.
You will often find members of our leadership team, as well as our engineering community, speaking at meetups and conferences.
Platform Incident Management are at the forefront of incident response, management and platform availability.
We are looking for an Incident Manager with exposure to, or a passion for understanding, complex e-commerce platforms and systems. You will be joining an Incident Management Team that is responsible for key operational processes, helping to ensure that the service we provide is world-class and that, if problems do occur, they are managed effectively and resolved quickly.
As Platform Incident Manager, you will be part of a team that runs incidents and retrospectives and collaborates across all areas of tech and product engineering. As a team, we strive to learn from experience and push development to prevent recurrence. You will be responsible for driving the restoration of platform services, delivering clear communication of incidents to internal and external stakeholders and contributing to the continual improvement of our platform's reputation.
We have a relaxed culture but are very serious about what we do and how we do it. We want people who can thrive in a high-pressure, highly transactional environment where personal leadership and initiative are valued and rewarded.
· You will be heavily involved in major incidents relating to Production environments. This includes triggering the initial incident event, leading the rapid response to restore service and identifying follow-up preventative measures
· You will use monitoring tools to collate and report on stability, using these reports to improve MTTR (T1, T2 and T3) by holding teams accountable for their service quality
· You will take ownership of, and provide priority support for, the retailing and fulfilment systems, from handling critical incidents to proactively working on preventative measures. By learning about and questioning the status of the platform, you will ensure that issues are not forgotten after they are resolved
· You will use your own experience and learning to bring a fresh approach to processes. We want you to think outside the box, coming up with innovative solutions and raising the bar each time
· You will participate in a Major Incident Management (MIM) On-Call schedule to ensure that our teams and systems are supported at all times
· You will have a professional approach to internal and external interactions, building confidence in your abilities and those of your team
· You will work closely with the teams to ensure that escalation, impact assessment and correct management of identified issues happen within SLAs
· You will work closely with technical teams and lead incident management, stakeholder engagement and reporting for all P1 and P2 incidents
Knowledge & Experience
· Experience of being part of a team running Incident and Change Management in a high-pressure, fast-moving environment
· Excellent interpersonal, relationship building and influencing skills
· Ability to collate and report on a variety of data showing incident impact across a range of areas
· Experience presenting data at various levels, with a proven ability to tailor it to the audience
· Experience running retros and ensuring that the root cause is identified without assigning blame
· Analytical approach to decision making, with the confidence to make the final call
· Experience of juggling multiple tasks and priorities in a fast-paced, highly transactional environment
· Excellent communication skills, both written and verbal, and the ability to remain calm under pressure
· Experience of critical incident response in a DevOps or Agile environment within the e-commerce domain
· Experience with monitoring and observability toolsets (New Relic, ELK, Grafana, PagerDuty)
· Exposure to AWS Cloud and a basic understanding of its core products and services (IaaS, PaaS)
· ITIL v3/v4 Foundation Certificate
Our Culture 🤗
Everything begins with great people. As well as aptitude, we put a heavy emphasis on attitude.
Coaches Over Heroes
- We prioritise being one team over elevating the heroics of an individual. For us, the true heroes are those who excel at nurturing and coaching and who are generous in sharing their knowledge with others.
- Everything that we do takes into account the morale of every member of our team, their opportunities for growth and for participation in exciting challenges.
Mentoring and Learning
- We have a mentoring community that is constantly growing, and we provide people with mentors or buddies from various teams.
- We hire awesome people capable of making smart decisions; empowerment is a great enabler of agility.
We value open expression at Trainline. We believe it’s the diversity of experience, backgrounds and perspectives of our employees that makes us who we are. We encourage everybody to play a part in changing the way people travel across the world.