Sr SRE Incident Manager 202394

Pleasanton /
Technology & Operations – Cloud Operations /
Full time
Ellie Mae is the leading cloud-based platform provider for the mortgage finance industry. Ellie Mae’s technology solutions enable lenders to originate more loans, reduce origination costs, and reduce the time to close, all while ensuring the highest levels of compliance, quality and efficiency. Visit EllieMae.com to learn more.

We are looking for an SRE Incident Manager to join our growing Cloud Operations team. This is a hands-on incident management role that will scale Ellie Mae Cloud Operations for our growing customer base while continuing to deliver high performance and availability. We have an unwavering zeal to make our customers successful. The ideal candidate takes pride in resolving production incidents and can leverage their experience to solve complex problems associated with running production environments at massive scale in multi-tenant environments.

The SRE Incident Manager drives strategic and tactical management of production incidents and improves production quality of service and metrics for availability, scalability, performance, etc. This is a fantastic opportunity to work and collaborate closely with our Tech Support, Software Engineering, Architecture, Infrastructure, DBA, SRE, DevOps and Cloud Platform teams at Ellie Mae. You will partner with our Site Reliability Engineers (SREs), who are responsible for ensuring Ellie Mae services are highly available, reliable, secure and scalable.

Responsibilities & Objectives

    • Takes a central role as Incident Manager during critical incidents focusing on minimizing MTTR & MTTD
    • Participates in After Action Reviews and facilitates discovery of Root Cause.
    • Identifies, evaluates and executes preventive measures to minimize or avoid impact to the customer experience (proactive vs. customer-escalated).
    • Conduct Root Cause Analysis and drive repair of Problem Records through to closure in order to prevent recurrence, including, but not limited to, resolution of product/service defects or design changes, infrastructure changes, or operational changes
    • Partner with multiple SRE pillars and lead by example: be a contributor more than a delegator
    • Ensure services are designed with 24/7 availability and operational readiness and rigor
    • Technical hands-on knowledge in order to function as a technology leader
    • Handle operations escalations and troubleshoot production issues
    • Employ deep troubleshooting skills to improve the availability, performance, and security of Ellie Mae Services
    • Hands-on exposure to proactive monitoring, alerting, trend analysis and self-healing systems
    • Participate in on-call rotations, driving restoration and repair of service-impacting issues
    • Present monthly incident availability and operability metrics to cross-org leadership team

Requirements

    • 5+ years of Systems Engineering in 24x7 Production Services environments
    • BS in Computer Science, Computer Engineering, Math, or equivalent professional experience
    • Experience with Datacenter Technologies including Public Cloud (AWS/Azure)
    • Familiarity with Containerization concepts like Docker, and PaaS services on AWS.
    • Experience with elastically scalable, fault-tolerant and other cloud architecture patterns
    • Excellent troubleshooter, utilizing a systematic problem-solving approach spanning code, systems, and network theory & protocols (TCP/IP, UDP, ICMP), including the ability to read a packet capture/tcpdump, etc.
    • Demonstrated experience in designing, analyzing, and diagnosing large-scale distributed systems, as well as Windows Server and/or Linux systems internals (system libraries, file systems, client-server protocols)
    • NoSQL/Docker/Micro-services/Forensic-Analysis experience is a big plus
    • Demonstrated strength in SaaS services, experience in massive scale web operations
    • Exposure to Change Management within an administration role or knowledge of ITIL service support principles.
    • Hands-on experience using JIRA to its fullest potential.

Ellie Mae is an equal opportunity and affirmative action employer. Women, minorities, people with disabilities, and veterans are encouraged to apply.

We do not accept resumes from headhunters, placement agencies, or other suppliers that have not signed a formal agreement with us.