Site Reliability Engineer
Technology & Operations – Technical Operations
Ellie Mae is the leading cloud-based platform provider for the mortgage finance industry. Ellie Mae’s technology solutions enable lenders to originate more loans, reduce origination costs, and reduce the time to close, all while ensuring the highest levels of compliance, quality and efficiency. Visit EllieMae.com to learn more.
SRE team members know that customer success comes first and respond promptly to requests. SRE requests normally arise in the context of a specific customer issue, but they can occasionally be broad in scope (for example, general outages affecting many users). The SRE team is trained to address platform issues and functional/non-functional product issues.
- Resolve application product issues and act as the first line of troubleshooting for private/public cloud infrastructure (system, database, networking) alerts, both automated and manually reported.
- Perform health checks on applications and infrastructure (verification, alerts, etc.) to identify and proactively pre-empt issues.
- Help restore and resolve system, database, and application performance issues
- Help restore and resolve security alerts/incidents
- Perform maintenance and restarts of applications and their underlying system hardware
- Participate in assessing and scheduling system software upgrades
- Support product releases, including major releases, maintenance upgrades, and Requests to Release (RTRs), under the direction of release managers.
- Perform infrastructure changes
- Take part in the SRE team's daily (standup) production review
Job Requirements (Education, Experience and Skills)
- BS in Computer Science, Computer Engineering, Math, or equivalent professional experience
- 5+ years of systems engineering in a 24x7 production services environment
- 3+ years’ experience supporting a SaaS environment in a public or private cloud infrastructure, with a solid understanding of cloud architectures
- Deep knowledge of Windows Server or Linux systems internals (system libraries, file systems, client-server protocols)
- Fluency with at least one current generation scripting language used by DevOps professionals (Python, Bash, Perl)
- Experience coding in a higher-level language such as Java or C#
- Seasoned professional in critical incident triage and response
- Effective working under pressure
- Excellent troubleshooter, utilizing a systematic problem-solving approach spanning code, systems, and network
- Demonstrated experience in designing, analyzing, and diagnosing large-scale distributed systems
- Self-starter who can take ownership of technical issues, follow through to resolution, and perform root cause analysis (RCA)
- Ability to work during US working hours and weekends as required
- Ability to travel to the US, Poland, and Belarus
The ideal candidate will also have:
- Experience creating Terraform scripts and supporting Spinnaker
- Experience deploying and supporting microservices, Kubernetes and Docker preferred
- Experience with network theory and protocols (TCP/IP, UDP, ICMP, DNS, load balancing) and the ability to read a packet capture/tcpdump
- Security triage and forensic analysis experience
- Experience administering Elasticsearch, Cassandra, MySQL, or MS SQL Server
Ellie Mae is an equal opportunity and affirmative action employer. Women, minorities, people with disabilities, and veterans are encouraged to apply.
We do not accept resumes from headhunters, placement agencies, or other suppliers that have not signed a formal agreement with us.