Site Reliability Engineer

Remote / Engineering / Full-Time

Fyusion is a leading machine learning and computer vision company focused on automotive inspections and related applications. Our patented 3D format enables anyone to capture and display interactive 3D images using their smartphone, and enables significant added functionality with deep visual understanding and machine learning-driven analysis.

Founded in 2014, Fyusion is now part of the Cox Automotive family. Our team includes some of the world's top researchers and developers in light field imaging and AI, continuing to push boundaries and innovate at the highest level from our San Francisco research center.

Fyusion is seeking an awesome Site Reliability Engineer to join our Web and Cloud Infrastructure team. We are a close-knit team that enjoys challenges and solving real-world problems. You will have a key role in solving those problems, helping to shape our core automation, data processing, and deployment practices. You will leverage deep knowledge of Amazon Web Services, as well as infrastructure automation tools such as Terraform and Ansible, to develop and maintain a wide range of infrastructure components, including web stacks, database systems, security tools, and networking/cloud environment configurations. You will also proactively seek out system weaknesses and fix them before they cause production issues, using monitoring data, trend analysis, and Chaos Engineering.

We understand this is a complex role, and do not expect you to be an expert in every tool we use. However, we do expect you to be motivated and open to continual self-improvement, adapting to new tools and overcoming new challenges as they come. If you are looking to be challenged, enjoy wearing multiple hats, and thrive in a fast-paced, agile environment, we think you’ll love this role!

Here's what you will be doing:

    • Keep your assigned site or service up and running, or get it back up quickly when a failure occurs.
    • Actively troubleshoot any issues that arise during testing and production, catching and solving issues before launch.
    • Automate work including infrastructure needs, testing, failover solutions, failure mitigation, and much more.
    • Monitor and troubleshoot highly scalable, distributed server clusters that perform various functions, from web servers to machine learning processing.
    • Be on a PagerDuty rotation to respond to availability incidents and provide support for service engineers with customer incidents.
    • Participate in and help establish best practices in Site Reliability Engineering.
    • Manage code deployments, fixes, updates, and related processes.
    • Work with a close-knit team and brainstorm on the best ways to tackle complex problems in infrastructure, security and monitoring.
    • Provide technical guidance and educate team members and coworkers on monitoring and logging. (Have an interesting idea or solution? Present it!)
    • Automate software maintenance processes that previously required manual procedures.

Here's what we are looking for:

    • 3+ years of experience in software engineering, software development, or system operations in highly available, high-traffic environments.
    • Strong experience with Linux-based infrastructures, Linux/Unix administration, and AWS.
    • Experience with SQL and with data stores such as MySQL, Elasticsearch, and Redis.
    • Experience administering Linux servers as well as Docker-based infrastructure (such as Kubernetes or EKS) in a highly available environment.
    • Experience with scripting languages such as Python and Bash.
    • Experience with message broker/queue technologies like RabbitMQ.
    • Experience with modern monitoring, logging, and observability tools for complex distributed systems, such as Grafana, New Relic, Splunk, the Elastic Stack, Datadog, and Prometheus.
    • Practical experience with infrastructure-as-code (with tools like Terraform, Chef, Ansible, etc.).
    • Good understanding of cybersecurity fundamentals and best practices.
    • Experience with containerization and clustering (Dockerfiles, docker-compose, Helm, Kubernetes, etc.).
    • Stellar problem-solving and troubleshooting skills with the ability to spot issues before they become problems.
    • Excellent oral and written communication skills.
    • Process-oriented with great documentation skills.
    • Solid team player!

Here's what we can offer you:

Competitive compensation; health, vision, and dental benefits with premiums paid by Fyusion; a generous PTO plan; company holidays (including your birthday); and the chance to be part of a pioneering technology team!

Fyusion is currently allowing employees to work from home during the global pandemic, but we are excited to move into our brand-new office in SF soon. We will continue to offer these amazing perks when we are all back in the office: commuter benefits, company-catered lunches, a fully stocked snack pantry, tons of company off-sites, and a pup-friendly workplace.

If you read this job description and saw your name all over it, apply! If you read it and think you might need some help hitting all of the points, please apply anyway! We have an entire team that is happy to help and share its knowledge with you.

The benefits do not apply to contract or internship positions.