Senior Cloud Operations Engineer
Services and Support – Cloud Ops
Our vision is to be the most trusted, flexible, and easy-to-use hybrid cloud data platform. Actian is transforming industries by empowering companies to accelerate application modernization and simplify the cloud journey. Our customers use the Actian Data Platform to unify their siloed data, explore and securely exchange it, and run a variety of analytic workloads that provide real-time business insights at a fraction of the cost. 24 of the Fortune 100 companies use Actian technology in some of the most mission-critical applications that impact your daily life.
We’re looking for a Senior Cloud Operations Engineer to join the Cloud Operations Team. This is a unique opportunity to join a newly formed team that will focus on world-class monitoring and alerting, platform performance, availability, reliability, and capacity planning. The right candidate will have a software development mindset and will automate as much as possible to avoid repetitive tasks. The individual will work closely with Engineering teams to optimize the deployment and monitoring of mission-critical, customer-facing systems across private and public cloud environments.
Cloud Operations Engineers (COEs) are responsible for keeping all production systems at Actian running smoothly. COEs are expected to apply sound engineering principles and operational discipline to develop and deliver automation into our environments, and to create monitoring and telemetry that give insight into the patterns that govern our success and allow Actian to deliver on its uptime commitments. You will also develop and enhance the CD pipeline to deliver successful builds to production. The COE role exists to drive operational excellence through automation and monitoring, working closely with development teams to build reliability directly into our products and architecture. In the Cloud Operations Team, you will be at the center of driving improvements and change across the organization, in addition to accelerating our adoption of containers and Kubernetes.
The successful candidate is customer-focused, a self-starter, and able and willing to work with geo-dispersed teams. This role will also be responsible for mentoring less-experienced staff.
- Monitor and debug issues across the platforms (applications, networks, databases)
- Administer, maintain, and automate systems to ensure reliability, resiliency, scalability, and security
- Deploy, maintain, and enhance monitoring solutions and provide technical resolutions and root cause analysis for high severity incidents
- Work closely with Engineering and Software Development teams to design, deploy, and operate components/services that are automated, resilient, and scalable
- Ensure that documented SSAE Policies and Procedures are followed and enforced
- Create, update, and maintain documentation for all configurations for the production environment
- Maintain and ensure the readiness and availability of disaster recovery environments
- Develop and deliver timely reports on service metrics including but not limited to availability, capacity, performance, and latency across all production systems
- Manage a 24x7x365 regional operational team
Must-have Skills & Qualifications
- Be willing to bring your best to work every day, challenge the status quo, and move the Actian culture forward
- Bachelor’s Degree in Computer Science or equivalent experience related to Information Technology
- 3+ years’ experience as a Cloud Operations Engineer or Site Reliability Engineer managing a SaaS / PaaS / IaaS environment
- Experience managing Linux and Windows Server
- Experience with configuration and automation toolsets such as Terraform, Puppet, Chef, and Ansible
- Experience in monitoring a global Cloud footprint, including hands-on work with modern monitoring platforms and time-series databases such as Grafana, Prometheus, DataDog, SumoLogic, Nagios, or Zenoss
- Experience in the design and/or deployment of Public Cloud technologies (AWS, Azure, GCP)
- Experience with network services and protocols such as DNS, DHCP, WAN routing, TCP/IP networking, LDAP, NFS, and SMTP
- Knowledge of RDBMS systems such as MySQL and SQL Server
- Experience with containerization and container orchestration, especially Docker and Kubernetes
- Experience in the deployment and management of microservices
- Experience maintaining and managing Spark, Kafka, Tomcat, Cassandra, and MySQL based systems
- Proficient in Python, Bash, SQL, or Java
- Ability to write and present effective materials, including presentations, status reports, technical diagrams, and flowcharts
- Ability to use problem-solving techniques, such as root cause analysis, to resolve issues
- Solid understanding of incident management, change management, and problem management
Nice to have
- Experience working with a globally distributed team
- Understanding of software development lifecycle and CI/CD pipelines
- Certification in any of the following: AWS, Azure, or GCP
We value diversity at our company. All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, or any other characteristic legally protected in the location in which the candidate is applying.