Site Reliability Engineer - SRE Lille

Lille / Paris
Engineering – Engineering
Full-time (long term)
Hybrid

About the job

Scaleway is looking for a Site Reliability Engineer to join our teams.
Reporting to a Lead SRE, you will be responsible for ensuring that we can reliably serve our products to users around the world. We expect you to have a strong background in development and system administration. Our systems evolve constantly, and the tools we use to observe them and keep them resilient must evolve accordingly.

Minimum qualifications

    • Previous experience as a developer in Go, Python or Rust
    • Experience in system programming with common scripting languages (Bash, Python)
    • Demonstrated ability to troubleshoot production system failures
    • A great attitude and desire to work with a team
    • Passion for incremental improvements to tooling and a love of all things automation
    • Experience with Linux systems (Ubuntu/Debian)
    • Experience with cloud environment architecture (bare metal, virtual machines, containers, orchestrators)
    • Good understanding of computer networks: TCP/IP, DNS, load-balancing, IPv6, BGP and network virtualisation
    • Good understanding of written and spoken English: able to write technical documentation in English and to speak English when needed

Preferred qualifications

    • Experience with infrastructure as code and continuous deployment
    • Experience dealing with physical hardware automation
    • Experience with monitoring & logging systems
    • Experience administering relational databases
    • Knowledge of at least one cloud platform and its related use cases
    • Initiative in proposing new solutions and defending them
    • Team player, willing to share knowledge, opinions, and participate in regular team rituals
    • Good communication and coaching skills

Responsibilities

    • Create new tools and documentation, or optimize existing ones, to help identify, diagnose and remediate production incidents, automating as much as possible
    • Troubleshoot high-impact issues working with multiple engineering teams
    • Take on-call responsibilities, mitigate issues encountered in production and provide our customers with the best possible real-time response
    • Ensure a high quality of service for our customers by leveraging observability and monitoring technologies
    • Manage the lifecycle of products in production
    • Help implement best practices in stability, resiliency, scalability, security and performance across our systems

Technical Stack

    • Python, Go, Rust
    • RabbitMQ
    • PostgreSQL
    • HAProxy, Nginx, REST APIs / Flask
    • S3 API
    • Sentry, Prometheus, Grafana, Elasticsearch, Fluentd, Kibana
    • Ansible, AWX, Foreman, Salt
    • GitLab, Nexus
    • Ubuntu, Debian, CentOS
    • Jira, Confluence, Slack, GSuite

Location

This position is based in our offices in Paris or Lille (France).