Site Reliability Engineer
Taipei/Kaohsiung / KKCompany – Engineering / Permanent / Hybrid
KKCompany Technologies (KKCompany) is a leading technology group in Asia specializing in software services. Our mission is to build “Freeways to Inspiration” and help industries achieve digital transformation. By creating technology highways with partners, we deliver our services around the world and drive value creation through future technology.
Alongside our flagship brands KKBOX, KKStream, and Going Cloud, our core technologies span music streaming, multimedia, and cloud services. Through a range of products and services, we help customers create commercial value. We also offer software services and solutions to tens of millions of customers, with corporate clients across Asia in industries such as telecommunications, entertainment and multimedia, media, education, and fitness.
We have over 500 employees across offices in Tokyo, Singapore, Taipei, Kaohsiung, and Hong Kong.
Responsibilities:
- Engage in and improve the service lifecycle, from design through deployment, operation, and refinement.
- Support services before the production stage through system design consulting, platform development, capacity planning, and launch reviews.
- Maintain services in the production stage by monitoring availability, performance, resources, and other related metrics.
- Construct and scale systems or platforms through automation and infrastructure as code.
- Practice incident response and blameless post-mortems.
Requirements:
- Experience in one or more of the following: Go, Perl, Python, Ruby, or shell scripting.
- Experience with Unix/Linux administration (filesystems, processes, signals, etc.) and networking (TCP/IP, HTTP, DNS, etc.).
- Experience architecting systems with high availability, reliability, scalability, and security.
- Experience with version control systems (Git, Mercurial, SVN, etc.).
Nice to Have:
- Experience in automating infrastructure configuration (Ansible, Chef, Puppet, Terraform, etc.) and monitoring (Cacti, Nagios, Prometheus, etc.).
- Experience in operating containerized environments (Docker Swarm, Kubernetes, Nomad, etc.).
- Experience in managing distributed systems in cloud environments such as AWS or GCP.
- Experience in analyzing and troubleshooting large-scale distributed systems.
- Experience in technical writing or documentation.