Successful candidates must be comfortable working weekends and rotating shifts as and when required.
Key Responsibilities:
- Manage high-severity and high-customer-impact incidents, focusing on fast recovery
- Champion production resilience and availability, focusing on superior client experience, by working with the operations team and technology development teams
- Drive the implementation of Site Reliability Engineering (SRE) and Chaos Engineering design for all strategic systems
- Drive effective communication between business and technology with regard to production service reliability and performance
- Drive continuous improvements in processes or systems leveraging Site Reliability Engineering methods
- Respond to, evaluate, and analyze production incidents to minimize their impact, and devise innovative solutions to prevent them in the future
- Improve the reliability and availability of systems by gathering hard data and designing systems for increased service reliability and performance
- Provide expert advice and training to our engineers on which technology solutions and advanced reliability techniques to use in each situation
- Any other ad-hoc duties as assigned by supervisors
Requirements:
- Bachelor's degree in Computer Science or related field
- 10+ years of relevant experience
- Experience driving major production incidents and organizing incident retrospective meetings
- Experience with Core Java 8, Cloud Foundry, non-relational databases, and Linux/Unix systems
- Experience with high availability, high-scale, and performant systems
- Experience with Python and Unix scripting
Interested applicants, please email your resume to Douglas Tan Yu Feng
Email: [Confidential Information]
CEI Reg No: R26160004
EA Licence No: 99C4599
Recruit Express Pte Ltd