Let’s Write Africa’s Story Together!
Old Mutual is a firm believer in the African opportunity and our diverse talent reflects this.
Job Description
OM Bank is currently looking for a Site Reliability Engineer to join the OM Bank platform team. The candidate will be responsible for maintaining the OM Bank platform, including first line support for the platform’s technical services and managing service outages through the incident management process.
Key Result Areas
- First line support for all services that comprise the platform
- Managing the incident management process for production incidents, including detection, triage, resolution and driving continuous improvement
- Maintaining the production readiness scorecard defined in Terraform to ensure checks are working as expected, and adding new checks to the scorecard workflow
- Creating and maintaining monitors in Datadog that improve observability across the platform
- Engaging with the wider OM Bank product and build team to ensure alignment with the observability standards defined by the platform team
- Designing and implementing enhancements to the platform that contribute towards reducing MTTR (mean time to recovery)
- Designing and implementing automation initiatives including self-service capabilities
- Implementing Service Level Indicators & Objectives for the platform
- Implementing and maintaining Datadog dashboards for the platform
- Defining and maintaining baseline monitors to be used by product teams
- Maintaining the observability repository that contains all service definitions and observability related configurations
- Maintaining the feature flagging repository containing all feature flagging definitions for product teams
- Maintaining PagerDuty definitions and overall administration
- Fine-tuning monitors to ensure alerts are triggered appropriately
- Leading an action center during a production incident, fostering collaboration across the bank to resolve the outage
- Advising product and platform teams on engineering best practices to ensure services are built with observability and scalability from the start
- Maintaining overall platform health by monitoring key metrics
- Maintaining and extending the SRE API, written in Python and deployed to Kubernetes
Role Requirements
- Bachelor’s degree in computer science, electrical or electronic engineering, information technology, or a relevant field
- 7+ years of software and platform engineering experience building and supporting scalable services
- 3-5 years of experience writing infrastructure as code (Terraform, AWS CDK, CloudFormation)
- Solid experience using observability platforms like Datadog
- Experience with microservices architecture and RESTful APIs
- Solid Kubernetes expertise, including end-to-end deployment and maintenance of clusters, and designing and building the infrastructure as code required to deploy the cluster and the cloud resources that support it
- Experience with Kubernetes custom resource management and deployment
- Solid experience deploying Kubernetes resources using Helm charts
- Experience in fine-tuning Kubernetes HPA configs
- Moderate experience using the Go / Python programming languages
- Solid experience using GitOps and general Git-based operations
- Solid infrastructure as code background, with experience in designing, implementing and maintaining IaC design patterns that manage large-scale cloud environments
- Solid AWS experience, displaying advanced understanding of cloud architecture and maintaining distributed systems
- Experience maintaining queuing systems like AWS SQS and event streaming platforms like Kafka
- Experience supporting mobile applications
Closing Date
01 November 2025, 23:59
The appointment will be made from the designated group in line with the Employment Equity Plan of Old Mutual South Africa and the specific business unit in question.