Job title : OM Bank - Site Reliability Engineer
Job Location : Gauteng, Cape Town
Deadline : January 30, 2026
Job Description
OM Bank is currently looking for a Site Reliability Engineer to join the OM Bank platform team. The candidate will be responsible for maintaining the OM Bank platform, including first-line support for the platform’s technical services and managing service outages through the incident management process.
KEY RESULT AREAS
- First-line support for all services that comprise the platform
- Managing the incident management process for production incidents, including detection, triage, resolution and driving continuous improvements
- Maintaining the production readiness scorecard defined in Terraform to ensure checks work as expected, and adding new checks to the scorecard workflow
- Creating and maintaining monitors in Datadog that improve observability across the platform
- Engaging with the wider OM Bank product and build team to ensure alignment with the observability standards defined by the platform team
- Designing and implementing enhancements to the platform that contribute towards reducing MTTR (mean time to recovery)
- Designing and implementing automation initiatives, including self-service capabilities
- Implementing Service Level Indicators and Objectives for the platform
- Implementing and maintaining Datadog dashboards for the platform
- Defining and maintaining baseline monitors to be used by product teams
- Maintaining the observability repository that contains all service definitions and observability-related configurations
- Maintaining the feature flagging repository containing all feature flag definitions for product teams
- Maintaining PagerDuty definitions and overall administration
- Fine-tuning monitors to ensure alerts are triggered appropriately
- Leading an action center during a production incident, fostering collaboration across the bank to resolve the outage
- Advising product and platform teams on engineering best practices to ensure services are built with observability and scalability from the start
- Maintaining overall platform health by monitoring key metrics
- Maintaining and extending the SRE API, written in Python and deployed to Kubernetes
ROLE REQUIREMENTS
- Bachelor’s degree in computer science, electrical or electronic engineering, information technology, or a relevant field
- 7+ years of software and platform engineering experience building and supporting scalable services
- 3-5 years of experience writing infrastructure as code (Terraform, AWS CDK, CloudFormation)
- Solid experience using observability platforms like Datadog
- Experience with microservices architecture and RESTful APIs
- Solid Kubernetes experience, displaying end-to-end deployment and maintenance of clusters, including designing and building the infrastructure as code required to deploy the cluster and the cloud resources that support it
- Experience with Kubernetes custom resource management and deployment
- Solid experience deploying Kubernetes resources using Helm charts
- Experience in fine-tuning Kubernetes HPA configs
- Moderate experience using the Go or Python programming languages
- Solid experience using GitOps and general Git-based operations
- Solid infrastructure as code background, displaying experience in designing, implementing and maintaining IaC design patterns that manage large-scale cloud environments
- Solid AWS experience, displaying advanced understanding of cloud architecture and maintaining distributed systems
- Experience maintaining queuing systems like AWS SQS and event streaming platforms like Kafka
- Experience supporting mobile applications
Skills
Action Planning, Application Development, Business Process Design, Computer Literacy, Data Management, Data Modeling, Evaluating Information, Identifying Customer Needs, Information Technology (IT) Support, Market Analysis, Oral Communications, Product Development, Technical Support, Technical Troubleshooting, Test Case Management, User Requirements Documentation, Web Development
Competencies
Business Insight, Collaborates, Courage, Cultivates Innovation, Decision Quality, Drives Results, Ensures Accountability, Manages Complexity
Closing Date
01 November 2025