OM Bank: Site Reliability Engineer
Locations: Johannesburg, Cape Town
Time type: Full time
Posted: Yesterday
Job requisition ID: JR-

Let's Write Africa's Story Together!
Old Mutual is a firm believer in the African opportunity and our diverse talent reflects this.

Job Description
The Site Reliability Engineer will be responsible for ensuring the reliability, scalability, and performance of our digital banking infrastructure.
You will work closely with software engineers, platform engineers, and the security team to proactively prevent issues, resolve incidents, and optimise system health.
This role requires a mix of technical expertise, automation skills, and operational discipline to deliver high availability and performance for critical banking services.

KEY RESULT AREAS

Reliability and Performance Monitoring
Implement and maintain monitoring and alerting systems to track key performance indicators (KPIs) for uptime, latency, and system health.
Define and monitor Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to ensure critical systems meet reliability standards.

Incident Management
Participate in on-call rotations and lead response efforts to quickly resolve system incidents, minimizing downtime and customer impact.
Conduct root cause analysis for incidents, create post-incident reports, and implement corrective actions.

Automation and Infrastructure as Code (IaC)
Develop and maintain automation scripts and tools to streamline operational tasks, including incident response and system provisioning.
Implement Infrastructure as Code (e.g., Terraform, Ansible) to manage and scale infrastructure reliably and repeatably.

Continuous Improvement of CI/CD Pipelines
Collaborate with the platform engineering team to enhance CI/CD pipelines, reducing deployment time and improving stability.
Implement canary and blue-green deployments, rollbacks, and automated testing to ensure reliable releases.

Capacity Planning and Scalability
Analyze system performance and usage patterns to anticipate growth needs, ensuring infrastructure is prepared for peak traffic and future scaling.
Conduct capacity planning and make recommendations for resource allocation, balancing performance and cost.

Observability and Logging
Implement and maintain observability tools (e.g., Prometheus, Grafana, ELK Stack) to gain insights into system behavior and proactively identify issues.
Ensure that logging, metrics, and traceability are set up to enable comprehensive debugging and troubleshooting.

Disaster Recovery and Business Continuity
Contribute to the design and testing of disaster recovery plans to ensure fast recovery of critical services in the event of major incidents.
Regularly test backup and recovery processes to ensure data integrity and system continuity.

Security and Compliance Collaboration
Work with the security team to ensure compliance with banking regulations (e.g., PCI-DSS, GDPR, POPIA) and implement security best practices in system design.
Monitor for and respond to security alerts to maintain a secure infrastructure.

Key Performance Indicators (KPIs)
System Uptime (Availability): Aim for 99.95% or higher for critical systems.
Mean Time to Recovery (MTTR): Target rapid resolution times for incidents (e.g., under 30 minutes).
Incident Volume and Severity: Reduction in the number and severity of incidents over time.
Change Failure Rate: Percentage of changes that result in incidents, aiming to keep it below 5%.
Automated Task Percentage: Proportion of operational tasks automated, improving efficiency and reducing manual errors.
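
To make these targets concrete, the minimal Python sketch below shows the arithmetic behind three of the KPIs. The function names, the 30-day measurement window, and the sample figures are illustrative assumptions only; they are not Old Mutual tooling or data.

```python
from datetime import timedelta


def downtime_budget(availability_pct: float, period: timedelta) -> timedelta:
    """Maximum downtime permitted within `period` at the given availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period * (1 - availability_pct / 100)


def mean_time_to_recovery(recovery_times: list[timedelta]) -> timedelta:
    """Average time taken to restore service across a set of incidents."""
    if not recovery_times:
        raise ValueError("at least one incident is required")
    return sum(recovery_times, timedelta()) / len(recovery_times)


def change_failure_rate(failed_changes: int, total_changes: int) -> float:
    """Percentage of changes that resulted in an incident."""
    if total_changes <= 0:
        raise ValueError("total_changes must be positive")
    return 100 * failed_changes / total_changes


if __name__ == "__main__":
    # Hypothetical figures for a single 30-day window.
    budget = downtime_budget(99.95, timedelta(days=30))
    mttr = mean_time_to_recovery(
        [timedelta(minutes=12), timedelta(minutes=25), timedelta(minutes=41)]
    )
    cfr = change_failure_rate(failed_changes=3, total_changes=80)

    print(f"Downtime budget: {budget.total_seconds() / 60:.1f} minutes")
    print(f"MTTR: {mttr.total_seconds() / 60:.1f} minutes")
    print(f"Change failure rate: {cfr:.2f}%")
```

Under these assumptions, a 99.95% target over 30 days leaves roughly 21.6 minutes of permitted downtime, which is less than a single incident resolved at the 30-minute MTTR target. In practice, the window length and what counts as downtime would be set by the team's SLO definitions.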

ROLE REQUIREMENTS

Educational Background: Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent work experience.
Preferred Qualifications:
Previous experience in the financial services industry or within regulated environments.
Certification in relevant areas, such as AWS Certified SysOps Administrator or Certified Kubernetes Administrator (CKA).
Familiarity with security standards and practices relevant to financial services.
Experience: 3+ years in Site Reliability Engineering, platform engineering, or a related role, ideally within a high-availability or financial services environment.
Technical Skills:
Proficiency in monitoring and observability tools (e.g., Prometheus, Grafana, ELK, Datadog).
Strong scripting skills (Python, Bash, or similar) and experience with automation tools.
Familiarity with cloud platforms (AWS, GCP, Azure) and container and orchestration tools (Docker, Kubernetes).
Experience with CI/CD tools and practices (e.g., GitHub Actions, GitLab CI/CD, Argo CD).
Proficient in Infrastructure as Code (IaC) tools like Terraform or CloudFormation.
Soft Skills:
Strong problem-solving skills, with the ability to troubleshoot complex systems under pressure.
Excellent communication and teamwork skills, with the ability to work cross-functionally with engineering, platform engineering, and security teams.
A proactive mindset, focused on reliability and continuous improvement.

Skills
Action Planning, Application Development, Business Process Design, Computer Literacy, Data Management, Data Modeling, Evaluating Information, Identifying Customer Needs, Information Technology (IT) Support, Market Analysis, Oral Communications, Product Development, Technical Support, Technical Troubleshooting, Test Case Management, User Requirements Documentation, Web Development

Competencies
Business Insight, Collaborates, Courage, Cultivates Innovation, Decision Quality, Drives Results, Ensures Accountability, Manages Complexity

Closing Date: 14 July, 23:59

The appointment will be made from the designated group in line with the Employment Equity Plan of Old Mutual South Africa and the specific business unit in question.