AWS Site Reliability Engineer (SRE)

Prescient, Cape Town, ZA

18 days ago

Job description

Job title : AWS Site Reliability Engineer (SRE)

Job Location : Western Cape, Cape Town

Deadline : October 11, 2025

Purpose of role :

We’re looking for an AWS Site Reliability Engineer (SRE) to help us build and operate highly reliable, secure, and scalable cloud platforms. This role is ideal for someone who thrives at the intersection of software engineering, cloud infrastructure, and operations, and enjoys automating everything.

As an AWS SRE, you’ll be a key player in shaping our cloud environment, mentoring engineers, and ensuring our AWS workloads are secure, cost-efficient, and always available.

Duties and responsibilities :

You will be responsible for building and operating resilient infrastructure, automating operational processes, and driving continuous improvements in system performance and availability. This role requires a balance of hands-on technical expertise, problem-solving skills, and a passion for delivering highly reliable services that support critical business operations.

Reliability & Uptime

Design, implement, and maintain highly available and resilient AWS cloud infrastructure.

Monitor system health and performance, ensuring services meet SLAs.

Respond to and resolve production incidents, performing root cause analysis and implementing long-term fixes.

Automation & Scalability

Build automation for deployment, monitoring, scaling, and recovery using Infrastructure as Code (Terraform, AWS CDK, CloudFormation).

Automate repetitive operational tasks to reduce toil and improve system reliability.

Implement CI / CD pipelines to ensure smooth and reliable delivery of applications.

Monitoring & Observability

Configure and manage observability solutions (CloudWatch, Grafana, etc.).

Define and track Service Level Indicators (SLIs) and Service Level Objectives (SLOs).

Develop proactive alerting and anomaly detection mechanisms.

Security & Compliance

Apply AWS security best practices, including IAM governance, secrets management, encryption, and compliance monitoring.

Work closely with InfoSec teams to ensure systems adhere to regulatory standards (e.g., PCI DSS, POPIA, GDPR, ISO27001).

Perform regular audits of cloud resources, ensuring alignment with organizational policies.

Performance & Cost Optimization

Continuously optimize cloud infrastructure for performance, efficiency, and cost-effectiveness.

Analyse usage patterns and right-size resources or recommend reserved / spot instances where appropriate.

Provide visibility into AWS spend and assist teams in cost governance.

Incident & Problem Management

Drive post-incident reviews, documenting learnings and improving runbooks.

Develop self-healing and fault-tolerant systems to minimize the impact of failures.

Collaboration & Continuous Improvement

Partner with development teams to embed reliability, scalability, and observability into applications.

Advocate and implement SRE best practices across the organization.

Mentor engineers on AWS, DevOps, and reliability engineering practices.

Required experience :

Strong experience (> 5 years) with AWS services (EC2, ECS / EKS, Lambda, RDS, DynamoDB, S3, CloudFront, VPC, Route 53, IAM).

Expertise in Infrastructure as Code (Terraform, AWS CDK, CloudFormation).

Proficiency in monitoring & observability tools (CloudWatch, Grafana, ELK / OpenSearch).

Experience with CI / CD pipelines (GitHub Actions, GitLab CI, AWS CodePipeline).

Knowledge of containerization & orchestration (Docker, Kubernetes, ECS, EKS).

Strong scripting / coding skills (Python, Bash, Go, etc.).

Experience with incident management & on-call operations.

Required Qualifications :

AWS Professional certifications.

Experience running Kubernetes / EKS in production.

Knowledge of compliance frameworks (ISO27001, SOC2, PCI-DSS, POPIA).

Key competencies :

Problem-solving mindset with a focus on root cause analysis and prevention.

Strong communication skills to collaborate across engineering, security, and business teams.

Ability to prioritize reliability, scalability, and performance in production systems.

Continuous improvement mindset, with a passion for automation and efficiency.
