Direct message the job poster from Smart4 Energy
We’re looking for an AWS Site Reliability Engineer (SRE) to help build and operate highly reliable, secure, and scalable cloud platforms. This role is ideal for someone who thrives at the intersection of software engineering, cloud infrastructure, and operations, and enjoys automating everything.
As an AWS SRE, you’ll be a key player in shaping the cloud environment, mentoring engineers, and ensuring AWS workloads are secure, cost‑efficient, and always available.
Duties and Responsibilities
Reliability & Uptime
- Design, implement, and maintain highly available and resilient AWS cloud infrastructure.
- Monitor system health and performance, ensuring services meet SLAs.
- Respond to and resolve production incidents, performing root cause analysis and implementing long‑term fixes.
Automation & Scalability
- Build automation for deployment, monitoring, scaling, and recovery using Infrastructure as Code (Terraform, AWS CDK, CloudFormation).
- Automate repetitive operational tasks to reduce toil and improve system reliability.
- Implement CI/CD pipelines to ensure smooth and reliable delivery of applications.
Monitoring & Observability
- Configure and manage observability solutions (CloudWatch, Grafana, etc.).
- Define and track Service Level Indicators (SLIs) and Objectives (SLOs).
- Develop proactive alerting and anomaly detection mechanisms.
Security & Compliance
- Apply AWS security best practices, including IAM governance, secrets management, encryption, and compliance monitoring.
- Work closely with InfoSec teams to ensure systems adhere to regulatory standards (e.g., PCI DSS, POPIA, GDPR, ISO27001).
- Perform regular audits of cloud resources, ensuring alignment with organizational policies.
Performance & Cost Optimization
- Continuously optimize cloud infrastructure for performance, efficiency, and cost‑effectiveness.
- Analyse usage patterns and right‑size resources or recommend reserved / spot instances where appropriate.
- Provide visibility into AWS spend and assist teams in cost governance.
Incident & Problem Management
- Drive post‑incident reviews, documenting learnings and improving runbooks.
- Develop self‑healing and fault‑tolerant systems to minimize impact of failures.
- Partner with development teams to embed reliability, scalability, and observability into applications.
- Advocate and implement SRE best practices across the organization.
- Mentor engineers on AWS, DevOps, and reliability engineering practices.
Required Experience
- Strong experience (>5 years) with AWS services (EC2, ECS / EKS, Lambda, RDS, DynamoDB, S3, CloudFront, VPC, Route 53, IAM).
- Expertise in Infrastructure as Code (Terraform, AWS CDK, CloudFormation).
- Proficiency in monitoring & observability tools (CloudWatch, Grafana, ELK / OpenSearch).
- Experience with CI/CD pipelines (GitHub Actions, GitLab CI, AWS CodePipeline).
- Knowledge of containerization & orchestration (Docker, Kubernetes, ECS, EKS).
- Experience with incident management & on‑call operations.
Required Qualifications
- AWS Professional certifications.
- Knowledge of compliance frameworks (ISO27001, SOC2, PCI‑DSS, POPIA).
- Problem‑solving mindset with focus on root cause analysis and prevention.
- Strong communication skills to collaborate across engineering, security, and business teams.
- Ability to prioritize reliability, scalability, and performance in production systems.
- Continuous improvement mindset, with passion for automation and efficiency.
Location
Tokai, Cape Town – role is based in office.
Seniority level
Mid‑Senior level
Employment type
Full‑time
Job function
Information Technology