Site Reliability Engineering

Reach Digital Health
Work from home, Western Cape, South Africa
2 days ago
Job description

About Reach Digital Health

Reach Digital Health is transforming how public healthcare is delivered. Using innovative digital tools, we connect people, especially those who cannot easily access traditional care, to the information and support they need to live healthier lives. From maternal and child health to HIV/AIDS support and immunisation, our work helps close critical gaps in healthcare and ensures that underserved communities are not left behind.

With more than 16 years of experience, we know that technology alone is not enough. Real impact comes from combining our scalable, multi-channel technology with the partnerships, systems and expertise needed to drive meaningful change.

By joining Reach, you will be part of a mission-driven team tackling some of the world's toughest health challenges, making healthcare more inclusive and helping save lives every day. At Reach, you will do work that matters while enjoying the balance you deserve.

Why Work With Us

Our team is guided by our values: grit, empathy, collaboration, simplicity, and curiosity.

We are proud to be one of the first South African companies to embrace a four-day work week, giving our team more time for life outside of work.

Alongside competitive salaries, we invest in your growth through ongoing training and career development, creating opportunities to thrive in a supportive and innovative environment.

We put people at the centre of everything we do - both internally and in our work.

We are creating an inclusive, diverse environment where everyone feels welcome, accepted, and supported.

We are a progressive and equal-opportunity employer.

About the role

Join Reach as our Site Reliability Engineering Lead and play a central role in designing and maintaining the secure infrastructure that powers vital health services. You'll lead the SRE team, automate processes, and improve system reliability while ensuring adherence to data privacy regulations and security best practices, all while working on projects that directly impact communities in need. Your ideas and innovations will have real-world effects on healthcare access and outcomes. The role requires advanced infrastructure engineering and security expertise with a passion for healthcare technology and data compliance.

Key Focus Areas

  • Team Management and Growth: Foster the professional development of the SRE team through mentorship, one-on-one sessions, and skill-building opportunities.
  • Collaboration: Work closely with cross-functional teams, including development and operations, to implement best practices and foster a culture of collaboration and innovation.
  • Infrastructure reliability and performance: Monitor, measure, and improve the reliability and performance of our systems. Identify and address bottlenecks, optimise system performance, and implement strategies for scaling infrastructure to meet growing demands.
  • Maintenance, upgrades, and security updates.
  • Automation and tooling: Design and develop software and scripts that automate and streamline various aspects of infrastructure and operations. Assist other teams with deployment and updates of their applications and services.
  • Administration: Administer our infrastructure accounts and critical services, and provide strategic oversight for our hosting infrastructure and vendor relationships. Own the hosting and billing lifecycle, from monitoring and analysis to implementing cost-optimisation strategies, ensuring financial efficiency and predictability across our platforms.
  • Data Management and Security: Lead Information Security Management System (ISMS) compliance initiatives, including policy development, risk assessment processes, and security framework implementation. Manage security tools (antivirus, password management, security awareness training) and ensure that data, security, and infrastructure policies and best practices are adhered to. Work with Legal and Projects teams to develop and enforce policies and procedures for data collection, storage, and access that comply with data privacy regulations. Implement and monitor security measures to protect sensitive health information, and manage data backups and disaster recovery.
  • Innovation: Research and evaluate new technologies and methodologies that can enhance our systems and processes, and implement proofs of concept and prototypes to demonstrate their feasibility and value.

Responsibilities and Duties

  • Lead a team of Site Reliability Engineers, providing mentorship, guidance, and technical expertise.
  • Establish and enforce SRE best practices to improve system reliability and operational efficiency.
  • Collaborate with development teams to design, implement, and maintain scalable and reliable infrastructure.
  • Develop and implement incident response plans, ensuring timely resolution of system outages and performance issues.
  • Conduct performance reviews, set goals, and facilitate professional development for team members.
  • Drive the implementation of automation tools, software and processes to improve infrastructure and operational efficiency of our systems and ensure they follow best practices.
  • Monitor system health, analyze trends, and implement proactive measures to prevent incidents.
  • Advise on and contribute to new or emerging technologies that might be relevant to Reach.
  • Work closely with the Head of Engineering and other Engineering Leads to ensure alignment within the engineering department.
  • Design and develop tools and software that automate and improve the infrastructure and operation of our systems and ensure they follow best practices.
  • Perform code reviews, testing, debugging, and troubleshooting of the software and tools developed by the SRE team, and assist other engineering teams with the same.
  • Design and implement security features, conduct security audits and risk assessments, manage enterprise security tools, and coordinate penetration testing exercises while serving as technical point of contact for external security audits.
  • Develop and enforce Information Security Management System (ISMS) compliance policies aligned with POPIA and ISO.
  • Lead security awareness programs, manage security training and phishing campaigns, and collaborate with teams to ensure alignment between technical and regulatory requirements across organisational systems.
  • Suggest and implement improvements to current ways of working / processes (or gaps in the processes) that are relevant to the current and future success of the SRE team and Reach as a whole.

Qualifications

  • An honours degree in Computer Science or Engineering or equivalent experience.
  • 8+ years of experience as a senior site reliability engineer, senior software engineer, or system administrator, working with large-scale, distributed, and cloud-based systems.
  • 4+ years of experience as a team lead, manager, or mentor, leading and developing site reliability engineers or software engineers.

Skills and Experience Required

  • Proficient in one or more programming languages, such as Python, Go, Java, or C++.
  • Proficient in one or more scripting languages, such as Bash, Perl, or Ruby.
  • Proficient in one or more cloud platforms, such as AWS, Azure, or GCP.
  • Proficient in one or more UNIX-like operating systems.
  • Proficient in one or more configuration management and deployment tools, such as Ansible, Chef, Puppet, or Terraform.
  • Proficient in one or more monitoring and alerting tools, such as Prometheus, Grafana, Datadog, or Splunk.
  • Proficient in one or more container and orchestration tools, such as Docker or Kubernetes.
  • Proficient in one or more web servers and proxies, such as Apache, Nginx, or Envoy.
  • Proficient in one or more databases and data stores, such as MySQL, PostgreSQL, MongoDB, or Redis.
  • Proficient in one or more version control and collaboration tools, such as Git.
  • Knowledgeable in the concepts and principles of site reliability engineering, such as SLIs, SLOs, error budgets, incident management, postmortems, and blameless culture.
  • Knowledgeable in the concepts and principles of software engineering, such as design patterns, code quality, testing, debugging, and documentation.
  • Knowledgeable in the concepts and principles of performance engineering, such as profiling, benchmarking, load testing, and capacity planning.
  • Knowledgeable in the concepts and principles of distributed computing, such as concurrency, parallelism, synchronisation, and consensus.
  • Excellent communication and collaboration skills, and ability to work effectively in a cross-functional and remote team environment.
  • Excellent problem-solving and analytical skills, and ability to troubleshoot and resolve complex issues in a timely and efficient manner.
  • Excellent learning and innovation skills, and ability to research and evaluate new technologies and methodologies.
  • Experience implementing ISO or POPIA standards with expertise in security audits, policy development, and regulatory compliance.
  • Proficiency managing enterprise security tools (antivirus, password management, SIEM), penetration testing oversight, and incident response procedures.
  • Experience leading security awareness programs, developing security frameworks, and implementing organisation-wide security policies and training initiatives.

How to Apply

Ready to make a difference in public health?

We welcome applicants from all backgrounds and encourage candidates of all genders, races, ages, religions, sexual orientations, and abilities to apply.

Reach Digital Health is an equal opportunity and affirmative action employer, committed to creating a diverse and inclusive workplace. Submit your application today and join our mission-driven team to make a real impact in public health.
