Reliability Engineer, Global Reliability Intelligence Programs
Amazon
London, UK
Description
A Reliability Engineer focused on Root Cause Analysis (RCA) and Failure Modes and Effects Analysis (FMEA) hunts down the true causes of failures and eliminates them so the failures do not recur. They lead high-impact investigations, turn data into clear actions, and drive measurable improvements in uptime and performance. The role also gets ahead of problems by using FMEA to identify risks early and by building smarter, more reliable systems. If you enjoy solving complex problems, influencing decisions, and delivering real results at scale, this is where you do it.
This role may require up to 50% travel.
Key job responsibilities
• Lead Root Cause Analysis (RCA) for high-impact and recurring failures, driving deep-dive investigations to identify true root causes and ensure effective, lasting corrective actions
• Develop, maintain, and continuously improve Failure Modes and Effects Analyses (FMEAs) to proactively identify risks, prioritize mitigation, and prevent future failures
• Analyze equipment and operational data to identify trends, systemic issues, and performance gaps, translating findings into actionable reliability improvements
• Build and maintain BI dashboards, automated reports, and performance metrics (e.g., uptime, MTBF, failure rates) to enable data-driven decision-making
• Lead cross-functional execution of reliability improvements by partnering with operations, engineering, maintenance, and external vendors across multiple sites and regions
• Drive development and enhancement of RCA/FMEA tools and software by working closely with DevOps and technical teams, including requirements gathering, testing, and user feedback
• Establish and standardize reliability best practices, while supporting policy creation, training, and organizational adoption of RCA and FMEA methodologies
A day in the life
In this role, you will partner closely with DevOps teams to refine and improve tools and systems that support RCA and FMEA at scale. You will analyze failure trends to identify recurring issues, systemic gaps, and opportunities to improve reliability and performance. A key focus is supporting FMEA initiatives by helping teams proactively identify risks and implement effective mitigation strategies. You will also review high-impact events and completed RCAs to ensure quality, consistency, and actionable outcomes. In addition, you will collaborate with engineers, operators, and vendors across regions to align corrective actions and drive execution, strengthening how the organization learns from and prevents failures.


