Senior Product Manager - Tech, GenAI, Amazon Rufus

Amazon

Product

London, UK

Posted on May 9, 2026

Description

Amazon's Rufus AI team is building the future of conversational shopping. Rufus helps hundreds of millions of customers discover products through natural language, and behind every response is an automated quality measurement system powered by LLM-as-a-Judge (LLMAJ) technology. We are seeking a Senior Product Manager - Tech to own the quality governance, global scaling, and operational excellence of this judge portfolio.

You will work alongside Language Engineers who build and tune judges, Product Managers who define quality criteria and evaluation standards, Data Scientists who operate evaluation pipelines, and Engineering teams who build the infrastructure that runs evaluations. This is a high-autonomy role: you own your domain end-to-end and are expected to drive decisions, not just track workstreams.

This role sits at the intersection of AI evaluation, product management, and applied tooling. You will own the governance framework for a portfolio of dozens of LLM judges that power critical evaluation metrics used for release decisions, competitive benchmarking, and leadership reporting. You will drive the localization of judges from en-US to 5+ international marketplaces, facilitate model evaluation and debugging workflows, and build purpose-built tools and agents to automate governance operations at scale.
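
To make the governance framework concrete: the posting does not spell out what a judge registry entry contains, so purely as an illustrative sketch (every field name, threshold, and lifecycle state below is a hypothetical assumption, not Amazon's internal schema), a registry tracking versioning, locale coverage, and validation gates might look like this:

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class JudgeStatus(Enum):
    """Lifecycle states a deprecation policy might distinguish."""
    CANDIDATE = "candidate"    # built and tuned, not yet validated
    ACTIVE = "active"          # passed validation gates, used in production
    DEPRECATED = "deprecated"  # superseded by a newer version, slated for removal


@dataclass
class JudgeRegistryEntry:
    """One judge in the portfolio registry (illustrative fields only)."""
    judge_id: str                # e.g. "response-relevance"
    version: str                 # version of the prompt + model configuration
    evaluation_dimension: str    # what the judge measures, e.g. "factuality"
    locales: list[str]           # marketplaces the judge is validated for
    min_agreement_rate: float    # validation gate vs. human annotations
    status: JudgeStatus = JudgeStatus.CANDIDATE
    validated_on: date | None = None

    def passes_gate(self, observed_agreement: float) -> bool:
        """Quality validation gate: promote only if agreement clears the bar."""
        return observed_agreement >= self.min_agreement_rate
```

A promotion workflow would then call `passes_gate` with the agreement rate observed against human annotations before moving a judge from CANDIDATE to ACTIVE, and the locale list is what an international expansion audit would check for coverage gaps.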

Key job responsibilities
- Own the LLMAJ governance framework: judge registry, versioning standards, quality validation gates, deprecation policies, and agreement rate monitoring across the full judge portfolio
- Own the international LLMAJ expansion: drive judge localization from en-US to global marketplaces, identify coverage gaps, define remediation plans, and validate judge quality per locale
- Facilitate model evaluation and debugging: work with Language Engineers and Scientists to trace response quality issues, inspect production logs, and root-cause judge disagreements or quality regressions
- Build purpose-built tools and agents: write automation code on internal agent frameworks to streamline governance workflows, judge monitoring, data extraction, and reporting
- Define and own partner-facing quality metrics powered by LLMAJ, including defect rates, agreement rates, and evaluation dimension reporting across partner teams (a minimal sketch of these metrics follows this list)
- Drive human-in-the-loop validation workflows, coordinating between evaluation platforms and annotation teams to maintain judge calibration
- Drive discipline on evaluation requests by enforcing data-driven problem statements, clear scoping, and definition of done before work begins
- Write business requirements documents, contribute to leadership updates, and represent LLMAJ governance in cross-functional forums
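
The responsibilities above reference agreement rates and defect rates without defining them. Assuming the common conventions (agreement rate as simple label match between judge and human annotator; defect rate as the share of responses a judge flags as defective), a minimal, hypothetical sketch of these metrics and a drift check might be:

```python
def agreement_rate(judge_labels: list[str], human_labels: list[str]) -> float:
    """Fraction of items where the judge's label matches the human annotation."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("expected two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def defect_rate(judge_labels: list[str], defect_label: str = "defect") -> float:
    """Fraction of evaluated responses the judge flags as defective."""
    if not judge_labels:
        raise ValueError("expected a non-empty label list")
    return sum(label == defect_label for label in judge_labels) / len(judge_labels)


def flag_drift(current: float, baseline: float, tolerance: float = 0.05) -> bool:
    """Drift check: agreement has fallen more than `tolerance` below baseline."""
    return (baseline - current) > tolerance
```

A real monitoring pipeline would compute these per judge and per locale against a calibrated human-annotated baseline, and alert when `flag_drift` fires; the function names and the 0.05 tolerance here are illustrative assumptions, not the team's actual thresholds.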

A day in the life
You start the morning checking agreement rate dashboards for drift across international locales and triaging alerts. A new prompt release is shipping, so you pull evaluation results, spot two judges regressing in the Japanese marketplace, and open a debugging session with a Language Engineer to trace the root cause. After lunch, you present international judge coverage in a cross-functional review. In the afternoon, you ship an update to a governance agent you built that auto-generates weekly judge health reports. You close the day pushing back on an under-scoped evaluation request.

About the team
We are the team responsible for measuring whether Amazon's AI shopping assistant is actually good. We build LLM judges, define quality standards, and run evaluations that directly inform what ships to hundreds of millions of customers. Our team includes Language Engineers, Data Scientists, and Product Managers who work closely with Science, Engineering, and Product teams across the organization. We move fast, care deeply about measurement rigor, and believe that if you cannot measure quality automatically, you cannot improve it at scale.