Senior Product Manager - Tech, GenAI, Amazon Rufus

Amazon

Product

London, UK

Posted on May 9, 2026

Description

Amazon's Rufus AI team is building the future of conversational shopping. Rufus helps hundreds of millions of customers discover products through natural language, and behind every response is an automated quality measurement system powered by LLM-as-a-Judge (LLMAJ) technology. We are seeking a Senior Product Manager - Tech to own the quality governance, global scaling, and operational excellence of this judge portfolio.

You will work alongside Language Engineers who build and tune judges, Product Managers who define quality criteria and evaluation standards, Data Scientists who operate evaluation pipelines, and Engineering teams who build the infrastructure that runs evaluations. This is a high-autonomy role: you own your domain end-to-end and are expected to drive decisions, not just track workstreams.

This role sits at the intersection of AI evaluation, product management, and applied tooling. You will own the governance framework for a portfolio of dozens of LLM judges that power critical evaluation metrics used for release decisions, competitive benchmarking, and leadership reporting. You will drive the localization of judges from en-US to 5+ international marketplaces, facilitate model evaluation and debugging workflows, and build purpose-built tools and agents to automate governance operations at scale.
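
To make the governance framework concrete: the posting does not spell out what a judge registry entry contains, so purely as an illustrative sketch (every field name, threshold, and lifecycle state below is a hypothetical assumption, not Amazon's internal schema), a registry tracking versioning, locale coverage, and validation gates might look like this:

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class JudgeStatus(Enum):
    """Lifecycle states a deprecation policy might distinguish."""
    CANDIDATE = "candidate"    # built and tuned, not yet validated
    ACTIVE = "active"          # passed validation gates, used in production
    DEPRECATED = "deprecated"  # superseded by a newer version, slated for removal


@dataclass
class JudgeRegistryEntry:
    """One judge in the portfolio registry (illustrative fields only)."""
    judge_id: str                # e.g. "response-relevance"
    version: str                 # version of the prompt + model configuration
    evaluation_dimension: str    # what the judge measures, e.g. "factuality"
    locales: list[str]           # marketplaces the judge is validated for
    min_agreement_rate: float    # validation gate vs. human annotations
    status: JudgeStatus = JudgeStatus.CANDIDATE
    validated_on: date | None = None

    def passes_gate(self, observed_agreement: float) -> bool:
        """Quality validation gate: promote only if agreement clears the bar."""
        return observed_agreement >= self.min_agreement_rate
```

A promotion workflow would then call `passes_gate` with the agreement rate observed against human annotations before moving a judge from CANDIDATE to ACTIVE, and the locale list is what an international expansion audit would check for coverage gaps.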

Key job responsibilities
- Own the LLMAJ governance framework: judge registry, versioning standards, quality validation gates, deprecation policies, and agreement rate monitoring across the full judge portfolio
- Own the international LLMAJ expansion: drive judge localization from en-US to global marketplaces, identify coverage gaps, define remediation plans, and validate judge quality per locale
- Facilitate model evaluation and debugging: work with Language Engineers and Scientists to trace response quality issues, inspect production logs, and root-cause judge disagreements or quality regressions
- Build purpose-built tools and agents: write automation code on internal agent frameworks to streamline governance workflows, judge monitoring, data extraction, and reporting
- Define and own partner-facing quality metrics powered by LLMAJ, including defect rates, agreement rates, and evaluation dimension reporting across partner teams (a minimal sketch of these metrics follows this list)
- Drive human-in-the-loop validation workflows, coordinating between evaluation platforms and annotation teams to maintain judge calibration
- Drive discipline on evaluation requests by enforcing data-driven problem statements, clear scoping, and definition of done before work begins
- Write business requirements documents, contribute to leadership updates, and represent LLMAJ governance in cross-functional forums
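
The responsibilities above reference agreement rates and defect rates without defining them. Assuming the common conventions (agreement rate as simple label match between judge and human annotator; defect rate as the share of responses a judge flags as defective), a minimal, hypothetical sketch of these metrics and a drift check might be:

```python
def agreement_rate(judge_labels: list[str], human_labels: list[str]) -> float:
    """Fraction of items where the judge's label matches the human annotation."""
    if not judge_labels or len(judge_labels) != len(human_labels):
        raise ValueError("expected two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


def defect_rate(judge_labels: list[str], defect_label: str = "defect") -> float:
    """Fraction of evaluated responses the judge flags as defective."""
    if not judge_labels:
        raise ValueError("expected a non-empty label list")
    return sum(label == defect_label for label in judge_labels) / len(judge_labels)


def flag_drift(current: float, baseline: float, tolerance: float = 0.05) -> bool:
    """Drift check: agreement has fallen more than `tolerance` below baseline."""
    return (baseline - current) > tolerance
```

A real monitoring pipeline would compute these per judge and per locale against a calibrated human-annotated baseline, and alert when `flag_drift` fires; the function names and the 0.05 tolerance here are illustrative assumptions, not the team's actual thresholds.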

A day in the life
You start the morning checking agreement rate dashboards for drift across international locales and triaging alerts. A new prompt release is shipping, so you pull evaluation results, spot two judges regressing in the Japanese marketplace, and open a debugging session with a Language Engineer to trace the root cause. After lunch, you present international judge coverage in a cross-functional review. In the afternoon, you ship an update to a governance agent you built that auto-generates weekly judge health reports. You close the day pushing back on an under-scoped evaluation request.

About the team
We are the team responsible for measuring whether Amazon's AI shopping assistant is actually good. We build LLM judges, define quality standards, and run evaluations that directly inform what ships to hundreds of millions of customers. Our team includes Language Engineers, Data Scientists, and Product Managers who work closely with Science, Engineering, and Product teams across the organization. We move fast, care deeply about measurement rigor, and believe that if you cannot measure quality automatically, you cannot improve it at scale.