Amazon Bedrock Model Evaluation now includes LLM-as-a-judge (Preview)

Amazon Bedrock Model Evaluation allows you to evaluate, compare, and select the best foundation models for your use case. You can now use a new evaluation capability, LLM-as-a-judge, in Preview. With it, you choose an LLM to act as the judge, so you can pair the right evaluator model with the models being evaluated. You can choose from several judge LLMs available on Amazon Bedrock, and you can select curated quality metrics such as correctness, completeness, and professional style and tone, as well as responsible AI metrics such as harmfulness and answer refusal. You can also bring your own prompt dataset so the evaluation is customized to your data, and you can compare results across evaluation jobs to make decisions faster.
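
As a rough sketch of what a bring-your-own prompt dataset could look like, the snippet below writes a small JSON Lines file. The field names (prompt, referenceResponse, category) are assumptions based on the prompt-dataset format used by Bedrock model evaluation and should be verified against the current documentation; the records themselves are invented examples.

```python
import json

# Hypothetical example records. The keys (prompt, referenceResponse, category)
# are assumed from the Bedrock evaluation prompt-dataset format; verify them
# against the current Amazon Bedrock documentation before use.
records = [
    {
        "prompt": "Summarize the key benefits of serverless architectures.",
        "referenceResponse": "Serverless removes server management, scales automatically, and bills per use.",
        "category": "Summarization",
    },
    {
        "prompt": "Which AWS service provides managed foundation models through an API?",
        "referenceResponse": "Amazon Bedrock.",
        "category": "QA",
    },
]

# Write the dataset as JSON Lines, one record per line.
with open("my_prompt_dataset.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```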

Previously, you had a choice between human-based model evaluation and automatic evaluation with exact string matching and other traditional NLP metrics. Those automatic methods, while fast, correlated only weakly with human evaluators. Now, with LLM-as-a-judge, you can get human-like evaluation quality at a much lower cost than full human-based evaluations, while saving weeks of time. You can use built-in metrics to evaluate objective facts or to perform subjective evaluations of writing style and tone on your dataset.

To learn more about Amazon Bedrock Model Evaluation’s new LLM-as-a-judge capability, including the available AWS Regions, read the AWS News Blog and visit the Amazon Bedrock Evaluations page. To get started, sign in to the AWS Management Console or use the Amazon Bedrock APIs.
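
For readers who start from the API, here is a minimal sketch of creating an LLM-as-a-judge evaluation job with boto3's create_evaluation_job call. The nested configuration keys, metric names, model identifiers, ARNs, and S3 URIs shown are assumptions or placeholders meant to illustrate the request shape; check them against the Amazon Bedrock API reference before running anything.

```python
import boto3

# Region is illustrative; LLM-as-a-judge is available only in certain AWS Regions.
bedrock = boto3.client("bedrock", region_name="us-east-1")

# All identifiers below (role ARN, S3 URIs, model IDs, metric names) are placeholders,
# and the nested evaluationConfig layout is an assumption to verify against the
# CreateEvaluationJob API reference.
response = bedrock.create_evaluation_job(
    jobName="llm-judge-demo",
    roleArn="arn:aws:iam::111122223333:role/BedrockEvalRole",  # hypothetical role
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [
                {
                    "taskType": "General",
                    "dataset": {
                        "name": "my_prompt_dataset",
                        "datasetLocation": {
                            "s3Uri": "s3://my-bucket/my_prompt_dataset.jsonl"
                        },
                    },
                    # Curated judge metrics such as correctness and completeness
                    "metricNames": ["Builtin.Correctness", "Builtin.Completeness"],
                }
            ],
            # The judge (evaluator) model that scores the generated responses
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [
                    {"modelIdentifier": "anthropic.claude-3-5-sonnet-20240620-v1:0"}
                ]
            },
        }
    },
    # The model whose responses are being evaluated
    inferenceConfig={
        "models": [
            {
                "bedrockModel": {
                    "modelIdentifier": "amazon.titan-text-premier-v1:0",
                    "inferenceParams": "{\"temperature\": 0.0}",  # JSON string of inference parameters
                }
            }
        ]
    },
    outputDataConfig={"s3Uri": "s3://my-bucket/eval-results/"},
)

print(response["jobArn"])  # ARN of the newly created evaluation job
```

Once the job completes, the judge model's per-metric scores land in the S3 output location, and the same results can be inspected and compared across jobs in the Amazon Bedrock console.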

Source: Amazon AWS