Enhance evaluation functionality with support for multiple runs and dynamic config discovery #4495

jayy-77 wants to merge 1 commit into google:main

Conversation
Summary of Changes

Hello @jayy-77, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request enhances the evaluation framework by introducing the ability to run evaluation cases multiple times, which yields more statistically sound results. It also makes evaluation setup more flexible by allowing configuration files to be discovered dynamically alongside test data, reducing the need for an explicit global configuration. Finally, it broadens agent loading to support both synchronous and asynchronous agent initialization patterns.
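To make the "dynamic config discovery" idea concrete, here is a minimal standalone sketch, not the code from this PR: prefer an explicitly supplied config file, otherwise look for an eval config stored next to the eval set data. The helper name `_find_eval_config` and the `test_config.json` filename are illustrative assumptions.

```python
from pathlib import Path
from typing import Optional


def _find_eval_config(
    eval_set_path: str, explicit_config: Optional[str] = None
) -> Optional[str]:
  """Illustrative only: resolve which eval config to use for an eval set.

  Prefers an explicitly provided config file; otherwise looks for a config
  file (assumed here to be named "test_config.json") in the same directory
  as the eval set data.
  """
  if explicit_config:
    return explicit_config
  candidate = Path(eval_set_path).parent / "test_config.json"
  return str(candidate) if candidate.exists() else None
```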
Response from ADK Triaging Agent

Hello @jayy-77, thank you for your contribution! To help us review this pull request, could you please fill out the template in the PR description? Specifically, we need:

This information is important for our review process as outlined in our contribution guidelines. Thanks!
Code Review
This pull request enhances the evaluation functionality by adding support for multiple runs and dynamic configuration discovery. The changes are well-structured, but I've identified a critical bug in src/google/adk/cli/cli_tools_click.py that could cause an UnboundLocalError. I have provided a suggestion to fix this. Additionally, I've included a suggestion to optimize a loop for better performance when handling a large number of evaluation sets. The other changes, such as refactoring and adding new parameters, look good.
```python
eval_metrics_by_eval_set_id = {}
global_eval_metrics = None
if config_file_path:
  eval_config = get_evaluation_criteria_or_default(config_file_path)
  print(f"Using evaluation criteria: {eval_config}")
  global_eval_metrics = get_eval_metrics_from_config(eval_config)
```
There's a potential `UnboundLocalError` for `eval_config`. It is used on line 922, but it's only defined within this `if config_file_path:` block, or later when handling eval set IDs. If `config_file_path` is not provided and the code proceeds to handle eval set file paths, `eval_config` will not be defined when it's needed for `UserSimulatorProvider`.

To fix this, initialize `eval_config` unconditionally at the beginning of the function (a standalone sketch after the suggested change below illustrates the failure mode). Applying this fix also lets you simplify the code in a couple of other places:

- On lines 869-873, you can reuse the `eval_config` variable instead of calling `get_evaluation_criteria_or_default` again.
- On lines 900-903, you can remove the redundant call to `get_evaluation_criteria_or_default`.
Suggested change:

```diff
- eval_metrics_by_eval_set_id = {}
- global_eval_metrics = None
- if config_file_path:
-   eval_config = get_evaluation_criteria_or_default(config_file_path)
-   print(f"Using evaluation criteria: {eval_config}")
-   global_eval_metrics = get_eval_metrics_from_config(eval_config)
+ eval_config = get_evaluation_criteria_or_default(config_file_path)
+ eval_metrics_by_eval_set_id = {}
+ global_eval_metrics = None
+ if config_file_path:
+   print(f"Using evaluation criteria: {eval_config}")
+   global_eval_metrics = get_eval_metrics_from_config(eval_config)
```
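To illustrate why the unconditional initialization matters, here is a standalone sketch (the function names and placeholder values are made up for illustration, not taken from this PR): in Python, a name assigned anywhere in a function body is local to that function, so reading it when the assigning branch was skipped raises `UnboundLocalError`.

```python
def configure(config_file_path=None):
  # `eval_config` is assigned only inside the branch, so when
  # config_file_path is falsy the name is never bound and the final
  # `return` raises UnboundLocalError.
  if config_file_path:
    eval_config = {"criteria": "loaded from file"}
  return eval_config


def configure_fixed(config_file_path=None):
  # Binding the name unconditionally before the branch (the suggestion
  # above achieves this by hoisting the lookup out of the `if`) removes
  # the error.
  eval_config = {"criteria": "default"}
  if config_file_path:
    eval_config = {"criteria": "loaded from file"}
  return eval_config


configure_fixed()  # OK
# configure()      # would raise UnboundLocalError
```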
```python
eval_results = []
for eval_set_id, eval_metrics in eval_metrics_by_eval_set_id.items():
  inference_results_for_eval_set = [
      inference_result
      for inference_result in inference_results
      if inference_result.eval_set_id == eval_set_id
  ]
  if not inference_results_for_eval_set:
    continue
  eval_results.extend(
      asyncio.run(
          _collect_eval_results(
              inference_results=inference_results_for_eval_set,
              eval_service=eval_service,
              eval_metrics=eval_metrics,
          )
      )
  )
```
The current implementation filters inference_results inside the loop for each eval_set_id. This can be inefficient if you have a large number of eval sets and inference results, as it iterates over all results for each set (O(num_eval_sets * num_inference_results)).
You can improve performance by grouping the inference results by eval_set_id once before the loop.
Suggested change:

```diff
+ inference_results_by_eval_set_id = {}
+ for res in inference_results:
+   inference_results_by_eval_set_id.setdefault(res.eval_set_id, []).append(res)
  eval_results = []
  for eval_set_id, eval_metrics in eval_metrics_by_eval_set_id.items():
-   inference_results_for_eval_set = [
-       inference_result
-       for inference_result in inference_results
-       if inference_result.eval_set_id == eval_set_id
-   ]
+   inference_results_for_eval_set = inference_results_by_eval_set_id.get(eval_set_id)
    if not inference_results_for_eval_set:
      continue
    eval_results.extend(
        asyncio.run(
            _collect_eval_results(
                inference_results=inference_results_for_eval_set,
                eval_service=eval_service,
                eval_metrics=eval_metrics,
            )
        )
    )
```
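For reference, the same single-pass grouping can also be written with `collections.defaultdict`, which avoids the repeated `setdefault` calls. This is a self-contained sketch: the `InferenceResult` dataclass below is a simplified stand-in for illustration, not the ADK type.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class InferenceResult:  # simplified stand-in for illustration only
  eval_set_id: str
  payload: dict


def group_by_eval_set_id(inference_results):
  """Group inference results by eval_set_id in a single O(n) pass."""
  grouped = defaultdict(list)
  for result in inference_results:
    grouped[result.eval_set_id].append(result)
  return grouped


results = [
    InferenceResult("set_a", {"score": 0.9}),
    InferenceResult("set_b", {"score": 0.7}),
    InferenceResult("set_a", {"score": 0.8}),
]
print({k: len(v) for k, v in group_by_eval_set_id(results).items()})
# {'set_a': 2, 'set_b': 1}
```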
Please ensure you have read the contribution guide before creating a pull request.
Link to Issue or Description of Change
1. Link to an existing issue (if applicable):
2. Or, if no issue exists, describe the change:
If applicable, please follow the issue templates to provide as much detail as
possible.
Problem:
A clear and concise description of what the problem is.
Solution:
A clear and concise description of what you want to happen and why you choose
this solution.
Testing Plan
Please describe the tests that you ran to verify your changes. This is required
for all PRs that are not small documentation or typo fixes.
Unit Tests:
Please include a summary of passed `pytest` results.

Manual End-to-End (E2E) Tests:
Please provide instructions on how to manually test your changes, including any
necessary setup or configuration. Please provide logs or screenshots to help
reviewers better understand the fix.
Checklist
Additional context
Add any other context or screenshots about the feature request here.