Improve evaluation harness: concurrent execution, robust error handling, and CLI model configuration #22
base: main
@jsham042 please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.
Contributor License Agreement
Contribution License Agreement
This Contribution License Agreement (“Agreement”) is agreed to by the party signing below (“You”),
Summary
Improves the RCA-Agent evaluation harness with concurrent execution, robust error handling, CLI-driven model
configuration, and run resumability.
Changes
Evaluation Runner (`run_agent_standard.py`)
- `--workers N` flag for parallel query evaluation via `multiprocessing.Pool`
- `--source`, `--model`, `--api_key`, `--api_base`, and `--profile` flags so model configuration no longer requires editing YAML files
- `--profile` flag to load named configurations from `api_profiles.yaml` (e.g., `--profile anthropic-sonnet`)

API Router (`api_router.py`)
- Switched from `.create()` to `messages.stream()` for Anthropic calls
- System prompt is sent through the `system` parameter instead of being passed as a message
- Migrated from `google.generativeai` to the `google.genai` client with the `types.Content`/`types.Part` API
- Error handling distinguishes rate-limit vs quota-exhaustion vs connection errors

Controller (`controller.py`)
- Replaced `` re.search(r"```json\n(.*)\n```") `` with flexible fallback patterns that handle varied whitespace and formatting from different models
- Handles `None` API responses with graceful recovery instead of crashes

Configuration
- Added `api_profiles.example.yaml` with profile templates for Anthropic, OpenAI, and Google models
- Excluded `api_config.yaml`, `api_profiles.yaml`, and `submission/` from version control

Security
- Only `.example` templates with placeholder values are committed

Testing
Changes were validated by running the full OpenRCA benchmark across multiple model providers.
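
For reviewers who want a concrete picture of the Controller change, here is a minimal sketch of what "flexible fallback patterns" plus `None` handling could look like. It is not the code in `controller.py`: the helper name `extract_json`, the `FENCE` constant, and the three patterns are assumptions chosen for illustration.

```python
import json
import re

# Three backticks, built programmatically so this example can sit
# inside a fenced block without closing it.
FENCE = "`" * 3

# Hypothetical fallback patterns, tried from strictest to loosest.
_PATTERNS = [
    # A fenced block tagged json, any case, any surrounding whitespace.
    re.compile(FENCE + r"json\s*(.*?)\s*" + FENCE, re.DOTALL | re.IGNORECASE),
    # A fenced block with no language tag.
    re.compile(FENCE + r"\s*(.*?)\s*" + FENCE, re.DOTALL),
    # A bare JSON object somewhere in the text.
    re.compile(r"(\{.*\})", re.DOTALL),
]


def extract_json(response):
    """Return the first parseable JSON value found in response, or None."""
    if response is None:
        # The API call returned nothing; let the caller retry or skip.
        return None
    for pattern in _PATTERNS:
        match = pattern.search(response)
        if match is None:
            continue
        try:
            return json.loads(match.group(1))
        except json.JSONDecodeError:
            # This pattern matched text that is not valid JSON; try the next.
            continue
    return None
```

Returning `None` instead of raising keeps the decision about retrying or skipping in the caller, which is one plausible reading of "graceful recovery instead of crashes".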
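
The Evaluation Runner flags listed above could be wired up roughly as follows. This is a sketch under stated assumptions, not the contents of `run_agent_standard.py`: the flag names come from the PR description, while the help text, the defaults, and the `build_parser`, `evaluate_query`, `run_all`, and `main` helpers are hypothetical.

```python
import argparse
from multiprocessing import Pool


def build_parser():
    """Build a parser with the flags named in the PR description."""
    parser = argparse.ArgumentParser(description="Run the evaluation (sketch).")
    # Defaults and help strings below are assumptions for illustration.
    parser.add_argument("--workers", type=int, default=1,
                        help="number of parallel worker processes")
    parser.add_argument("--source", help="model provider")
    parser.add_argument("--model", help="model name")
    parser.add_argument("--api_key", help="API key")
    parser.add_argument("--api_base", help="API base URL")
    parser.add_argument("--profile", help="named entry in api_profiles.yaml")
    return parser


def evaluate_query(query):
    """Stand-in for one agent run; errors are captured per query."""
    try:
        answer = query.upper()  # placeholder for the real agent call
        return {"query": query, "ok": True, "answer": answer}
    except Exception as exc:
        # Returning the error keeps one bad query from aborting the batch.
        return {"query": query, "ok": False, "error": repr(exc)}


def run_all(queries, workers):
    """Evaluate queries serially, or in a process pool when workers > 1."""
    if workers <= 1:
        return [evaluate_query(q) for q in queries]
    with Pool(processes=workers) as pool:
        return pool.map(evaluate_query, queries)


def main(argv=None):
    args = build_parser().parse_args(argv)
    for result in run_all(["query one", "query two"], args.workers):
        print(result)
```

Calling `main()` from an `if __name__ == "__main__":` guard would connect the sketch to the command line. The guard matters on platforms where `multiprocessing` spawns fresh interpreters, since each worker re-imports the module.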