@anirudha commented Feb 10, 2026

Evaluation frameworks like strands_evals need a way to export evaluation results as OpenTelemetry events so they can be visualized in any OTel-compatible backend (Datadog, Jaeger, Honeycomb, etc.). Currently there is no standard way to emit gen_ai.evaluation.result events on spans from within the SDK.

Semantic convention proposal: open-telemetry/semantic-conventions#3398

See also: #1633

This PR adds a lightweight evaluation telemetry API to strands.telemetry that follows the proposed gen_ai.evaluation.result OTel semantic convention. The API is opt-in — no telemetry is emitted unless the developer explicitly calls these functions.

Public API Changes

New exports from strands.telemetry:

from strands.telemetry import (
    EvaluationResult,
    EvaluationEventEmitter,
    add_evaluation_event,
    set_test_suite_context,
    set_test_case_context,
)

# Emit an evaluation event on a span
add_evaluation_event(
    span,
    name="accuracy",
    score_value=0.95,
    score_label="pass",
    explanation="High accuracy on test set",
)

# Or use a pre-built result object
result = EvaluationResult(name="tone", score_value=0.88)
EvaluationEventEmitter.emit(span, result)

# Organize evaluations into test suites and cases
set_test_suite_context(span, run_id="run_001", name="Eval Suite", status="in_progress")
set_test_case_context(span, case_id="case_01", name="Greeting Test", status="pass")

None-valued fields are omitted from the emitted OTel attributes, and passing a None or non-recording span makes the call a silent no-op.
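
For illustration, a minimal sketch of that filtering behavior, assuming a simplified stand-in for EvaluationResult (this is not the SDK source; the attribute keys match the gen_ai.evaluation.* names shown in the trace output below):

from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class EvaluationResultSketch:  # illustrative stand-in, not the SDK class
    name: str
    score_value: Optional[float] = None
    score_label: Optional[str] = None
    explanation: Optional[str] = None

    def to_otel_attributes(self) -> dict[str, Any]:
        attrs = {
            "gen_ai.evaluation.name": self.name,
            "gen_ai.evaluation.score.value": self.score_value,
            "gen_ai.evaluation.score.label": self.score_label,
            "gen_ai.evaluation.explanation": self.explanation,
        }
        # None-valued fields are dropped so backends never see empty attributes
        return {k: v for k, v in attrs.items() if v is not None}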

Use Cases

  • Evaluation export: Automatically send evaluation results to any OTel backend when tracing is enabled
  • Test suite organization: Group evaluation spans into test suite runs with status tracking
  • Correlation: Link evaluation results back to specific model completions via response_id (see the sketch below)
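
A hedged sketch of the correlation pattern, assuming EvaluationResult accepts a response_id field (the prose above names response_id, but the exact parameter name and its attribute mapping are assumptions):

from opentelemetry import trace
from strands.telemetry import EvaluationResult, EvaluationEventEmitter

span = trace.get_current_span()  # the span of the completion being evaluated
result = EvaluationResult(
    name="faithfulness",
    score_value=0.9,
    response_id="chatcmpl-abc123",  # assumed field; ties the result to a model completion
)
EvaluationEventEmitter.emit(span, result)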

Related Issues

N/A — new feature

Documentation PR

N/A — docs update will follow separately

Type of Change

New feature

Testing

31 tests (25 unit + 6 property-based with Hypothesis):

  • EvaluationResult dataclass construction and to_otel_attributes() mapping
  • EvaluationEventEmitter.emit() span interaction
  • add_evaluation_event() convenience function equivalence
  • set_test_suite_context() / set_test_case_context() attribute correctness
  • Edge cases: None span, non-recording span, missing name ValueError (see the sketch after this list)
  • Public API export verification
  • I ran hatch run prepare
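
A sketch of how the edge-case tests could look, assuming the no-op and ValueError behavior described above (the actual tests live in this PR; the name-validated-before-span-check ordering is an assumption):

import pytest
from opentelemetry.trace import NonRecordingSpan, SpanContext
from strands.telemetry import add_evaluation_event

def test_none_span_is_a_silent_noop():
    # The PR states a None span is skipped rather than raising
    add_evaluation_event(None, name="accuracy", score_value=1.0)

def test_non_recording_span_is_skipped():
    # A NonRecordingSpan reports is_recording() == False, so nothing is emitted
    span = NonRecordingSpan(SpanContext(trace_id=1, span_id=1, is_remote=False))
    add_evaluation_event(span, name="accuracy", score_value=1.0)

def test_missing_name_raises_value_error():
    # Assumption: the name is validated before the span check
    with pytest.raises(ValueError):
        add_evaluation_event(None, name="")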

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

@anirudha (Author):

Attachment: example to test.zip

Run the example against a local OTLP endpoint:

OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 hatch run python tmp/example_eval_trace.py 2>&1
Sending traces to http://localhost:4318

Start the bundled trace receiver with:

npm install && node server.js
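
For reference, a hypothetical reconstruction of what a script like tmp/example_eval_trace.py could contain — the real script is in the attached zip. The OTel wiring below is standard opentelemetry-sdk usage, and the span and attribute names mirror the trace dump in the next comment:

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

from strands.telemetry import (
    add_evaluation_event,
    set_test_case_context,
    set_test_suite_context,
)

# Standard OTLP/HTTP setup; the exporter honors OTEL_EXPORTER_OTLP_ENDPOINT
provider = TracerProvider(resource=Resource.create({"service.name": "strands-agents"}))
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("strands_evals")

with tracer.start_as_current_span("test_suite_run") as suite:
    set_test_suite_context(suite, run_id="run_001", name="Evaluation Suite", status="in_progress")
    with tracer.start_as_current_span("invoke_agent") as case:
        case.set_attribute("gen_ai.operation.name", "chat")
        set_test_case_context(case, case_id="c1", name="greeting", status="pass")
        add_evaluation_event(
            case,
            name="accuracy",
            score_value=1.0,
            score_label="pass",
            explanation="Rubric: Response contains expected content",
        )
    set_test_suite_context(suite, run_id="run_001", name="Evaluation Suite", status="success")

provider.shutdown()  # flush the batch processor before exit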

@anirudha (Author):

Waiting for trace data...

================================================================================
RECEIVED OTLP TRACE EXPORT

Resource: service.name=strands-agents
Scope: strands_evals

Span: invoke_agent
  span_id:  025b089308c02d68
  parent:   51f45fae131af50a
  attributes:
    test.case.id: "c1"
    test.case.name: "greeting"
    gen_ai.operation.name: "chat"
    test.case.result.status: "pass"
  events:
    [gen_ai.evaluation.result]
      gen_ai.evaluation.name: "accuracy"
      gen_ai.evaluation.score.value: 1
      gen_ai.evaluation.score.label: "pass"
      gen_ai.evaluation.explanation: "Rubric: Response contains expected content"

Span: invoke_agent
  span_id:  eab6ac57c6972a20
  parent:   51f45fae131af50a
  attributes:
    test.case.id: "c2"
    test.case.name: "summarization"
    gen_ai.operation.name: "chat"
    test.case.result.status: "pass"
  events:
    [gen_ai.evaluation.result]
      gen_ai.evaluation.name: "accuracy"
      gen_ai.evaluation.score.value: 1
      gen_ai.evaluation.score.label: "pass"
      gen_ai.evaluation.explanation: "Rubric: Response contains expected content"

Span: test_suite_run
  span_id:  51f45fae131af50a
  attributes:
    test.suite.run.id: "run_e52b34ed520c"
    test.suite.name: "Evaluation Suite"
    test.suite.run.status: "success"

────────────────────────────────────────────────────────────────────────────────
spans: 3 | eval events: 2
