@anirudha commented Feb 10, 2026

Evaluation frameworks like strands_evals need a way to export evaluation results as OpenTelemetry events so they can be visualized in any OTel-compatible backend (Datadog, Jaeger, Honeycomb, etc.). Currently there is no standard way to emit gen_ai.evaluation.result events on spans from within the SDK.

Semantic convention proposal: open-telemetry/semantic-conventions#3398

See also: #1633

This PR adds a lightweight evaluation telemetry API to strands.telemetry that follows the proposed gen_ai.evaluation.result OTel semantic convention. The API is opt-in — no telemetry is emitted unless the developer explicitly calls these functions.

Public API Changes

New exports from strands.telemetry:

from strands.telemetry import (
    EvaluationResult,
    EvaluationEventEmitter,
    add_evaluation_event,
    set_test_suite_context,
    set_test_case_context,
)

# Emit an evaluation event on a span
add_evaluation_event(
    span,
    name="accuracy",
    score_value=0.95,
    score_label="pass",
    explanation="High accuracy on test set",
)

# Or use a pre-built result object
result = EvaluationResult(name="tone", score_value=0.88)
EvaluationEventEmitter.emit(span, result)

# Organize evaluations into test suites and cases
set_test_suite_context(span, run_id="run_001", name="Eval Suite", status="in_progress")
set_test_case_context(span, case_id="case_01", name="Greeting Test", status="pass")

None-valued fields are omitted from the emitted OTel attributes, and passing a None or non-recording span makes the call a silent no-op.
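
For illustration, a minimal sketch of that filtering behavior, assuming a simplified stand-in for EvaluationResult (this is not the SDK source; the attribute keys match the gen_ai.evaluation.* names shown in the trace output below):

from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class EvaluationResultSketch:  # illustrative stand-in, not the SDK class
    name: str
    score_value: Optional[float] = None
    score_label: Optional[str] = None
    explanation: Optional[str] = None

    def to_otel_attributes(self) -> dict[str, Any]:
        attrs = {
            "gen_ai.evaluation.name": self.name,
            "gen_ai.evaluation.score.value": self.score_value,
            "gen_ai.evaluation.score.label": self.score_label,
            "gen_ai.evaluation.explanation": self.explanation,
        }
        # None-valued fields are dropped so backends never see empty attributes
        return {k: v for k, v in attrs.items() if v is not None}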

Use Cases

  • Evaluation export: Automatically send evaluation results to any OTel backend when tracing is enabled
  • Test suite organization: Group evaluation spans into test suite runs with status tracking
  • Correlation: Link evaluation results back to specific model completions via response_id (see the sketch below)
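
A hedged sketch of the correlation pattern, assuming EvaluationResult accepts a response_id field (the prose above names response_id, but the exact parameter name and its attribute mapping are assumptions):

from opentelemetry import trace
from strands.telemetry import EvaluationResult, EvaluationEventEmitter

span = trace.get_current_span()  # the span of the completion being evaluated
result = EvaluationResult(
    name="faithfulness",
    score_value=0.9,
    response_id="chatcmpl-abc123",  # assumed field; ties the result to a model completion
)
EvaluationEventEmitter.emit(span, result)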

Related Issues

N/A — new feature

Documentation PR

N/A — docs update will follow separately

Type of Change

New feature

Testing

31 tests (25 unit + 6 property-based with Hypothesis):

  • EvaluationResult dataclass construction and to_otel_attributes() mapping
  • EvaluationEventEmitter.emit() span interaction
  • add_evaluation_event() convenience function equivalence
  • set_test_suite_context() / set_test_case_context() attribute correctness
  • Edge cases: None span, non-recording span, missing name ValueError (see the sketch after this list)
  • Public API export verification
  • I ran hatch run prepare
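
A sketch of how the edge-case tests could look, assuming the no-op and ValueError behavior described above (the actual tests live in this PR; the name-validated-before-span-check ordering is an assumption):

import pytest
from opentelemetry.trace import NonRecordingSpan, SpanContext
from strands.telemetry import add_evaluation_event

def test_none_span_is_a_silent_noop():
    # The PR states a None span is skipped rather than raising
    add_evaluation_event(None, name="accuracy", score_value=1.0)

def test_non_recording_span_is_skipped():
    # A NonRecordingSpan reports is_recording() == False, so nothing is emitted
    span = NonRecordingSpan(SpanContext(trace_id=1, span_id=1, is_remote=False))
    add_evaluation_event(span, name="accuracy", score_value=1.0)

def test_missing_name_raises_value_error():
    # Assumption: the name is validated before the span check
    with pytest.raises(ValueError):
        add_evaluation_event(None, name="")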

Checklist

  • I have read the CONTRIBUTING document
  • I have added any necessary tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature, or no new docs are needed
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

@anirudha (Author):

Attachment: example to test.zip

Run the example against a local OTLP endpoint:

OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 hatch run python tmp/example_eval_trace.py 2>&1
Sending traces to http://localhost:4318

Start the bundled trace receiver with:

npm install && node server.js
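
For reference, a hypothetical reconstruction of what a script like tmp/example_eval_trace.py could contain — the real script is in the attached zip. The OTel wiring below is standard opentelemetry-sdk usage, and the span and attribute names mirror the trace dump in the next comment:

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

from strands.telemetry import (
    add_evaluation_event,
    set_test_case_context,
    set_test_suite_context,
)

# Standard OTLP/HTTP setup; the exporter honors OTEL_EXPORTER_OTLP_ENDPOINT
provider = TracerProvider(resource=Resource.create({"service.name": "strands-agents"}))
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("strands_evals")

with tracer.start_as_current_span("test_suite_run") as suite:
    set_test_suite_context(suite, run_id="run_001", name="Evaluation Suite", status="in_progress")
    with tracer.start_as_current_span("invoke_agent") as case:
        case.set_attribute("gen_ai.operation.name", "chat")
        set_test_case_context(case, case_id="c1", name="greeting", status="pass")
        add_evaluation_event(
            case,
            name="accuracy",
            score_value=1.0,
            score_label="pass",
            explanation="Rubric: Response contains expected content",
        )
    set_test_suite_context(suite, run_id="run_001", name="Evaluation Suite", status="success")

provider.shutdown()  # flush the batch processor before exit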

@anirudha (Author):

Waiting for trace data...

================================================================================
RECEIVED OTLP TRACE EXPORT

Resource: service.name=strands-agents
Scope: strands_evals

Span: invoke_agent
  span_id:  025b089308c02d68
  parent:   51f45fae131af50a
  attributes:
    test.case.id: "c1"
    test.case.name: "greeting"
    gen_ai.operation.name: "chat"
    test.case.result.status: "pass"
  events:
    [gen_ai.evaluation.result]
      gen_ai.evaluation.name: "accuracy"
      gen_ai.evaluation.score.value: 1
      gen_ai.evaluation.score.label: "pass"
      gen_ai.evaluation.explanation: "Rubric: Response contains expected content"

Span: invoke_agent
  span_id:  eab6ac57c6972a20
  parent:   51f45fae131af50a
  attributes:
    test.case.id: "c2"
    test.case.name: "summarization"
    gen_ai.operation.name: "chat"
    test.case.result.status: "pass"
  events:
    [gen_ai.evaluation.result]
      gen_ai.evaluation.name: "accuracy"
      gen_ai.evaluation.score.value: 1
      gen_ai.evaluation.score.label: "pass"
      gen_ai.evaluation.explanation: "Rubric: Response contains expected content"

Span: test_suite_run
  span_id:  51f45fae131af50a
  attributes:
    test.suite.run.id: "run_e52b34ed520c"
    test.suite.name: "Evaluation Suite"
    test.suite.run.status: "success"

────────────────────────────────────────────────────────────────────────────────
spans: 3 | eval events: 2
