Conversation

@davidm-db (Contributor)

What changes were proposed in this pull request?

This PR introduces the foundation of the Spark Types Framework - a system for centralizing type-specific operations that are currently scattered across 50+ files using diverse patterns.

Framework interfaces (9 new files in sql/api and sql/catalyst):

  • Base traits: TypeOps (catalyst), TypeApiOps (sql-api)
  • Core interfaces: PhyTypeOps (physical type representation), LiteralTypeOps (literal creation), ExternalTypeOps (internal/external type conversion), FormatTypeOps (string formatting), EncodeTypeOps (row encoders)
  • Proof-of-concept implementation: TimeTypeOps + TimeTypeApiOps for TimeType

Integration points (10 existing files modified with check-and-delegate pattern):

  • PhysicalDataType.scala - physical type dispatch
  • CatalystTypeConverters.scala - external/internal type conversion (via TypeOpsConverter adapter)
  • ToStringBase.scala - string formatting
  • RowEncoder.scala - row encoding
  • literals.scala - default literal creation (Literal.default)
  • EncoderUtils.scala - encoder Java class mapping
  • CodeGenerator.scala - codegen Java class mapping
  • SpecificInternalRow.scala - mutable value creation
  • InternalRow.scala - row writer dispatch

Feature flag: spark.sql.types.framework.enabled (defaults to true in tests via Utils.isTesting, false otherwise), configured in SQLConf.scala + SqlApiConf.scala.
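For reference, the flag declaration in SQLConf.scala would follow the standard buildConf pattern. The sketch below is illustrative only - the val name, doc text, and version string are assumptions, not quotes from the PR:

// Illustrative sketch only -- field name, doc text, and version are assumed.
val TYPES_FRAMEWORK_ENABLED = buildConf("spark.sql.types.framework.enabled")
  .internal()
  .doc("When true, type-specific operations for framework-supported types are " +
    "dispatched through the Types Framework Ops classes instead of direct pattern matching.")
  .version("4.1.0")
  .booleanConf
  .createWithDefault(Utils.isTesting)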

Interface hierarchy:

TypeOps (catalyst)
+-- PhyTypeOps      - Physical representation (PhysicalDataType, Java class, MutableValue, row writer)
+-- LiteralTypeOps  - Literal creation and defaults
+-- ExternalTypeOps - External <-> internal type conversion

TypeApiOps (sql-api)
+-- FormatTypeOps   - String formatting (CAST to STRING, SQL values)
+-- EncodeTypeOps   - Row encoding (AgnosticEncoder)

The split across sql/api and sql/catalyst follows the existing Spark module separation - TypeApiOps lives in sql/api for client-side operations that depend on AgnosticEncoder, while TypeOps lives in sql/catalyst for server-side operations that depend on InternalRow, PhysicalDataType, etc.
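For illustration, a minimal Scala sketch of how such a hierarchy could be expressed is shown below; all method names here are assumptions chosen to mirror the descriptions above, not the actual interface from the PR:

import org.apache.spark.sql.types.DataType
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.types.PhysicalDataType
import org.apache.spark.sql.catalyst.encoders.AgnosticEncoder

// Hypothetical sketch of the Ops hierarchy (method names are assumed).

// sql/catalyst side:
trait TypeOps { def dataType: DataType }

trait PhyTypeOps extends TypeOps {
  def physicalDataType: PhysicalDataType  // physical representation
  def javaClass: Class[_]                 // Java class for codegen/encoders
}

trait LiteralTypeOps extends TypeOps {
  def defaultLiteral: Literal             // value behind Literal.default
}

trait ExternalTypeOps extends TypeOps {
  def toInternal(external: Any): Any      // external -> Catalyst value
  def toExternal(internal: Any): Any      // Catalyst -> external value
}

// sql/api side:
trait TypeApiOps { def dataType: DataType }

trait FormatTypeOps extends TypeApiOps {
  def format(internal: Any): String       // CAST-to-STRING / SQL value formatting
}

trait EncodeTypeOps extends TypeApiOps {
  def encoder: AgnosticEncoder[_]         // row encoder
}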

All integration points use a check-and-delegate pattern at the beginning of existing match statements:

def someOperation(dt: DataType) = dt match {
  // Framework dispatch (guarded by feature flag)
  case _ if SQLConf.get.typesFrameworkEnabled && SomeOps.supports(dt) =>
    SomeOps(dt).someMethod()
  // Legacy types (unchanged)
  case DateType => ...
  case TimestampType => ...
}
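As a concrete (hypothetical) instance of that pattern, the physical-type dispatch could look roughly like this; PhyTypeOps.supports and PhyTypeOps(dt).physicalDataType are assumed names, not quotes from the PR:

// Hypothetical sketch of framework delegation in PhysicalDataType dispatch.
def apply(dt: DataType): PhysicalDataType = dt match {
  // Framework dispatch (guarded by feature flag)
  case _ if SQLConf.get.typesFrameworkEnabled && PhyTypeOps.supports(dt) =>
    PhyTypeOps(dt).physicalDataType   // e.g. TimeType resolves here instead of a legacy case
  // Legacy types (unchanged)
  case BooleanType => PhysicalBooleanType
  case IntegerType => PhysicalIntegerType
  case LongType => PhysicalLongType
  // ... remaining cases unchanged
}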

This is the first of several planned PRs. Subsequent PRs will add client-side integrations (Spark Connect proto, Arrow SerDe, JDBC, Python, Thrift) and storage format integrations (Parquet, ORC, CSV, JSON, etc.).

Why are the changes needed?

Adding a new data type to Spark currently requires modifying 50+ files with scattered type-specific logic. Each file has its own conventions, and there is no compiler assistance to ensure completeness. Integration points are non-obvious and easy to miss - patterns include _: TimeType in Scala pattern matching, TimeNanoVector in Arrow SerDe, .hasTime()/.getTime() in proto fields, LocalTimeEncoder in encoder helpers, java.sql.Types.TIME in JDBC, instanceof TimeType in Java files, and compound matches like case LongType | ... | _: TimeType => that are invisible to naive searches.

The framework centralizes type-specific infrastructure operations in Ops interface classes. When adding a new type with the framework in place, a developer creates two Ops classes (one in sql/api, one in sql/catalyst) and registers them in the corresponding factory objects. The compiler enforces that all required interface methods are implemented, significantly reducing the risk of missing integration points.
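A sketch of what the factory registration might look like is below; the factory shape, lookup method, and error handling are assumptions for illustration only:

// Hypothetical factory/registration sketch (structure and names are assumed).
object TypeOps {
  // New types register here by adding a case for their DataType.
  private def lookup(dt: DataType): Option[TypeOps] = dt match {
    case t: TimeType => Some(TimeTypeOps(t))
    case _           => None
  }

  def supports(dt: DataType): Boolean = lookup(dt).isDefined

  def apply(dt: DataType): TypeOps =
    lookup(dt).getOrElse(
      throw new IllegalArgumentException(s"$dt is not supported by the types framework"))
}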

Concrete example - TimeType: its integration points are spread across 50+ files using the diverse patterns listed above (physical type mapping, literals, type converters, encoders, formatters, Arrow SerDe, proto conversion, JDBC, Python, Thrift, storage formats). With the framework, these are consolidated into two Ops classes: TimeTypeOps (~150 lines) and TimeTypeApiOps (~90 lines). A developer adding a new type of similar complexity would create two analogous files instead of touching 50+ files. The framework does not cover type-specific expressions (e.g., CurrentTime, TimeAddInterval) or SQL parser changes, which are inherently type-specific - it provides the primitives those build on.

This PR covers the core infrastructure integration; the client-side and storage format integrations listed above will follow in subsequent PRs.

Does this PR introduce any user-facing change?

No. This is an internal refactoring behind a feature flag (spark.sql.types.framework.enabled). When the flag is enabled, framework-supported types use centralized Ops dispatch instead of direct pattern matching. Behavior is identical in both paths. The flag defaults to true in tests and false otherwise.

How was this patch tested?

The framework is a refactoring of existing dispatch logic - it changes the mechanism but preserves identical behavior. The feature flag is enabled by default in test environments (Utils.isTesting), so the entire existing test suite validates the framework code path. No new tests are added in this PR because the framework delegates to the same underlying logic that existing tests already cover.

In subsequent phases, the testing focus will be on:

  1. Testing the framework itself (Ops interface contracts, roundtrip correctness, edge cases)
  2. Designing a generalized testing mechanism that enforces proper test coverage for each type added through the framework

Was this patch authored or co-authored using generative AI tooling?

Co-authored with: claude-opus-4-6

@davidm-db davidm-db marked this pull request as ready for review February 9, 2026 09:25