Conversation

@davidm-db (Contributor)

What changes were proposed in this pull request?

This PR introduces the foundation of the Spark Types Framework - a system for centralizing type-specific operations that are currently scattered across 50+ files using diverse patterns.

Framework interfaces (9 new files in sql/api and sql/catalyst):

  • Base traits: TypeOps (catalyst), TypeApiOps (sql-api)
  • Core interfaces: PhyTypeOps (physical type representation), LiteralTypeOps (literal creation), ExternalTypeOps (internal/external type conversion), FormatTypeOps (string formatting), EncodeTypeOps (row encoders)
  • Proof-of-concept implementation: TimeTypeOps + TimeTypeApiOps for TimeType

Integration points (10 existing files modified with check-and-delegate pattern):

  • PhysicalDataType.scala - physical type dispatch
  • CatalystTypeConverters.scala - external/internal type conversion (via TypeOpsConverter adapter)
  • ToStringBase.scala - string formatting
  • RowEncoder.scala - row encoding
  • literals.scala - default literal creation (Literal.default)
  • EncoderUtils.scala - encoder Java class mapping
  • CodeGenerator.scala - codegen Java class mapping
  • SpecificInternalRow.scala - mutable value creation
  • InternalRow.scala - row writer dispatch

Feature flag: spark.sql.types.framework.enabled (defaults to true in tests via Utils.isTesting, false otherwise), configured in SQLConf.scala + SqlApiConf.scala.
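For reference, the flag declaration in SQLConf.scala would follow the standard buildConf pattern. The sketch below is illustrative only - the val name, doc text, and version string are assumptions, not quotes from the PR:

// Illustrative sketch only -- field name, doc text, and version are assumed.
val TYPES_FRAMEWORK_ENABLED = buildConf("spark.sql.types.framework.enabled")
  .internal()
  .doc("When true, type-specific operations for framework-supported types are " +
    "dispatched through the Types Framework Ops classes instead of direct pattern matching.")
  .version("4.1.0")
  .booleanConf
  .createWithDefault(Utils.isTesting)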

Interface hierarchy:

TypeOps (catalyst)
+-- PhyTypeOps      - Physical representation (PhysicalDataType, Java class, MutableValue, row writer)
+-- LiteralTypeOps  - Literal creation and defaults
+-- ExternalTypeOps - External <-> internal type conversion

TypeApiOps (sql-api)
+-- FormatTypeOps   - String formatting (CAST to STRING, SQL values)
+-- EncodeTypeOps   - Row encoding (AgnosticEncoder)

The split across sql/api and sql/catalyst follows the existing Spark module separation - TypeApiOps lives in sql/api for client-side operations that depend on AgnosticEncoder, while TypeOps lives in sql/catalyst for server-side operations that depend on InternalRow, PhysicalDataType, etc.
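For illustration, a minimal Scala sketch of how such a hierarchy could be expressed is shown below; all method names here are assumptions chosen to mirror the descriptions above, not the actual interface from the PR:

import org.apache.spark.sql.types.DataType
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.catalyst.types.PhysicalDataType
import org.apache.spark.sql.catalyst.encoders.AgnosticEncoder

// Hypothetical sketch of the Ops hierarchy (method names are assumed).

// sql/catalyst side:
trait TypeOps { def dataType: DataType }

trait PhyTypeOps extends TypeOps {
  def physicalDataType: PhysicalDataType  // physical representation
  def javaClass: Class[_]                 // Java class for codegen/encoders
}

trait LiteralTypeOps extends TypeOps {
  def defaultLiteral: Literal             // value behind Literal.default
}

trait ExternalTypeOps extends TypeOps {
  def toInternal(external: Any): Any      // external -> Catalyst value
  def toExternal(internal: Any): Any      // Catalyst -> external value
}

// sql/api side:
trait TypeApiOps { def dataType: DataType }

trait FormatTypeOps extends TypeApiOps {
  def format(internal: Any): String       // CAST-to-STRING / SQL value formatting
}

trait EncodeTypeOps extends TypeApiOps {
  def encoder: AgnosticEncoder[_]         // row encoder
}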

All integration points use a check-and-delegate pattern at the beginning of existing match statements:

def someOperation(dt: DataType) = dt match {
  // Framework dispatch (guarded by feature flag)
  case _ if SQLConf.get.typesFrameworkEnabled && SomeOps.supports(dt) =>
    SomeOps(dt).someMethod()
  // Legacy types (unchanged)
  case DateType => ...
  case TimestampType => ...
}
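As a concrete (hypothetical) instance of that pattern, the physical-type dispatch could look roughly like this; PhyTypeOps.supports and PhyTypeOps(dt).physicalDataType are assumed names, not quotes from the PR:

// Hypothetical sketch of framework delegation in PhysicalDataType dispatch.
def apply(dt: DataType): PhysicalDataType = dt match {
  // Framework dispatch (guarded by feature flag)
  case _ if SQLConf.get.typesFrameworkEnabled && PhyTypeOps.supports(dt) =>
    PhyTypeOps(dt).physicalDataType   // e.g. TimeType resolves here instead of a legacy case
  // Legacy types (unchanged)
  case BooleanType => PhysicalBooleanType
  case IntegerType => PhysicalIntegerType
  case LongType => PhysicalLongType
  // ... remaining cases unchanged
}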

This is the first of several planned PRs. Subsequent PRs will add client-side integrations (Spark Connect proto, Arrow SerDe, JDBC, Python, Thrift) and storage format integrations (Parquet, ORC, CSV, JSON, etc.).

Why are the changes needed?

Adding a new data type to Spark currently requires modifying 50+ files with scattered type-specific logic. Each file has its own conventions, and there is no compiler assistance to ensure completeness. Integration points are non-obvious and easy to miss - patterns include _: TimeType in Scala pattern matching, TimeNanoVector in Arrow SerDe, .hasTime()/.getTime() in proto fields, LocalTimeEncoder in encoder helpers, java.sql.Types.TIME in JDBC, instanceof TimeType in Java files, and compound matches like case LongType | ... | _: TimeType => that are invisible to naive searches.

The framework centralizes type-specific infrastructure operations in Ops interface classes. When adding a new type with the framework in place, a developer creates two Ops classes (one in sql/api, one in sql/catalyst) and registers them in the corresponding factory objects. The compiler enforces that all required interface methods are implemented, significantly reducing the risk of missing integration points.
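A sketch of what the factory registration might look like is below; the factory shape, lookup method, and error handling are assumptions for illustration only:

// Hypothetical factory/registration sketch (structure and names are assumed).
object TypeOps {
  // New types register here by adding a case for their DataType.
  private def lookup(dt: DataType): Option[TypeOps] = dt match {
    case t: TimeType => Some(TimeTypeOps(t))
    case _           => None
  }

  def supports(dt: DataType): Boolean = lookup(dt).isDefined

  def apply(dt: DataType): TypeOps =
    lookup(dt).getOrElse(
      throw new IllegalArgumentException(s"$dt is not supported by the types framework"))
}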

Concrete example - TimeType: its integration points are spread across 50+ files using the diverse patterns listed above (physical type mapping, literals, type converters, encoders, formatters, Arrow SerDe, proto conversion, JDBC, Python, Thrift, storage formats). With the framework, these are consolidated into two Ops classes: TimeTypeOps (~150 lines) and TimeTypeApiOps (~90 lines). A developer adding a new type of similar complexity would create two analogous files instead of touching 50+ files. The framework does not cover type-specific expressions (e.g., CurrentTime, TimeAddInterval) or SQL parser changes, which are inherently type-specific - it provides the primitives those build on.

This PR covers the core infrastructure integration; the client-side and storage format integrations listed above will follow in subsequent PRs.

Does this PR introduce any user-facing change?

No. This is an internal refactoring behind a feature flag (spark.sql.types.framework.enabled). When the flag is enabled, framework-supported types use centralized Ops dispatch instead of direct pattern matching. Behavior is identical in both paths. The flag defaults to true in tests and false otherwise.

How was this patch tested?

The framework is a refactoring of existing dispatch logic - it changes the mechanism but preserves identical behavior. The feature flag is enabled by default in test environments (Utils.isTesting), so the entire existing test suite validates the framework code path. No new tests are added in this PR because the framework delegates to the same underlying logic that existing tests already cover.

In subsequent phases, the testing focus will be on:

  1. Testing the framework itself (Ops interface contracts, roundtrip correctness, edge cases)
  2. Designing a generalized testing mechanism that enforces proper test coverage for each type added through the framework

Was this patch authored or co-authored using generative AI tooling?

Co-authored with: claude-opus-4-6

@davidm-db davidm-db marked this pull request as ready for review February 9, 2026 09:25