[SPARK-55440][SQL] Types Framework - Phase 1a - Core Type System Foundation #54223
### What changes were proposed in this pull request?
This PR introduces the foundation of the Spark Types Framework - a system for centralizing type-specific operations that are currently scattered across 50+ files using diverse patterns.
**Framework interfaces** (9 new files in `sql/api` and `sql/catalyst`):
- `TypeOps` (catalyst), `TypeApiOps` (sql-api)
- `PhyTypeOps` (physical type representation)
- `LiteralTypeOps` (literal creation)
- `ExternalTypeOps` (internal/external type conversion)
- `FormatTypeOps` (string formatting)
- `EncodeTypeOps` (row encoders)
- `TimeTypeOps` + `TimeTypeApiOps` for `TimeType`
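To make the shape of these interfaces concrete, here is a minimal sketch of how such capability traits might be factored. The method names and signatures below are illustrative assumptions, not the exact definitions in this PR.

```scala
import org.apache.spark.sql.types.DataType

// Illustrative sketch only: each capability is a small trait keyed by the DataType it
// handles, so each integration point depends only on the capability it needs.
trait TypeOps {
  def dataType: DataType
}

trait LiteralTypeOps { self: TypeOps =>
  // Default literal value for this type (the kind of hook Literal.default would use).
  def defaultLiteral: Any
}

trait FormatTypeOps { self: TypeOps =>
  // Render an internal Catalyst value as a user-facing string (the kind of hook
  // ToStringBase would use).
  def format(value: Any): String
}
```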
**Integration points** (10 existing files modified with a check-and-delegate pattern):
- `PhysicalDataType.scala` - physical type dispatch
- `CatalystTypeConverters.scala` - external/internal type conversion (via a `TypeOpsConverter` adapter)
- `ToStringBase.scala` - string formatting
- `RowEncoder.scala` - row encoding
- `literals.scala` - default literal creation (`Literal.default`)
- `EncoderUtils.scala` - encoder Java class mapping
- `CodeGenerator.scala` - codegen Java class mapping
- `SpecificInternalRow.scala` - mutable value creation
- `InternalRow.scala` - row writer dispatch
**Feature flag:** `spark.sql.types.framework.enabled` (defaults to `true` in tests via `Utils.isTesting`, `false` otherwise), configured in `SQLConf.scala` + `SqlApiConf.scala`.
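For context, a boolean flag like this would typically be declared with the existing `buildConf` DSL in `SQLConf.scala`. The sketch below follows that pattern; the doc text and version string are assumptions, not the PR's exact wording.

```scala
// Sketch of a SQLConf entry following the usual buildConf pattern.
// Lives inside object SQLConf; requires: import org.apache.spark.util.Utils
val TYPES_FRAMEWORK_ENABLED =
  buildConf("spark.sql.types.framework.enabled")
    .internal()
    .doc("When true, type-specific operations for framework-supported types are " +
      "dispatched through the Types Framework Ops classes instead of direct pattern matching.")
    .version("4.1.0") // placeholder version
    .booleanConf
    .createWithDefault(Utils.isTesting)
```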
**Interface hierarchy:** The split across `sql/api` and `sql/catalyst` follows existing Spark module separation: `TypeApiOps` lives in `sql/api` for client-side operations that depend on `AgnosticEncoder`, while `TypeOps` lives in `sql/catalyst` for server-side operations that depend on `InternalRow`, `PhysicalDataType`, etc.

All integration points use a check-and-delegate pattern at the beginning of existing match statements.
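A minimal sketch of what such a guard might look like, using `PhysicalDataType` dispatch as the example; the `TypeOps.lookup` registry accessor and the `typesFrameworkEnabled` conf getter are assumed names for illustration, not the PR's exact API.

```scala
import org.apache.spark.sql.catalyst.types._
import org.apache.spark.sql.internal.SQLConf
import org.apache.spark.sql.types._

// Sketch of the check-and-delegate pattern added at the top of an existing match.
// TypeOps.lookup and SQLConf.get.typesFrameworkEnabled are hypothetical names.
def apply(dt: DataType): PhysicalDataType = dt match {
  // New: if the framework is enabled and this type is registered, delegate to its Ops.
  case t if SQLConf.get.typesFrameworkEnabled && TypeOps.lookup(t).isDefined =>
    TypeOps.lookup(t).get.physicalDataType
  // Existing dispatch continues unchanged for all other types.
  case BooleanType => PhysicalBooleanType
  case ByteType => PhysicalByteType
  // ... remaining existing cases unchanged
}
```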
This is the first of several planned PRs. Subsequent PRs will add client-side integrations (Spark Connect proto, Arrow SerDe, JDBC, Python, Thrift) and storage format integrations (Parquet, ORC, CSV, JSON, etc.).
### Why are the changes needed?
Adding a new data type to Spark currently requires modifying 50+ files with scattered type-specific logic. Each file has its own conventions, and there is no compiler assistance to ensure completeness. Integration points are non-obvious and easy to miss - patterns include `_: TimeType` in Scala pattern matching, `TimeNanoVector` in Arrow SerDe, `.hasTime()`/`.getTime()` in proto fields, `LocalTimeEncoder` in encoder helpers, `java.sql.Types.TIME` in JDBC, `instanceof TimeType` in Java files, and compound matches like `case LongType | ... | _: TimeType =>` that are invisible to naive searches.

The framework centralizes type-specific infrastructure operations in Ops interface classes. When adding a new type with the framework in place, a developer creates two Ops classes (one in `sql/api`, one in `sql/catalyst`) and registers them in the corresponding factory objects. The compiler enforces that all required interface methods are implemented, significantly reducing the risk of missing integration points.

**Concrete example - TimeType:** `TimeType` has integration points spread across 50+ files using the diverse patterns listed above (physical type mapping, literals, type converters, encoders, formatters, Arrow SerDe, proto conversion, JDBC, Python, Thrift, storage formats). With the framework, these are consolidated into two Ops classes: `TimeTypeOps` (~150 lines) and `TimeTypeApiOps` (~90 lines). A developer adding a new type with similar complexity would create two analogous files instead of touching 50+ files. The framework does not cover type-specific expressions (e.g., `CurrentTime`, `TimeAddInterval`) or SQL parser changes, which are inherently type-specific; it provides the primitives those build on.

This PR covers the core infrastructure integration. Subsequent PRs will add client-side integrations (Spark Connect proto, Arrow SerDe, JDBC, Python, Thrift) and storage format integrations (Parquet, ORC, CSV, JSON, etc.).
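As an illustration of the two-file workflow and the factory registration step described above, here is a hedged sketch; the object and method names are hypothetical, not the PR's exact factory API.

```scala
import org.apache.spark.sql.types.{DataType, TimeType}

// Hypothetical registration sketch; the PR's actual factory objects may look different.
// TimeTypeOps / TimeTypeApiOps constructors are assumed for illustration.

// sql/catalyst side: server-side Ops factory.
object TypeOps {
  def lookup(dt: DataType): Option[TypeOps] = dt match {
    case t: TimeType => Some(new TimeTypeOps(t)) // new type registered here
    case _ => None
  }
}

// sql/api side: client-side Ops factory.
object TypeApiOps {
  def lookup(dt: DataType): Option[TypeApiOps] = dt match {
    case t: TimeType => Some(new TimeTypeApiOps(t))
    case _ => None
  }
}
```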
### Does this PR introduce any user-facing change?
No. This is an internal refactoring behind a feature flag (`spark.sql.types.framework.enabled`). When the flag is enabled, framework-supported types use centralized Ops dispatch instead of direct pattern matching. Behavior is identical in both paths. The flag defaults to `true` in tests and `false` otherwise.

### How was this patch tested?
The framework is a refactoring of existing dispatch logic - it changes the mechanism but preserves identical behavior. The feature flag is enabled by default in test environments (`Utils.isTesting`), so the entire existing test suite validates the framework code path. No new tests are added in this PR because the framework delegates to the same underlying logic that existing tests already cover.

In subsequent phases, the testing focus will be on:
### Was this patch authored or co-authored using generative AI tooling?
Co-authored with: claude-opus-4-6