
Implement per column compression#3396

Open
rahil-c wants to merge 1 commit into apache:master from rahil-c:rahil/per-column-compression

Conversation

@rahil-c

@rahil-c rahil-c commented Feb 16, 2026

Rationale for this change

Issue Raised here: apache/parquet-format#553

The Parquet spec already supports per-column compression: each column chunk stores its own CompressionCodecName in the footer metadata. However, the parquet-java writer API currently forces a single compression codec for all columns in a file. This PR addresses that gap by exposing per-column compression configuration through the existing ColumnProperty infrastructure.

What changes are included in this PR?

  • ParquetProperties: Added a ColumnProperty for compression, following the same pattern used for dictionary encoding and bloom filters.
  • ColumnChunkPageWriteStore: Added a new constructor that accepts CompressionCodecFactory + ParquetProperties
  • InternalParquetRecordWriter: Added a new constructor accepting CompressionCodecFactory instead of a single BytesInputCompressor.
  • ParquetWriter: Added a withCompressionCodec(String, CompressionCodecName) builder method and updated the core constructor to pass the CompressionCodecFactory through to the writer stack.
  • ParquetOutputFormat: Added a ColumnConfigParser entry so per-column compression can be configured via Hadoop config keys (parquet.compression#<column_path>=CODEC).
  • ParquetRecordWriter: Updated to pass CompressionCodecFactory to InternalParquetRecordWriter.
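The per-column lookup the first bullet describes can be modeled with a small self-contained sketch. Everything below is illustrative only — the class and method names are hypothetical stand-ins, not the actual ParquetProperties/ColumnProperty API:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the ColumnProperty pattern: one default value plus
// per-column overrides, keyed by the column's dot-separated path.
public class PerColumnProperty<T> {
    private final T defaultValue;
    private final Map<String, T> overrides = new HashMap<>();

    public PerColumnProperty(T defaultValue) {
        this.defaultValue = defaultValue;
    }

    // Register an override for one column path (fluent, like the builder APIs).
    public PerColumnProperty<T> with(String columnPath, T value) {
        overrides.put(columnPath, value);
        return this;
    }

    // Resolve the value for a column, falling back to the default.
    public T get(String columnPath) {
        return overrides.getOrDefault(columnPath, defaultValue);
    }

    public static void main(String[] args) {
        PerColumnProperty<String> codec = new PerColumnProperty<>("SNAPPY")
                .with("embeddings", "UNCOMPRESSED");
        System.out.println(codec.get("embeddings")); // UNCOMPRESSED
        System.out.println(codec.get("id"));         // SNAPPY
    }
}
```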

Are these changes tested?

  • Added a test within this PR.

Are there any user-facing changes?

Two new public APIs are introduced:

  ParquetWriter.builder(path)
      .withCompressionCodec(CompressionCodecName.SNAPPY)  // default for all columns
      .withCompressionCodec("embeddings", CompressionCodecName.UNCOMPRESSED)  // per-column override
      .build();

Hadoop configuration (new key pattern):

  parquet.compression#<column_path>=<CODEC_NAME>
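The key pattern above can be parsed with a short sketch. A plain Map stands in for a Hadoop Configuration here, and the helper name is hypothetical — this is not the PR's ColumnConfigParser code:

```java
import java.util.HashMap;
import java.util.Map;

public class CompressionKeyParser {
    static final String PREFIX = "parquet.compression";

    // Extract per-column codec names from entries of the form
    // "parquet.compression#<column_path>=<CODEC_NAME>"; the bare
    // "parquet.compression" key (the file-wide default) is skipped.
    public static Map<String, String> perColumnCodecs(Map<String, String> conf) {
        Map<String, String> result = new HashMap<>();
        for (Map.Entry<String, String> e : conf.entrySet()) {
            String key = e.getKey();
            if (key.startsWith(PREFIX + "#")) {
                String columnPath = key.substring(PREFIX.length() + 1);
                result.put(columnPath, e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("parquet.compression", "SNAPPY");                  // file-wide default
        conf.put("parquet.compression#embeddings", "UNCOMPRESSED"); // per-column override
        System.out.println(perColumnCodecs(conf)); // {embeddings=UNCOMPRESSED}
    }
}
```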

cc @julienledem @emkornfield

@rahil-c marked this pull request as ready for review February 17, 2026 02:44
@rahil-c changed the title from "[draft] Implement per column compression" to "Implement per column compression" Feb 17, 2026
* @param codecName the compression codec to use by default
* @return this builder for method chaining.
*/
public Builder withCompressionCodec(CompressionCodecName codecName) {
Contributor

There isn't an existing API to set this? I have to look more closely at the convention, but would withDefaultCompressionCodec make sense?

Author

There is an existing API on the ParquetWriter builder class: https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java#L567
However, there is no API within this properties class, which we use in a couple of places to pass the column's compression.

In terms of the naming I am OK with that suggestion; however, I noticed the convention for these APIs is not to prefix with Default, even when setting a default value: https://github.com/apache/parquet-java/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L484

BytesInputCompressor compressor = codecFactory.getCompressor(props.getCompressionCodec(path));
writers.put(
    path,
    new ColumnChunkPageWriter(
Contributor

Is this copied and pasted from other constructors? I wonder if there is some refactoring that can be done to avoid duplication (perhaps a ColumnChunkPageWriterBuilder?).

Author

It looks like we currently do not have a ColumnChunkPageWriterBuilder; if that is the ideal pattern, I can look into adding it.
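For discussion, a builder along those lines might be shaped roughly like this. It is a self-contained toy: the class name, parameters, and String-returning build() are all hypothetical stand-ins, not the actual ColumnChunkPageWriter collaborators:

```java
// Hypothetical builder shape to reduce constructor duplication: each
// setter returns the builder, and build() assembles the final object
// (here just a String standing in for a ColumnChunkPageWriter).
public class PageWriterBuilder {
    private String columnPath;
    private String codecName = "UNCOMPRESSED"; // assumed default
    private int pageSize = 1024 * 1024;        // assumed 1 MiB default

    public PageWriterBuilder forColumn(String columnPath) {
        this.columnPath = columnPath;
        return this;
    }

    public PageWriterBuilder withCodec(String codecName) {
        this.codecName = codecName;
        return this;
    }

    public PageWriterBuilder withPageSize(int pageSize) {
        this.pageSize = pageSize;
        return this;
    }

    // Stands in for constructing the real page writer.
    public String build() {
        return columnPath + ":" + codecName + ":" + pageSize;
    }
}
```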

fileEncryptor,
rowGroupOrdinal);
ColumnChunkPageWriteStore columnChunkPageWriteStore;
if (codecFactory != null) {
Contributor

Is it possible to create a default codecFactory to avoid the if/else block below?

Author

In my mind the if/else seemed simpler, since some callers will not be providing or using the codec factory, so I am not sure having a default codec factory there would make sense; for example:
https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetRecordWriter.java#L106

Let me know, though, if you think we should still pursue the default codec factory approach.
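For comparison, the reviewer's default-factory idea amounts to a null-object pattern: callers that don't supply a factory get one that always hands back the single configured compressor, so the write path needs no null check. The interfaces below are illustrative only, not the real CodecFactory/BytesInputCompressor API:

```java
// Minimal stand-in for a compressor.
interface Compressor {
    String name();
}

// Minimal stand-in for a per-column compressor factory.
interface CompressorFactory {
    Compressor getCompressor(String columnPath);
}

// Null-object default: wraps one compressor as a factory that ignores
// the column path, so legacy single-codec callers need no if/else.
class SingleCodecFactory implements CompressorFactory {
    private final Compressor compressor;

    SingleCodecFactory(Compressor compressor) {
        this.compressor = compressor;
    }

    @Override
    public Compressor getCompressor(String columnPath) {
        return compressor;
    }
}
```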

}

@Test
public void testPerColumnCompression() throws Exception {
Contributor

Does this test anything that the next method does not?

Author

You are correct, the second test offers the same coverage with additional compression codecs being tested. I can remove this test.

* @param validating if schema validation should be turned on
* @param props parquet encoding properties
*/
ParquetRecordWriter(
Contributor

@emkornfield Feb 22, 2026

It seems like this should be deprecated, and a new method without the codec passed in should be exposed instead?

Author

I think you are correct. Since callers will now pass a ParquetProperties props that uses the withCompressionCodec builder method, I think we can expose a new constructor without the CompressionCodecName codec parameter itself.
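The constructor-deprecation pattern being discussed — keep the old signature, mark it @Deprecated, and delegate to the new one — can be illustrated with a self-contained toy. All names are hypothetical, and a plain Map stands in for ParquetProperties:

```java
import java.util.HashMap;
import java.util.Map;

public class RecordWriterExample {
    private final String codec;

    // New constructor: the codec comes from the properties themselves.
    public RecordWriterExample(Map<String, String> props) {
        this.codec = props.getOrDefault("compression", "UNCOMPRESSED");
    }

    // Old constructor kept for source compatibility; it simply folds the
    // explicit codec into the properties and delegates to the new one.
    @Deprecated
    public RecordWriterExample(Map<String, String> props, String codec) {
        this(withCodec(props, codec));
    }

    private static Map<String, String> withCodec(Map<String, String> props, String codec) {
        Map<String, String> copy = new HashMap<>(props);
        copy.put("compression", codec);
        return copy;
    }

    public String codec() {
        return codec;
    }
}
```

Existing callers keep compiling (with a deprecation warning), while new code has a single source of truth for the codec.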

this.codecFactory = new CodecFactory(conf, props.getPageSizeThreshold());
// Ensure the default compression codec from ParquetOutputFormat is set in props
ParquetProperties propsWithCodec =
ParquetProperties.copy(props).withCompressionCodec(codec).build();
Contributor

Does this risk overwriting an already set compression codec?
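One hedged way to avoid that overwrite is a set-if-absent guard: apply the format-level default only when the properties carry no explicit codec. The helper below is hypothetical (a plain Map stands in for ParquetProperties, which may instead need an "explicitly set" flag on its builder):

```java
import java.util.HashMap;
import java.util.Map;

public class CodecDefaults {
    // Copy the properties and apply the default codec only when no
    // compression setting is already present, so an explicit choice
    // made earlier is never clobbered.
    public static Map<String, String> withDefaultCodec(Map<String, String> props, String defaultCodec) {
        Map<String, String> copy = new HashMap<>(props);
        copy.putIfAbsent("compression", defaultCodec);
        return copy;
    }
}
```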

Contributor

@emkornfield left a comment

Took a first pass; I'm not very familiar with this code, but I'd also expect potentially more test coverage given the number of classes changed?

Author

@rahil-c commented Feb 22, 2026

> Took a first pass not very familiar with this code, but I'd also expect potentially more test coverage given the number of classes changed?

Will add more coverage
