
Implement per column compression#3396

Open
rahil-c wants to merge 1 commit into apache:master from rahil-c:rahil/per-column-compression

Conversation

@rahil-c

@rahil-c rahil-c commented Feb 16, 2026

Rationale for this change

Issue Raised here: apache/parquet-format#553

The Parquet spec already supports per-column compression: each column chunk stores its own CompressionCodecName in the footer metadata. However, the parquet-java writer API currently forces a single compression codec for all columns in a file. This PR addresses that gap by exposing per-column compression configuration through the existing ColumnProperty infrastructure.

What changes are included in this PR?

  • ParquetProperties: Added a ColumnProperty for compression, following the same pattern used for dictionary encoding and bloom filters.
  • ColumnChunkPageWriteStore: Added a new constructor that accepts CompressionCodecFactory + ParquetProperties
  • InternalParquetRecordWriter: Added a new constructor accepting CompressionCodecFactory instead of a single BytesInputCompressor.
  • ParquetWriter: Added a withCompressionCodec(String, CompressionCodecName) builder method and updated the core constructor to pass the CompressionCodecFactory through to the writer stack.
  • ParquetOutputFormat: Added a ColumnConfigParser entry so per-column compression can be configured via Hadoop config keys (parquet.compression#<column_path>=CODEC).
  • ParquetRecordWriter: Updated to pass CompressionCodecFactory to InternalParquetRecordWriter.
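The per-column lookup the first bullet describes can be modeled with a small self-contained sketch. Everything below is illustrative only — the class and method names are hypothetical stand-ins, not the actual ParquetProperties/ColumnProperty API:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the ColumnProperty pattern: one default value plus
// per-column overrides, keyed by the column's dot-separated path.
public class PerColumnProperty<T> {
    private final T defaultValue;
    private final Map<String, T> overrides = new HashMap<>();

    public PerColumnProperty(T defaultValue) {
        this.defaultValue = defaultValue;
    }

    // Register an override for one column path (fluent, like the builder APIs).
    public PerColumnProperty<T> with(String columnPath, T value) {
        overrides.put(columnPath, value);
        return this;
    }

    // Resolve the value for a column, falling back to the default.
    public T get(String columnPath) {
        return overrides.getOrDefault(columnPath, defaultValue);
    }

    public static void main(String[] args) {
        PerColumnProperty<String> codec = new PerColumnProperty<>("SNAPPY")
                .with("embeddings", "UNCOMPRESSED");
        System.out.println(codec.get("embeddings")); // UNCOMPRESSED
        System.out.println(codec.get("id"));         // SNAPPY
    }
}
```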

Are these changes tested?

  • Added a test within this PR.

Are there any user-facing changes?

Two new public APIs are introduced:

  ParquetWriter.builder(path)
      .withCompressionCodec(CompressionCodecName.SNAPPY)  // default for all columns
      .withCompressionCodec("embeddings", CompressionCodecName.UNCOMPRESSED)  // per-column override
      .build();

Hadoop configuration (new key pattern):

  parquet.compression#<column_path>=<CODEC_NAME>
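The key pattern above can be parsed with a short sketch. A plain Map stands in for a Hadoop Configuration here, and the helper name is hypothetical — this is not the PR's ColumnConfigParser code:

```java
import java.util.HashMap;
import java.util.Map;

public class CompressionKeyParser {
    static final String PREFIX = "parquet.compression";

    // Extract per-column codec names from entries of the form
    // "parquet.compression#<column_path>=<CODEC_NAME>"; the bare
    // "parquet.compression" key (the file-wide default) is skipped.
    public static Map<String, String> perColumnCodecs(Map<String, String> conf) {
        Map<String, String> result = new HashMap<>();
        for (Map.Entry<String, String> e : conf.entrySet()) {
            String key = e.getKey();
            if (key.startsWith(PREFIX + "#")) {
                String columnPath = key.substring(PREFIX.length() + 1);
                result.put(columnPath, e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("parquet.compression", "SNAPPY");                  // file-wide default
        conf.put("parquet.compression#embeddings", "UNCOMPRESSED"); // per-column override
        System.out.println(perColumnCodecs(conf)); // {embeddings=UNCOMPRESSED}
    }
}
```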

cc @julienledem @emkornfield

@rahil-c marked this pull request as ready for review February 17, 2026 02:44
@rahil-c changed the title from "[draft] Implement per column compression" to "Implement per column compression" Feb 17, 2026
* @param codecName the compression codec to use by default
* @return this builder for method chaining.
*/
public Builder withCompressionCodec(CompressionCodecName codecName) {
Contributor

There isn't an existing API to set this? I have to look more closely at the convention, but would withDefaultCompressionCodec make sense?

Author

There is an existing API on the ParquetWriter builder class: https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java#L567
However, there is no API within this properties class, which we use in a couple of places to pass the column's compression.

In terms of the naming I am OK with that suggestion; however, I noticed the convention for these APIs is not to prefix with Default, even when setting a default value: https://github.com/apache/parquet-java/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L484

BytesInputCompressor compressor = codecFactory.getCompressor(props.getCompressionCodec(path));
writers.put(
    path,
    new ColumnChunkPageWriter(
Contributor

Is this copied and pasted from other constructors? I wonder if there is some refactoring that can be done to avoid duplication (perhaps a ColumnChunkPageWriterBuilder?).

Author

It looks like we currently do not have a ColumnChunkPageWriterBuilder; if that is the ideal pattern, I can look into adding it.
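For discussion, a builder along those lines might be shaped roughly like this. It is a self-contained toy: the class name, parameters, and String-returning build() are all hypothetical stand-ins, not the actual ColumnChunkPageWriter collaborators:

```java
// Hypothetical builder shape to reduce constructor duplication: each
// setter returns the builder, and build() assembles the final object
// (here just a String standing in for a ColumnChunkPageWriter).
public class PageWriterBuilder {
    private String columnPath;
    private String codecName = "UNCOMPRESSED"; // assumed default
    private int pageSize = 1024 * 1024;        // assumed 1 MiB default

    public PageWriterBuilder forColumn(String columnPath) {
        this.columnPath = columnPath;
        return this;
    }

    public PageWriterBuilder withCodec(String codecName) {
        this.codecName = codecName;
        return this;
    }

    public PageWriterBuilder withPageSize(int pageSize) {
        this.pageSize = pageSize;
        return this;
    }

    // Stands in for constructing the real page writer.
    public String build() {
        return columnPath + ":" + codecName + ":" + pageSize;
    }
}
```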

fileEncryptor,
rowGroupOrdinal);
ColumnChunkPageWriteStore columnChunkPageWriteStore;
if (codecFactory != null) {
Contributor

Is it possible to create a default codecFactory to avoid the if/else block below?

Author

In my mind the if/else seemed simpler, since some callers will not be providing or using the codec factory, so I am not sure having a default codec factory there would make sense; for example:
https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetRecordWriter.java#L106

Let me know, though, if you think we should still pursue the default codec factory approach.
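For comparison, the reviewer's default-factory idea amounts to a null-object pattern: callers that don't supply a factory get one that always hands back the single configured compressor, so the write path needs no null check. The interfaces below are illustrative only, not the real CodecFactory/BytesInputCompressor API:

```java
// Minimal stand-in for a compressor.
interface Compressor {
    String name();
}

// Minimal stand-in for a per-column compressor factory.
interface CompressorFactory {
    Compressor getCompressor(String columnPath);
}

// Null-object default: wraps one compressor as a factory that ignores
// the column path, so legacy single-codec callers need no if/else.
class SingleCodecFactory implements CompressorFactory {
    private final Compressor compressor;

    SingleCodecFactory(Compressor compressor) {
        this.compressor = compressor;
    }

    @Override
    public Compressor getCompressor(String columnPath) {
        return compressor;
    }
}
```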

}

@Test
public void testPerColumnCompression() throws Exception {
Contributor

Does this test anything that the next method does not?

Author

You are correct, the second test offers the same coverage with additional compression codecs being tested. I can remove this test.

* @param validating if schema validation should be turned on
* @param props parquet encoding properties
*/
ParquetRecordWriter(
Contributor

@emkornfield Feb 22, 2026

It seems like this should be deprecated, and a new method without the codec passed in should be exposed instead?

Author

I think you are correct. Since callers will now pass a ParquetProperties props that uses the withCompressionCodec builder method, I think we can expose a new constructor without the CompressionCodecName codec parameter itself.
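The constructor-deprecation pattern being discussed — keep the old signature, mark it @Deprecated, and delegate to the new one — can be illustrated with a self-contained toy. All names are hypothetical, and a plain Map stands in for ParquetProperties:

```java
import java.util.HashMap;
import java.util.Map;

public class RecordWriterExample {
    private final String codec;

    // New constructor: the codec comes from the properties themselves.
    public RecordWriterExample(Map<String, String> props) {
        this.codec = props.getOrDefault("compression", "UNCOMPRESSED");
    }

    // Old constructor kept for source compatibility; it simply folds the
    // explicit codec into the properties and delegates to the new one.
    @Deprecated
    public RecordWriterExample(Map<String, String> props, String codec) {
        this(withCodec(props, codec));
    }

    private static Map<String, String> withCodec(Map<String, String> props, String codec) {
        Map<String, String> copy = new HashMap<>(props);
        copy.put("compression", codec);
        return copy;
    }

    public String codec() {
        return codec;
    }
}
```

Existing callers keep compiling (with a deprecation warning), while new code has a single source of truth for the codec.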

this.codecFactory = new CodecFactory(conf, props.getPageSizeThreshold());
// Ensure the default compression codec from ParquetOutputFormat is set in props
ParquetProperties propsWithCodec =
ParquetProperties.copy(props).withCompressionCodec(codec).build();
Contributor

Does this risk overwriting an already set compression codec?
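One hedged way to avoid that overwrite is a set-if-absent guard: apply the format-level default only when the properties carry no explicit codec. The helper below is hypothetical (a plain Map stands in for ParquetProperties, which may instead need an "explicitly set" flag on its builder):

```java
import java.util.HashMap;
import java.util.Map;

public class CodecDefaults {
    // Copy the properties and apply the default codec only when no
    // compression setting is already present, so an explicit choice
    // made earlier is never clobbered.
    public static Map<String, String> withDefaultCodec(Map<String, String> props, String defaultCodec) {
        Map<String, String> copy = new HashMap<>(props);
        copy.putIfAbsent("compression", defaultCodec);
        return copy;
    }
}
```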

Contributor

@emkornfield left a comment

Took a first pass; I'm not very familiar with this code, but I'd also expect potentially more test coverage given the number of classes changed?

Author

@rahil-c commented Feb 22, 2026

> Took a first pass not very familiar with this code, but I'd also expect potentially more test coverage given the number of classes changed?

Will add more coverage
