Conversation
| * @param codecName the compression codec to use by default | ||
| * @return this builder for method chaining. | ||
| */ | ||
| public Builder withCompressionCodec(CompressionCodecName codecName) { |
There was a problem hiding this comment.
There isn't an existing API to set this? I have to look more closely at the the convention but would withDefaultCompressionCodec make sense?
There was a problem hiding this comment.
There is an existing api at the Parquet writer builder class: https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java#L567
However there is not an api within this properties class, which we use this class in a couple of places to pass the column's compression.
In terms of the naming I am ok with that suggestion, however I noticed the pattern for the APIs to not prefix with Default, even when setting a default value: https://github.com/apache/parquet-java/blob/master/parquet-column/src/main/java/org/apache/parquet/column/ParquetProperties.java#L484
| BytesInputCompressor compressor = codecFactory.getCompressor(props.getCompressionCodec(path)); | ||
| writers.put( | ||
| path, | ||
| new ColumnChunkPageWriter( |
There was a problem hiding this comment.
Is this copy and paste from other constructors, I wonder if there is some refactoring that can be done to avoid duplication? (I wonder if we should have a ColumnChunkPageWriterBuilder?
| fileEncryptor, | ||
| rowGroupOrdinal); | ||
| ColumnChunkPageWriteStore columnChunkPageWriteStore; | ||
| if (codecFactory != null) { |
There was a problem hiding this comment.
Is it possible to create a default codecFactory to avoid the if/else block below?
There was a problem hiding this comment.
In my mind I thought the if/else would be simpler, since there are callers that will not be providing/using the codec factory so not sure if having the default codec there would make sense, such as the following:
https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetRecordWriter.java#L106
Let me know though if you think we should still pursue the default code factory approach.
| } | ||
|
|
||
| @Test | ||
| public void testPerColumnCompression() throws Exception { |
There was a problem hiding this comment.
does this test anything that the next method does not?
There was a problem hiding this comment.
You are correct the second test offers the same coverage, with additonal compression codecs being tested. I can remove this test.
| * @param validating if schema validation should be turned on | ||
| * @param props parquet encoding properties | ||
| */ | ||
| ParquetRecordWriter( |
There was a problem hiding this comment.
It seems like this should be deprecated, and new method without codec passed in should be exposed instead?
There was a problem hiding this comment.
I think you are correct, since the callers now will pass a ParquetProperties props is using the compression codec builder method withCompressionCodec, so I think we can expose a new constructor without CompressionCodecName codec itself.
| this.codecFactory = new CodecFactory(conf, props.getPageSizeThreshold()); | ||
| // Ensure the default compression codec from ParquetOutputFormat is set in props | ||
| ParquetProperties propsWithCodec = | ||
| ParquetProperties.copy(props).withCompressionCodec(codec).build(); |
There was a problem hiding this comment.
does risk overwriting an already set compression codec?
emkornfield
left a comment
There was a problem hiding this comment.
Took a first pass not very familiar with this code, but I'd also expect potentially more test coverage given the number of classes changed?
Will add more coverage |

Rationale for this change
Issue Raised here: apache/parquet-format#553
The Parquet spec already supports per-column compression, each column chunk stores its own CompressionCodecName in the footer metadata. However, the parquet-java writer API currently forces a single compression codec for all columns in a file. This PR address that gap by exposing per-column compression configuration through the existing ColumnProperty infrastructure.
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?
Two new public APIs are introduced:
cc @julienledem @emkornfield