Core: Interface based DataFile reader and writer API #12298
liurenjie1024 left a comment:
Thanks @pvary for this proposal, I left some comments.
I will start to collect the differences here between the different writer types (appender/dataWriter/equalityDeleteWriter/positionalDeleteWriter) for reference:
While I think the goal here is a good one, the implementation looks too complex to be workable in its current form. The primary issue that we currently have is adapting object models (like Iceberg's internal object model):

- switch (format) {
- case AVRO:
- AvroIterable<ManifestEntry<F>> reader =
- Avro.read(file)
- .project(ManifestEntry.wrapFileSchema(Types.StructType.of(fields)))
- .createResolvingReader(this::newReader)
- .reuseContainers()
- .build();
+ CloseableIterable<ManifestEntry<F>> reader =
+ InternalData.read(format, file)
+ .project(ManifestEntry.wrapFileSchema(Types.StructType.of(fields)))
+ .reuseContainers()
+ .build();
- addCloseable(reader);
+ addCloseable(reader);
- return CloseableIterable.transform(reader, inheritableMetadata::apply);
+ return CloseableIterable.transform(reader, inheritableMetadata::apply);
-
- default:
- throw new UnsupportedOperationException("Invalid format for manifest file: " + format);
- }

This shows:
In this PR, there are a lot of other changes as well. I'm looking at one of the simpler Spark cases in the row reader. The builder is initialized from:

return DataFileServiceRegistry.readerBuilder(
    format, InternalRow.class.getName(), file, projection, idToConstant)

There are also new static classes in the file. Each creates a new service and each service creates the builder and object model:

public static class AvroReaderService implements DataFileServiceRegistry.ReaderService {
@Override
public DataFileServiceRegistry.Key key() {
return new DataFileServiceRegistry.Key(FileFormat.AVRO, InternalRow.class.getName());
}
@Override
public ReaderBuilder builder(
InputFile inputFile,
Schema readSchema,
Map<Integer, ?> idToConstant,
DeleteFilter<?> deleteFilter) {
return Avro.read(inputFile)
.project(readSchema)
.createResolvingReader(schema -> SparkPlannedAvroReader.create(schema, idToConstant));
}

In addition, there are now a lot more abstractions:
I think that the next steps are to focus on making this a lot simpler, and there are some good ways to do that:
|
I'm happy that we agree on the goals. I created a PR to start the conversation. If there are willing reviewers, we can introduce more invasive changes to achieve a better API. I'm all for it!
I think we need to keep these direct transformations to prevent the performance loss that would be caused by multiple transformations between object model -> common model -> file format. We have a matrix of transformations which we need to encode somewhere: every object model crossed with every file format.
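As a rough sketch only (class and method names here are illustrative, not the PR's API), such a matrix can be keyed by the file format and the object model name, matching the DataFileServiceRegistry.Key(format, objectModelName) pairing shown earlier in the thread:

import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.iceberg.FileFormat;

// Illustrative only: one registry cell per (file format, object model) pair, so each
// transformation in the matrix is registered and looked up directly, without going
// through a common intermediate model.
class TransformationMatrixSketch<T> {
  static final class Key {
    private final FileFormat format;
    private final String objectModelName;

    Key(FileFormat format, String objectModelName) {
      this.format = format;
      this.objectModelName = objectModelName;
    }

    @Override
    public boolean equals(Object other) {
      if (!(other instanceof Key)) {
        return false;
      }
      Key that = (Key) other;
      return format == that.format && objectModelName.equals(that.objectModelName);
    }

    @Override
    public int hashCode() {
      return Objects.hash(format, objectModelName);
    }
  }

  private final Map<Key, T> factories = new ConcurrentHashMap<>();

  void register(FileFormat format, String objectModelName, T factory) {
    factories.put(new Key(format, objectModelName), factory);
  }

  T lookup(FileFormat format, String objectModelName) {
    return factories.get(new Key(format, objectModelName));
  }
}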
The InternalData reader has one advantage over the data file readers/writers. The internal object model is static for these readers/writers. For the DataFile readers/writers we have multiple object models to handle.
If we allow adding new builders for the file formats, we can remove a good chunk of the boilerplate code. Let me see what this would look like.
We need to refactor the Avro positional delete writer for this, or add a positionalWriterFunc. We also need to consider the format-specific configurations, which are different for the appenders and the delete files (DELETE_PARQUET_ROW_GROUP_SIZE_BYTES vs. PARQUET_ROW_GROUP_SIZE_BYTES).
If we are ok with having a new Builder for the readers/writers, then we don't need the service. It was needed to keep the current APIs and the new APIs compatible.
Will do
Will see what could be achieved.
/** Set the projection schema. */
ReadBuilder<D, S> project(Schema schema);

/** Sets the expected output schema. If not provided derived from the {@link #project(Schema)}. */
I don't think that this javadoc is very clear because of the reference to project. Calling project sets the Iceberg schema, not the engine schema, and the builder won't do anything to create an engine schema from an Iceberg schema.
I think that this should be "Sets the engine's representation of the projected schema."
We also do not want callers to use this in place of the Iceberg schema because this is opaque to core code -- Iceberg can't verify the engine schema or convert between the two projection representations. We should have a note that says this schema should match the requested Iceberg projection, but may differ in ways that Iceberg considers equivalent. For example, we may use this to exchange the engine's requested shredded representation for a variant, and we could also use this to pass things like specific classes to use for structs (like we have for our internal object model).
Also, what about only allowing this to be set if project is also set? I don't think it is necessary to do this, but I also can't think of a case where you might set an engine schema but not an Iceberg schema.
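A purely hypothetical sketch of the distinction: ReadBuilder is the interface quoted above, but engineSchema is an assumed name for the setter under discussion, and the assumption that build() returns a CloseableIterable is mine, not the PR's.

import org.apache.iceberg.Schema;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.spark.SparkSchemaUtil;
import org.apache.iceberg.types.Types;
import org.apache.spark.sql.types.StructType;

// Hypothetical usage: the Iceberg projection drives column pruning, while the engine
// schema only describes how the engine wants those same columns represented.
class ReadProjectionSketch {
  static <D> CloseableIterable<D> read(ReadBuilder<D, StructType> builder) {
    Schema projection =
        new Schema(
            Types.NestedField.required(1, "id", Types.LongType.get()),
            Types.NestedField.optional(2, "data", Types.StringType.get()));
    // Must match the Iceberg projection, but may differ in ways Iceberg considers
    // equivalent (e.g. a shredded variant representation or specific struct classes).
    StructType engineProjection = SparkSchemaUtil.convert(projection);
    return builder
        .project(projection)            // Iceberg projection (required)
        .engineSchema(engineProjection) // engine representation of the projection
        .build();
  }
}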
I think a good example of how this is used comes from the write path, where smallint may be passed to an int field. Here, the engine may want to use a bigint value even though the Iceberg schema is an int.
> builder won't do anything to create an engine schema from an Iceberg schema.

I wanted to highlight that setting the outputSchema is not mandatory. If it is not provided, the builder will generate a default representation.
I agree that better doc is needed.
I will reword and add the examples.
I think projection is required anyway, so I updated the javadoc accordingly.
Updated the javadoc. Please take another look.
Thanks!
 * Sets the input schema accepted by the writer. If not provided derived from the {@link
 * #schema(Schema)}.
 */
WriteBuilder<D, S> inputSchema(S schema);
I went into a bit of detail on the naming and javadoc for the equivalent method in the read builder. We should do the same things here:
- Consider using engineSchema rather than "input"
- Clarify the javadoc: this is the engine schema describing the rows passed to the writer, which may be more specific than the Iceberg schema (for example, tinyint, smallint, and int may be passed when the Iceberg type is int).
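A small hypothetical sketch of the second point, assuming the rename to engineSchema discussed here and the schema(Schema)/build() methods from the quoted WriteBuilder; the Spark types are just one example of an engine schema that is more specific than the Iceberg schema:

import java.io.IOException;
import org.apache.iceberg.Schema;
import org.apache.iceberg.io.FileAppender;
import org.apache.iceberg.types.Types;
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Hypothetical sketch: the engine schema describes the rows handed to the writer and may
// be more specific than the Iceberg schema (here ShortType values for an int column).
class EngineSchemaWriteSketch {
  static FileAppender<InternalRow> newAppender(WriteBuilder<InternalRow, StructType> builder)
      throws IOException {
    Schema icebergSchema =
        new Schema(Types.NestedField.required(1, "id", Types.IntegerType.get()));
    StructType engineSchema = new StructType().add("id", DataTypes.ShortType, false);
    return builder
        .schema(icebergSchema)      // Iceberg write schema
        .engineSchema(engineSchema) // engine representation of the incoming rows
        .build();
  }
}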
Updated the javadoc here too.
Please check
/**
 * Sets the input schema accepted by the writer. If not provided derived from the {@link
 * #rowSchema(Schema)}.
This isn't true, is it? How would the engine schema be derived from the row schema? The Iceberg schema can be derived from it, but this one can't.
Bad wording. The accepted representation will be derived from the schema.
Copied the javadoc from the WriteBuilder
Please check
EqualityDeleteWriteBuilder<D, S> rowSchema(Schema rowSchema);

/** Sets the equality field ids for the equality delete writer. */
default EqualityDeleteWriteBuilder<D, S> equalityFieldIds(List<Integer> fieldIds) {
Shouldn't this default go the other way? Usually we translate varargs versions into lists for the actual implementations.
In FileMetadata, equality fields are stored as int[], so I used the target type directly to avoid unnecessary conversions.
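For reference, a minimal sketch of the convention being referred to (a standalone illustration, not the PR's interface): the varargs overload is the defaulted convenience method and delegates to the List-based one that implementations override.

import java.util.Arrays;
import java.util.List;

// Illustration of the usual direction: implementations override the List variant,
// and the varargs variant is a default that just converts and delegates.
interface HasEqualityFieldIds<B> {
  B equalityFieldIds(List<Integer> fieldIds);

  default B equalityFieldIds(Integer... fieldIds) {
    return equalityFieldIds(Arrays.asList(fieldIds));
  }
}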
WriteBuilder<D, S> withAADPrefix(ByteBuffer aadPrefix);

/** Finalizes the configuration and builds the {@link FileAppender}. */
FileAppender<D> build() throws IOException;
Why does this throw an IOException when the read path does not?
When file creation fails, ORC wraps the IOException in a RuntimeIOException. Parquet and Avro surface the original IOException. I chose to follow the Parquet and Avro behavior and leave this unchanged.
 * Sets the input schema accepted by the writer. If not provided derived from the {@link
 * #schema(Schema)}.
 */
DataWriteBuilder<D, S> inputSchema(S schema);
This is not derived from the write schema, either. And this can be moved to the common write methods, right? I don't see a difference between the two.
Updated the name and the javadoc based on the comments on the WriteBuilder.
We can't move this to the common methods, as we don't have this in the PositionDeleteWriteBuilder
format,
builder -> {
  Preconditions.checkArgument(
      builder.spec != null, "Spec must not be null when creating position delete writer");
Minor: Iceberg error messages should be short and clear. In cases like this, we typically use "Invalid partition spec for position delete writer: null". Also, "spec" is not specific enough without additional context, so we should always use "partition spec" in cases like this. Other places may not need that where it is clear, as is the case for methods like spec(PartitionSpec).
Updated. Please check, as I have updated all of the error messages in FileWriterBuilderImpl.
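A tiny sketch of the agreed message style, with illustrative names only (the actual check lives in the writer builder discussed above):

import org.apache.iceberg.PartitionSpec;
import org.apache.iceberg.relocated.com.google.common.base.Preconditions;

// Illustrative only: a short error message naming the invalid value, with "partition spec"
// spelled out because plain "spec" lacks context in an exception message.
class PositionDeleteValidationSketch {
  static void checkSpec(PartitionSpec spec) {
    Preconditions.checkArgument(
        spec != null, "Invalid partition spec for position delete writer: null");
  }
}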
private static <D, S> FileWriterBuilder<EqualityDeleteWriter<D>, S> forEqualityDelete(
    WriteBuilder<D, S> writeBuilder, String location, FileFormat format, int[] equalityFieldIds) {
  return new FileWriterBuilderImpl<>(
      writeBuilder.content(FileContent.EQUALITY_DELETES),
Why not add content to the builder implementation so that it has access to this and can pass it through?
Then you could also have a validation switch inside the builder class to run validations rather than using a callback that accesses fields directly from outside the class.
I moved the class to its own file, and moved the static creation methods with it, so there are no calls from outside the class.
I prefer to keep the validation and the creation logic for the specific content types in one place, so I kept this as it is. Please leave a note if you still strongly prefer separating them out.
private FormatModelRegistry() {}

private static class FileWriterBuilderImpl<W extends FileWriter<?, ?>, D, S>
I'd prefer to have this as a top-level class that is package-private. I don't think making this a nested class is helping much, and I think it is also allowing an odd pattern with the builderMethod callback that accesses private fields.
Moved FileWriterBuilderImpl to its own file, and moved the builder methods with it, so the private calls still remain in the same file.
} catch (NoSuchMethodException e) {
  // failing to register a factory is normal and does not require a stack trace
  LOG.info(
      "Skip registration of {}. Likely the jar is not in the classpath", classToRegister);
I'd probably go with the same message that is used by InternalData:
LOG.info("Unable to register model for (record-class) and (format): {}", e.getMessage());It is sometimes nice to also include something to look for, the library itself should not speculate about what went wrong. If you want to include this, then it should be "Check the classpath" or "Check that the Jar for %s is in the classpath". It should not state that some cause is "likely".
Also, a statement like "skip registration" or "skipping registration" should be more clear that registration failed. It was not skipped due to a warning, there was an error.
This is how I ended up:
LOG.info(
"Unable to call register for ({}). Check for missing jars on the classpath: {}",
classToRegister,
e.getMessage());
private FormatModelRegistry() {}

private static class FileWriterBuilderImpl<W extends FileWriter<?, ?>, D, S>
I don't think that the type params are quite right here. The row type of FileWriter should be D, right? That means that this should probably be FileWriterBuilderImpl<D, S, W extends FileWriter<D, ?>>, right? And it seems suspicious that we aren't correctly carrying through the R param of FileWriter, too. This could probably be parameterized by R since it is determined by the returned writer type.
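A sketch of the suggested parameterization (illustrative only; the PR's actual class differs):

import org.apache.iceberg.io.FileWriter;

// Illustrative only: carry the row type D and the result type R through to the returned
// writer instead of erasing them with FileWriter<?, ?> wildcards.
abstract class ParameterizedWriterBuilderSketch<D, S, R, W extends FileWriter<D, R>> {
  abstract W build();
}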
I left it like this, because it needs some ugly casting magic on the registry side:
FormatModel<PositionDelete<D>, ?> model =
(FormatModel<PositionDelete<D>, ?>) (FormatModel) modelFor(format, PositionDelete.class);
Updated the code based on your recommendation. Check if you like it this way better, or not.
 * @param <D> the output data type produced by the reader
 * @param <S> the type of the schema for the output data type
 */
public interface WriteBuilder<D, S> {
I know it's annoying to bring up naming, but as we get closer to the final version we still need to keep concepts and names distinct. It is now difficult to distinguish between FileWriterBuilder and WriteBuilder that are both in the formats package. This one is arguably a "file writer builder".
Can we clarify these? Maybe ModelWriterBuilder? It seems we don't have the problem on the read path because the reader class is reused, right?
No problem.
Naming is one of the most important and hardest things to do correctly.
Renamed the interface.
Here is what the PR does:
- ReadBuilder - Builder for reading data from data files
- AppenderBuilder - Builder for writing data to data files
- ObjectModel - Provides ReadBuilders and AppenderBuilders for a specific data file format and object model pair
- AppenderBuilder - Builder for writing a file
- DataWriterBuilder - Builder for generating a data file
- PositionDeleteWriterBuilder - Builder for generating a position delete file
- EqualityDeleteWriterBuilder - Builder for generating an equality delete file
- ReadBuilder here - the file format reader builder is reused
- WriterBuilder class which implements the interfaces above (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder) based on a provided file format specific AppenderBuilder
- ObjectModelRegistry which stores the available ObjectModels, and from which engines and users can request the readers (ReadBuilder) and writers (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder)
- GenericObjectModels - for reading and writing Iceberg Records
- SparkObjectModels - for reading (vectorized and non-vectorized) and writing Spark InternalRow/ColumnarBatch objects
- FlinkObjectModels - for reading and writing Flink RowData objects
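As a rough end-to-end sketch of how an engine might use the registry described above: the ObjectModelRegistry.readBuilder call and its parameters are assumptions modeled on the readerBuilder(...) snippet earlier in the thread, not the merged API.

import org.apache.iceberg.FileFormat;
import org.apache.iceberg.Schema;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.io.InputFile;
import org.apache.spark.sql.catalyst.InternalRow;

// Hypothetical usage: look up the ReadBuilder registered for (format, object model),
// configure the Iceberg projection, and build the reader.
class ObjectModelRegistrySketch {
  static CloseableIterable<InternalRow> readRows(
      FileFormat format, InputFile file, Schema projection) {
    ReadBuilder<InternalRow, ?> builder =
        ObjectModelRegistry.readBuilder(format, InternalRow.class.getName(), file);
    return builder.project(projection).build();
  }
}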
ReadBuilder- Builder for reading data from data filesAppenderBuilder- Builder for writing data to data filesObjectModel- Providing ReadBuilders, and AppenderBuilders for the specific data file format and object model pairAppenderBuilder- Builder for writing a fileDataWriterBuilder- Builder for generating a data filePositionDeleteWriterBuilder- Builder for generating a position delete fileEqualityDeleteWriterBuilder- Builder for generating an equality delete fileReadBuilderhere - the file format reader builder is reusedWriterBuilderclass which implements the interfaces above (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder) based on a provided file format specificAppenderBuilderObjectModelRegistrywhich stores the availableObjectModels, and engines and users could request the readers (ReadBuilder) and writers (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder) from.GenericObjectModels- for reading and writing Iceberg RecordsSparkObjectModels- for reading (vectorized and non-vectorized) and writing Spark InternalRow/ColumnarBatch objectsFlinkObjectModels- for reading and writing Flink RowData objects