fix(arrow): iceberg writer improvements #713

badalprasadsingh · 2025-12-30T07:52:28Z

Description

Fixes:

now() in partition transformations
int to long case
integration test

Type of change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
This change requires a documentation update

How Has This Been Tested?

REST + Minio + Spark

Screenshots or Recordings

Documentation

Documentation Link: [link to README, olake.io/docs, or olake-docs]
N/A (bug fix, refactor, or test changes only)

Signed-off-by: badalprasadsingh <badal@datazip.io>

hash-data · 2026-01-06T13:16:03Z

destination/iceberg/arrow-writer/transforms.go

+		v := val.(int32)
+		return fmt.Sprintf("%d", v), v, nil
+	case "long":
+		v := val.(int64)
+		return fmt.Sprintf("%d", v), v, nil
+	case "float":
+		v := val.(float32)
+		return fmt.Sprintf("%g", v), v, nil
+	case "double":
+		v := val.(float64)
+		return fmt.Sprintf("%g", v), v, nil
+	case "string":


any reason to typecast and then convert? what is the difference between the previous and current approach?

previously we used to typecast it on the java side, inside a java function if you remember, which is not good

now we do it on the go side, in the transforms logic, and send the value over proto in its exact type to the java server
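
To illustrate the Go-side conversion described above, here is a minimal sketch of a function that returns both the string form and the typed value. The name `identityValue` and the comma-ok checks are assumptions added for illustration; the diff itself uses plain type assertions such as `val.(int32)`, which panic on a mismatch.

```go
package main

import "fmt"

// identityValue is a hypothetical helper: given an Iceberg column type and a
// value, it returns the string form, the value in its exact Go type, and an
// error when the dynamic type does not match the declared column type.
func identityValue(colType string, val any) (string, any, error) {
	switch colType {
	case "int":
		v, ok := val.(int32)
		if !ok {
			return "", nil, fmt.Errorf("expected int32, got %T", val)
		}
		return fmt.Sprintf("%d", v), v, nil
	case "long":
		v, ok := val.(int64)
		if !ok {
			return "", nil, fmt.Errorf("expected int64, got %T", val)
		}
		return fmt.Sprintf("%d", v), v, nil
	case "double":
		v, ok := val.(float64)
		if !ok {
			return "", nil, fmt.Errorf("expected float64, got %T", val)
		}
		return fmt.Sprintf("%g", v), v, nil
	default:
		return "", nil, fmt.Errorf("unsupported column type %q", colType)
	}
}

func main() {
	s, v, err := identityValue("long", int64(42))
	fmt.Println(s, v, err) // 42 42 <nil>
}
```

The typed value can then be placed into the matching proto field without any further casting on the receiving side.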

hash-data · 2026-01-08T07:56:06Z

destination/iceberg/arrow-writer/writer.go

 		}

-		w.writers.Delete(fileType + ":" + partitionKey)
+		delete(w.writers, fileType+":"+partitionKey)


instead of concatenating with ":" inline, let us create a function to get the fileKey
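
A minimal sketch of what such a helper could look like. The name `fileKey` comes from the comment above; the signature and the string-valued map are assumptions used only to keep the example self-contained.

```go
package main

import "fmt"

// fileKey is a hypothetical helper that builds the writers-map key from a
// file type and a partition key, so the ":" separator lives in one place.
func fileKey(fileType, partitionKey string) string {
	return fileType + ":" + partitionKey
}

func main() {
	// A plain map stands in for w.writers here.
	writers := map[string]string{}
	writers[fileKey("data", "2009-11")] = "writer-1"

	delete(writers, fileKey("data", "2009-11"))
	fmt.Println(len(writers)) // 0
}
```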

hash-data · 2026-01-08T09:37:37Z

destination/iceberg/arrow-writer/writer.go

+			var pValues []any
+			if len(w.partitionInfo) != 0 {
+				values, err := w.getRecordPartitionValues(rec, olakeTimestamp)
+				if err != nil {
+					return nil, nil, err
+				}


let us discuss the reason for adding it

from transforms logic, we need two types of transform values:
(a) for generating the partition key for file path, e.g., 2009-11 (kind of human readable string format), and,
(b) partition value to add in the iceberg table, e.g., 478 which goes to the manifests

so this code snippet extracts the partition values only once per partition, instead of recomputing them for every record

hash-data · 2026-01-08T09:53:40Z

destination/iceberg/arrow-writer/writer.go

-		if closeErr := writer.currentWriter.Close(); closeErr != nil {
-			err = fmt.Errorf("failed to close writer: %s", closeErr)
-			return false
+	for mapKey, writer := range w.writers {


Suggested change

for mapKey, writer := range w.writers {

for pKey, writer := range w.writers {

hash-data · 2026-01-08T09:56:55Z

destination/iceberg/arrow-writer/transforms.go

+			pv.Value = &proto.ArrowPayload_FileMetadata_PartitionValue_StringValue{StringValue: v}
+		case bool:
+			// Booleans stored as string "true"/"false" per Iceberg convention
+			pv.Value = &proto.ArrowPayload_FileMetadata_PartitionValue_StringValue{StringValue: fmt.Sprintf("%t", v)}


what about timestamp as partition value?

not required.

hash-data · 2026-01-08T10:03:15Z

...java-writer/src/main/java/io/debezium/server/iceberg/tableoperator/IcebergTableOperator.java


     public void accumulateDeleteFiles(String threadId, Table table, String filePath, int equalityFieldId,
-               long recordCount, List<String> partitionValues) {
-          if (table == null) {


any reason ?

well, I couldn't think of a case where the table would be null here. This is called in the register and commit proto cases, and we always either load or create the table before proceeding with any iceberg writer

hash-data · 2026-01-08T10:03:49Z

...java-writer/src/main/java/io/debezium/server/iceberg/tableoperator/IcebergTableOperator.java

+                    case LONG_VALUE -> protoValue.getLongValue();
+                    case FLOAT_VALUE -> protoValue.getFloatValue();
+                    case DOUBLE_VALUE -> protoValue.getDoubleValue();
+                    case STRING_VALUE -> protoValue.getStringValue();


timestamp not available

we don't need to handle a timestamp value here: these are the partition values added to iceberg, and a timestamp gets converted into the required int type by the iceberg transformation (day, hour, etc.)

hash-data · 2026-01-08T10:08:27Z

utils/testutils/test_utils.go

+								for _, useArrowWriter := range []bool{false, true} {
+									writerType := utils.Ternary(useArrowWriter, "Arrow", "Legacy").(string)
+									t.Run(fmt.Sprintf("Iceberg (%s) Full load + CDC tests", writerType), func(t *testing.T) {
+										if err := cfg.testIcebergWriter(ctx, t, c, currentTestTable, useArrowWriter, cfg.testIcebergFullLoadAndCDC); err != nil {
+											t.Fatalf("Iceberg (%s) Full load + CDC tests failed: %v", writerType, err)
+										}
+									})


this looks a little weird, can you check if there is another way?

hash-data · 2026-01-08T10:09:00Z

utils/typeutils/datatype.go

 		return types.Int32
-	case reflect.Int64, reflect.Uint64:
+	// on standard 64 bit systems, golang's int type is 64 bits
+	case reflect.Int, reflect.Int64, reflect.Uint, reflect.Uint64:


how did we check this?

well, the test script which produced this error adds 99991199 as a value:

a := 99991199 // var a int
r2 := types.CreateRawRecord("2", map[string]any{"name": "rohan", "age": 1, "contact": 99991199, "email": nil}, "c", nil)

now, go's int type can be either 32 or 64 bits wide depending on the platform, and on 64-bit systems it is 64 bits:

The int, uint, and uintptr types are implementation-specific. On 64-bit systems, int is 64 bits.

we didn't face this issue with the legacy writer because the java engine itself created the parquet files there. here we first create the arrow memory vector buffers (basically the schema) and then write the value, which is why we receive this error:

org.apache.iceberg.arrow.vectorized.VectorizedArrowReader.allocateVectorBasedOnOriginalType(VectorizedArrowReader.java:279)
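
The behaviour can be reproduced in isolation: an untyped integer constant stored in a `map[string]any` has the dynamic type `int`, so its `reflect.Kind` is `reflect.Int` rather than `reflect.Int32` or `reflect.Int64`. The sketch below uses a hypothetical `integerWidth` function as a stand-in for the type detection; the list of 32-bit kinds is an assumption, and only the 64-bit case mirrors the diff.

```go
package main

import (
	"fmt"
	"reflect"
)

// integerWidth is a hypothetical stand-in for the datatype detection: it
// reports the bit width an integer value would be mapped to, or 0 otherwise.
func integerWidth(v any) int {
	if v == nil {
		return 0
	}
	switch reflect.TypeOf(v).Kind() {
	// Assumed list of kinds that fit in 32 bits.
	case reflect.Int8, reflect.Int16, reflect.Int32, reflect.Uint8, reflect.Uint16:
		return 32
	// Mirrors the changed case in the diff: int and uint are treated as 64-bit.
	case reflect.Int, reflect.Int64, reflect.Uint, reflect.Uint64:
		return 64
	default:
		return 0
	}
}

func main() {
	record := map[string]any{"name": "rohan", "age": 1, "contact": 99991199, "email": nil}
	fmt.Println(reflect.TypeOf(record["contact"]).Kind()) // int
	fmt.Println(integerWidth(record["contact"]))          // 64
	fmt.Println(integerWidth(record["email"]))            // 0
}
```

Without the `reflect.Int` and `reflect.Uint` cases, a plain `int` would fall through to whatever the default branch returns, which is consistent with the schema mismatch described above.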

Signed-off-by: badalprasadsingh <badal@datazip.io>

badalprasadsingh · 2026-01-09T14:16:20Z

Currently in arrow-writer both _olake_timestamp and _cdc_timestamp are in microsecond precision.

badalprasadsingh added 2 commits December 30, 2025 13:10

chore: transforms

95f626b

Signed-off-by: badalprasadsingh <badal@datazip.io>

add: integration tests

5e4bccd

Signed-off-by: badalprasadsingh <badal@datazip.io>

badalprasadsingh changed the title ~~fix(Arrow): iceberg writer improvements~~ fix(arrow): iceberg writer improvements Dec 30, 2025

badalprasadsingh added 2 commits December 30, 2025 13:26

fix: minor lint issue

6efa888

Signed-off-by: badalprasadsingh <badal@datazip.io>

fix: now() in partition regex

b852156

Signed-off-by: badalprasadsingh <badal@datazip.io>

badalprasadsingh marked this pull request as draft December 30, 2025 08:02

badalprasadsingh added 7 commits December 30, 2025 17:22

chore: minor java side refractoring

991f96e

Signed-off-by: badalprasadsingh <badal@datazip.io>

chore: iceberg table operator

e7ff7bb

Signed-off-by: badalprasadsingh <badal@datazip.io>

fix(datatype): use int64 for golang's int and uint types

2b79ad1

Signed-off-by: badalprasadsingh <badal@datazip.io>

fix: unit tests

9d76efd

Signed-off-by: badalprasadsingh <badal@datazip.io>

fix: unit tests

794cb3a

Signed-off-by: badalprasadsingh <badal@datazip.io>

fix: minor for now() based partition

0a0c775

Signed-off-by: badalprasadsingh <badal@datazip.io>

fix: integration test

b499860

Signed-off-by: badalprasadsingh <badal@datazip.io>

badalprasadsingh marked this pull request as ready for review January 2, 2026 21:37

badalprasadsingh added 4 commits January 3, 2026 15:08

Merge branch 'staging' into feat/arrow-improvements

846e717

fix: refractoring

94453cc

Signed-off-by: badalprasadsingh <badal@datazip.io>

fix: minor lint issues

64fad84

Signed-off-by: badalprasadsingh <badal@datazip.io>

Merge branch 'staging' into feat/arrow-improvements

fb1aeb1

hash-data reviewed Jan 8, 2026

badalprasadsingh added 2 commits January 9, 2026 14:51

minor refractoring

7fffe01

Signed-off-by: badalprasadsingh <badal@datazip.io>

fix: minor lint issues

6dd26a9

Signed-off-by: badalprasadsingh <badal@datazip.io>

3 participants
