
Conversation

@ImDoubD-datazip (Collaborator) commented Dec 22, 2025

Description

This PR adds DB2 LUW as a source connector. It currently supports two sync modes:

  • Full Refresh
  • Incremental

Chunking is done via primary keys when they are present; otherwise, RID-based chunking is used for tables without primary keys.
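For illustration, here is a minimal sketch of how one RID chunk becomes a backfill query. The helper name is hypothetical and the boundary values come from the chunking step; the query shape matches the backfill query that appears in the sync logs later in this thread.

```go
package main

import "fmt"

// buildRIDChunkQuery renders the backfill query for one RID chunk.
// Hypothetical helper; min/max are RID() boundaries from the chunker.
func buildRIDChunkQuery(schema, table string, min, max int64) string {
	fq := fmt.Sprintf(`"%s"."%s"`, schema, table)
	return fmt.Sprintf("SELECT * FROM %s WHERE RID(%s) >= %d AND RID(%s) < %d", fq, fq, min, fq, max)
}

func main() {
	fmt.Println(buildRIDChunkQuery("DB2INST1", "ALL_DB2_TYPES", 6, 7))
}
```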

Prerequisite

To run the DB2 LUW connector, the IBM Data Server Driver for ODBC and CLI must be installed on the machine.

Steps to run DB2 LUW on your machine:

# IBM DB2 CLI environment
export IBM_DB_HOME=/pathto/clidriver
export PATH=$IBM_DB_HOME/bin:$PATH
export CGO_CFLAGS="-I$IBM_DB_HOME/include"
export CGO_LDFLAGS="-L$IBM_DB_HOME/lib -Wl,-rpath,$IBM_DB_HOME/lib"
export DYLD_LIBRARY_PATH=$IBM_DB_HOME/lib
  • Then run the discover command and sync.

In this PR, the base Alpine image has been changed to debian:bookworm-slim for better support of DB2 and other database drivers.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

This has been tested on a DB2 LUW VM instance; both full refresh and incremental sync modes were verified.

Documentation

  • Documentation Link: [link to README, olake.io/docs, or olake-docs]
  • N/A (bug fix, refactor, or test changes only)

Related PRs (if any):

SELECT
    TRIM(TABSCHEMA) AS table_schema,
    TRIM(TABNAME) AS table_name
FROM SYSCAT.TABLES
WHERE TYPE IN ('T', 'V')
Collaborator

Should we select views as well?

Collaborator Author

Will ask product and make changes accordingly.

Comment on lines 73 to 77
err := d.client.QueryRowContext(ctx, existsQuery).Scan(&hasRows)

if err != nil {
return nil, fmt.Errorf("failed to check if table has rows: %s", err)
}
Collaborator
Suggested change (removes the blank line):

err := d.client.QueryRowContext(ctx, existsQuery).Scan(&hasRows)
if err != nil {
	return nil, fmt.Errorf("failed to check if table has rows: %s", err)
}

return chunks, nil
}
// split chunks via physical identifier RID()
splitViaRID := func(ctx context.Context, stream types.StreamInterface) (*types.Set[types.Chunk], error) {
Collaborator

Is it safe to use RID for chunking?

Collaborator Author

For tables without primary keys, I think we should use it: such tables are unlikely to have any indexed column, so RID-based chunking is the better option.

@vishalm0509 (Collaborator)

Column type: DBCLOB

  • In the database: (screenshot)
  • In the destination (Glue): (screenshot)

@vishalm0509 (Collaborator) commented Dec 30, 2025

Column type: VARBINARY

  • Mapped as "varbinary": types.String
  • Also column col_varbinary
  • In the database: (screenshot)
  • In the destination: (screenshot)

@vishalm0509 (Collaborator)

  • Column name: col_long_vargraphic
  • Column type: LONG VARGRAPHIC
  • Not mapped in OLake
  • In the database: (screenshot)
  • In the destination: (screenshot)

@vishalm0509 (Collaborator)

Incremental test

Table: DB2_ALL_DATATYPES

"sync_mode": "incremental",
"cursor_field": "COL_TIMESTAMP:COL_TIME",
(screenshot)

I checked col_bigInt; it works fine. We need to check timestamp-based columns, similar to Oracle.

switch v := cursorValue.(type) {
case time.Time:
if a.driver.Type() == string(constants.DB2) {
return v.Format("2006-01-02 15:04:05.000000")
Collaborator

Isn't there a timestamp-aware format for DB2?

Collaborator Author

This is the format we require for DB2 timestamps saved in state.
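For reference, a minimal Go round-trip of this layout (values are illustrative; db2StateLayout is a hypothetical name for the constant used in the diff above):

```go
package main

import (
	"fmt"
	"time"
)

// Layout used to persist DB2 timestamps in state, matching the
// Format call in the diff above.
const db2StateLayout = "2006-01-02 15:04:05.000000"

func main() {
	ts := time.Date(2024, 1, 1, 10, 15, 30, 123456000, time.UTC)
	s := ts.Format(db2StateLayout)
	fmt.Println(s) // 2024-01-01 10:15:30.123456

	back, err := time.Parse(db2StateLayout, s)
	if err != nil {
		panic(err)
	}
	fmt.Println(back.Equal(ts)) // true
}
```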

@vishalm0509 (Collaborator)

Column type: DBCLOB

  • In the database: (screenshot)
  • In the destination (Glue): (screenshot)

This is still the case in Glue.

@vishalm0509 (Collaborator)

  • Column name: col_long_vargraphic
  • Column type: LONG VARGRAPHIC
  • Not mapped in OLake
  • In the database: (screenshot)
  • In the destination: (screenshot)

Still not resolved.

@vishalm0509 (Collaborator)

col_timestamp

  • Database: 2024-01-01-10.15.30.123456
  • Destination: 2024-01-01 10:15:30.123000 UTC

Please check this as well.

@vishalm0509 (Collaborator)

Column: CHAR_ONE

  • Datatype: CHARACTER
  • Database: CHAR_ONE
  • Destination: "Q0hBUl9PTkUgIA=="

This as well.

@vishalm0509 (Collaborator)

col_time (also col_date)

  • Datatype: TIME
  • Database: 10:15:30
  • Destination: 0001-01-01 10:15:30 +0000 UTC

"Same case with MSSQL (SQL Server) also"

Database: 14:20:30
Destination: 0001-01-01 14:20:30 +0553 LMT

@vishalm0509 (Collaborator)

Incremental test

Table: DB2_ALL_DATATYPES

"sync_mode": "incremental",
"cursor_field": "COL_TIMESTAMP:COL_TIME",
(screenshot)

I checked `col_bigInt`; it works fine. We need to check `timestamp`-based columns, similar to Oracle.

This issue is still there.

fi
;;
"Linux")
download_url="https://public.dhe.ibm.com/ibmdl/export/pub/software/data/db2/drivers/odbc_cli/linuxx64_odbc_cli.tar.gz"
Contributor

So a Linux arm64 driver doesn't exist. Is that why we skip that case?

Collaborator Author

Yes.

build.sh (outdated)
Comment on lines 102 to 106
# Clean up any partial downloads from the failed go installer
rm -rf "$install_dir/clidriver" 2>/dev/null
rm -f "$install_dir"/*.tar.gz 2>/dev/null
rm -f "$install_dir"/*.zip 2>/dev/null

Contributor

Let's just do curl. Keep it simple silly!

Collaborator Author

OK.

@vaibhav-datazip (Collaborator) left a comment

Tested both full-refresh and incremental mode on OLake-CLI:

  • using only the primary cursor
  • using the fallback cursor as well
  • using float, int, string, and timestamp as cursor values
  • filtering using string, timestamp, and int

}

if hasRows {
return nil, fmt.Errorf("stats not populated for table[%s]. Please run command:\tRUNSTATS ON TABLE %s.%s WITH DISTRIBUTION AND DETAILED INDEXES ALL;\t to update table statistics", stream.ID(), stream.Namespace(), stream.Name())
Collaborator

Instead of writing "Please run command", you can say "Please run CLP command:".


func (d *DB2) splitTableIntoChunks(ctx context.Context, stream types.StreamInterface) (*types.Set[types.Chunk], error) {
// split chunks via primary key
splitViaPrimaryKey := func(ctx context.Context, stream types.StreamInterface) (*types.Set[types.Chunk], error) {
Collaborator

I tried syncing the following table with 3 records:

CREATE TABLE ALL_DB2_TYPES (
    COL_SMALLINT       SMALLINT,
    COL_INTEGER        INTEGER,
    COL_BIGINT         BIGINT,
    COL_DECIMAL        DECIMAL(10,2),
    COL_NUMERIC        NUMERIC(8,4),
    COL_REAL           REAL,
    COL_DOUBLE         DOUBLE,
    COL_DECFLOAT16     DECFLOAT(16),
    COL_DECFLOAT34     DECFLOAT(34),
    COL_CHAR10         CHAR(10),
    COL_VARCHAR50      VARCHAR(50),
    COL_VARGRAPHIC50   VARGRAPHIC(50),
    COL_GRAPHIC10      GRAPHIC(10),
    COL_LONGVARCHAR    LONG VARCHAR,
    COL_LONGVARGRAPHIC LONG VARGRAPHIC,
    COL_CHAR_BIT       CHAR(10) FOR BIT DATA,
    COL_VARCHAR_BIT    VARCHAR(20) FOR BIT DATA,
    COL_VARBINARY      VARBINARY(50),
    COL_DATE           DATE,
    COL_TIME           TIME,
    COL_TIMESTAMP      TIMESTAMP,
    COL_BOOLEAN        BOOLEAN,
    COL_CLOB           CLOB(1M),
    COL_DBCLOB         DBCLOB(500K),
    COL_BLOB           BLOB(500K),
    COL_XML            XML
);

I got the following error while syncing:

2026-01-09T09:06:38Z DEBUG Starting backfill for DB2INST1.ALL_DB2_TYPES with chunk {6 7} using query: SELECT * FROM "DB2INST1"."ALL_DB2_TYPES" WHERE RID("DB2INST1"."ALL_DB2_TYPES") >= 6 AND RID("DB2INST1"."ALL_DB2_TYPES") < 7
2026-01-09T09:06:38Z INFO Sync completed, wait 5 seconds cleanup in progress...
2026-01-09T09:06:43Z FATAL error occurred while reading records: error occurred while waiting for connections: thread[DB2INST1.ALL_DB2_TYPES_01KEH03WW4FHZKVDDVHQFEKQ43]: failed to insert chunk min[%!s(int64=4)] and max[%!s(int64=6)] of stream DB2INST1.ALL_DB2_TYPES, insert func error: %!s(<nil>), thread error: failed to flush data while closing: failed to write records: failed to send batch: rpc error: code = Internal desc = grpc: error while marshaling: string field contains invalid UTF-8

Collaborator Author

The data in the database contains invalid UTF-8 values.
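If sanitizing at read time is acceptable, one possible mitigation (a sketch only, not what the PR currently does) is to coerce string values to valid UTF-8 before they reach the gRPC writer:

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sanitizeUTF8 replaces invalid byte sequences with U+FFFD so that
// protobuf string marshaling does not fail. Hypothetical helper.
func sanitizeUTF8(s string) string {
	if utf8.ValidString(s) {
		return s
	}
	return strings.ToValidUTF8(s, "\uFFFD")
}

func main() {
	fmt.Println(sanitizeUTF8("ok\xffvalue")) // ok�value
}
```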

}

func (d *DB2) splitTableIntoChunks(ctx context.Context, stream types.StreamInterface) (*types.Set[types.Chunk], error) {
// split chunks via primary key
Collaborator

With 0 records, I tried syncing the following table:

(same ALL_DB2_TYPES table definition as above)

and got this error:

2026-01-09T08:56:32Z INFO Sync completed, wait 5 seconds cleanup in progress...
2026-01-09T08:56:37Z FATAL error occurred while reading records: error occurred while waiting for context groups: failed to get or split chunks: failed to get the min and max rid: sql: Scan error on column index 0, name "1": converting NULL to int64 is unsupported
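MIN/MAX over an empty table returns SQL NULL, which database/sql cannot scan into a plain int64. A sketch of one way to guard the chunker (names are hypothetical):

```go
package db2

import (
	"context"
	"database/sql"
	"fmt"
)

// minMaxRID scans MIN/MAX RID values with NULL-aware types so that an
// empty table yields ok=false instead of a Scan error.
func minMaxRID(ctx context.Context, db *sql.DB, query string) (int64, int64, bool, error) {
	var lo, hi sql.NullInt64
	if err := db.QueryRowContext(ctx, query).Scan(&lo, &hi); err != nil {
		return 0, 0, false, fmt.Errorf("failed to get the min and max rid: %w", err)
	}
	if !lo.Valid || !hi.Valid { // zero rows: nothing to chunk
		return 0, 0, false, nil
	}
	return lo.Int64, hi.Int64, true, nil
}
```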

Comment on lines +51 to +62
logger.Debugf("Starting backfill for %s with chunk %v using query: %s", stream.ID(), chunk, stmt)

reader := jdbc.NewReader(ctx, stmt, func(ctx context.Context, query string, queryArgs ...any) (*sql.Rows, error) {
return d.client.QueryContext(ctx, query, args...)
})

return reader.Capture(func(rows *sql.Rows) error {
record := make(types.Record)
if err := jdbc.MapScan(rows, record, d.dataTypeConverter); err != nil {
return fmt.Errorf("failed to scan record data as map: %s", err)
}
return OnMessage(ctx, record)
Collaborator

What isolation level are we using here?

Collaborator Author

Read committed, or cursor stability (CS) as DB2 calls it.
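For reference, cursor stability is DB2's default isolation level. If it ever needs to be pinned explicitly, a sketch using the CURRENT ISOLATION special register follows; this register is per-connection, so it must run on a dedicated *sql.Conn, and this is not what the connector currently does:

```go
package db2

import (
	"context"
	"database/sql"
	"fmt"
)

// setCursorStability pins one connection to DB2's cursor stability (CS),
// i.e. read committed. Hypothetical helper; the connector relies on the
// server default rather than setting this explicitly.
func setCursorStability(ctx context.Context, conn *sql.Conn) error {
	if _, err := conn.ExecContext(ctx, "SET CURRENT ISOLATION = CS"); err != nil {
		return fmt.Errorf("failed to set isolation level: %w", err)
	}
	return nil
}
```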

"dbclob": types.String,

// date / time
"time": types.String,
Collaborator

The time I am seeing in DBeaver is different from Iceberg:

(screenshot: DBeaver)

In Iceberg:

(screenshot: Iceberg)

Collaborator Author (@ImDoubD-datazip, Jan 9, 2026)

In Iceberg it is stored as UTC: 10:00 becomes 4:30 (a -5:30 offset).
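Concretely, the same arithmetic in Go: a 10:00 value in a +05:30 zone renders as 04:30 in UTC.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// A +05:30 local time rendered in UTC: 10:00 becomes 04:30.
	ist := time.FixedZone("UTC+05:30", 5*3600+30*60)
	local := time.Date(2024, 1, 1, 10, 0, 0, 0, ist)
	fmt.Println(local.UTC()) // 2024-01-01 04:30:00 +0000 UTC
}
```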

"decfloat": types.Float64,

// boolean
"boolean": types.Bool,
Collaborator

There is no boolean type in DB2; found this while testing.

Collaborator Author

(screenshot)

}

if hasRows {
return nil, fmt.Errorf("stats not populated for table[%s]. Please run CLP command:\tRUNSTATS ON TABLE %s.%s WITH DISTRIBUTION AND DETAILED INDEXES ALL;\t to update table statistics", stream.ID(), stream.Namespace(), stream.Name())
Collaborator

LOB (CLOB, DBCLOB, BLOB) and XML columns don't support distribution statistics; running the suggested RUNSTATS command against such tables fails:

SQL2310N  The utility could not generate statistics.  Error "-668" was returned.

Please mention this in the docs as well.

"real": types.Float32,
"float": types.Float64,
"numeric": types.Float64,
"double": types.Float64,
Collaborator

Testing with the DECFLOAT34 datatype. In the database it was:

(screenshot)

In Iceberg it is:

(screenshot)

Is the scientific notation due to Spark? Is there a way we can get it as it is in the database?

reader := jdbc.NewReader(ctx, stmt, func(ctx context.Context, query string, queryArgs ...any) (*sql.Rows, error) {
return d.client.QueryContext(ctx, query, args...)
})

Collaborator

In some cases blank values are becoming NULL; in others they stay blank.

(screenshots)
