
feature: added support as source for ducklake #295

Open
Femi3211 wants to merge 1 commit into develop from feature/ducklake-integration

Conversation

Collaborator

@Femi3211 Femi3211 commented Jan 30, 2026

Summary by CodeRabbit

  • New Features
    • Added DuckLake database source support with configurable connection parameters (database path, data path, metadata database)
    • Implemented environment variable substitution for DuckLake configuration settings
    • Added schema extraction and validation for DuckLake databases, including table, view, and column discovery
    • Integrated dynamic module loading for extensible extractors
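The environment variable substitution called out above can be sketched as a small helper. This is a hypothetical illustration (the helper name, the regex, and the `${VAR}` syntax are assumptions), not the actual ConfigYmlConverter implementation:

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of ${VAR}-style substitution for connection fields such
// as ducklakeDataPath; the real ConfigYmlConverter may use a different scheme.
public class EnvSubstitution {
    private static final Pattern ENV_VAR = Pattern.compile("\\$\\{([A-Za-z_][A-Za-z0-9_]*)\\}");

    // Replace each ${NAME} with its value from env; unknown names are left as-is.
    static String substitute(String value, Map<String, String> env) {
        if (value == null) {
            return null;
        }
        Matcher m = ENV_VAR.matcher(value);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String replacement = env.getOrDefault(m.group(1), m.group(0));
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> env = Map.of("DUCKLAKE_DATA", "/mnt/lake/data");
        System.out.println(substitute("${DUCKLAKE_DATA}/ducklake_files", env));
    }
}
```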


@Femi3211 Femi3211 requested a review from nbesimi January 30, 2026 14:18

coderabbitai bot commented Jan 30, 2026

📝 Walkthrough

Walkthrough

This PR introduces DuckLake database support by adding three new configuration fields to the Connection model, implementing a DuckLakeGenerator class that orchestrates extraction from DuckDB via DuckLake integration, and adding factory routing logic to handle DuckLake connections as a distinct code path.

Changes

  • DuckLake Configuration Fields
    common/src/main/java/com/adaptivescale/rosetta/common/models/input/Connection.java
    Added three new DuckLake-specific fields (duckdbDatabasePath, ducklakeDataPath, ducklakeMetadataDb) with corresponding getters and setters.
  • Configuration Parameter Substitution
    cli/src/main/java/com/adaptivescale/rosetta/cli/ConfigYmlConverter.java
    Added environment variable substitution for the three new DuckLake fields within the config processing loop, mirroring the existing URL substitution logic.
  • DuckLake Generator Implementation
    source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java
    Introduced a DuckLakeGenerator class that orchestrates extraction: it validates configuration, establishes JDBC connections, manages DuckLake catalog setup, registers Parquet files as tables, dynamically loads extractors, and assembles Database objects. Includes helper methods for metadata handling, Parquet registration, dynamic module loading, and debug utilities.
  • Factory Routing for DuckLake
    source/src/main/java/com/adataptivescale/rosetta/source/core/SourceGeneratorFactory.java
    Added early-return logic that detects dbType == "ducklake" and instantiates DuckLakeGenerator, bypassing the standard extractor initialization that other database types use.
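The factory routing described above can be sketched roughly as follows. All class shapes here are simplified stand-ins, not the project's actual SourceGeneratorFactory, Connection, or DuckLakeGenerator signatures:

```java
// Simplified sketch of the early-return routing; the real factory also takes
// a driver provider, and the real classes live in separate files.
public class SourceRoutingSketch {
    interface Generator {}
    static class DuckLakeGenerator implements Generator {}
    static class DefaultGenerator implements Generator {}

    static class Connection {
        private final String dbType;
        Connection(String dbType) { this.dbType = dbType; }
        String getDbType() { return dbType; }
    }

    // DuckLake gets a dedicated code path before the standard extractor setup.
    static Generator sourceGenerator(Connection connection) {
        if ("ducklake".equalsIgnoreCase(connection.getDbType())) {
            return new DuckLakeGenerator();
        }
        return new DefaultGenerator();
    }

    public static void main(String[] args) {
        System.out.println(sourceGenerator(new Connection("ducklake")).getClass().getSimpleName());
        System.out.println(sourceGenerator(new Connection("postgres")).getClass().getSimpleName());
    }
}
```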

Sequence Diagram

sequenceDiagram
    participant Config as Configuration
    participant Factory as SourceGeneratorFactory
    participant DLGen as DuckLakeGenerator
    participant JDBC as DuckDB JDBC
    participant DL as DuckLake Catalog
    participant Extractors as Table/View/Column<br/>Extractors
    participant DB as Database Object

    Config->>Factory: sourceGenerator(connection, driverProvider)
    Factory->>Factory: Check dbType == "ducklake"
    Factory->>DLGen: new DuckLakeGenerator(driverProvider)
    Factory->>DLGen: generate(connection)
    
    DLGen->>DLGen: Validate ducklakeDataPath
    DLGen->>DLGen: Build DuckDB JDBC URL
    DLGen->>JDBC: Open connection
    
    DLGen->>DL: setupDuckLake()<br/>(install/load)
    DLGen->>DL: attachMetadata()
    DLGen->>DL: useCatalog()
    DLGen->>DL: registerParquetFiles()
    
    DLGen->>Extractors: Load Table Extractor<br/>(with fallback)
    DLGen->>Extractors: Load View Extractor<br/>(with fallback)
    DLGen->>Extractors: Load Column Extractor<br/>(with fallback)
    
    DLGen->>JDBC: Execute table extraction
    DLGen->>Extractors: Filter metadata tables
    DLGen->>JDBC: Execute view extraction
    DLGen->>Extractors: Extract columns
    
    DLGen->>DB: Assemble Database<br/>(name, tables, views, type)
    DLGen->>Factory: Return Database

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🦆 A duck waddled in with DuckLake so bright,
Parquet files nested, a magnificent sight,
Catalogs nested, extractors aligned,
Dynamic loading with fallbacks designed—
Rosetta now swims where the ducks congregate! 🌊

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the title accurately describes the main change, adding DuckLake as a data source, and clearly summarizes the primary objective of the pull request.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In
`@source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java`:
- Around line 243-248: The SQL string in DuckLakeGenerator directly interpolates
parquetFile.getAbsolutePath() into createTableSql which allows file path
injection via single quotes; sanitize the path before interpolation by escaping
single quotes (e.g., replace "'" with "''") and then use the escaped value when
building createTableSql (references: DuckLakeGenerator, variables parquetFile,
tableName, createStmt, createTableSql) so the generated CREATE TABLE ...
read_parquet('...') literal cannot break SQL syntax.
- Around line 300-344: The executeDebugSQL method currently logs full SQL and
result rows via log.info; change it to avoid leaking sensitive data by (1)
updating the method signature of executeDebugSQL (and any callers) to accept a
boolean flag like enableResultLogging (or use an existing debug flag on
Connection), (2) change the SQL statement log to a non-default level (TRACE or
DEBUG) using log.trace(...) and only emit it when tracing is enabled, and (3)
wrap the result-row logging block (the loop that builds rows and the log.info
calls) in a conditional that checks enableResultLogging before logging row
contents; keep setupDuckLake/connect logic but ensure no sensitive output is
logged when the flag is false (optionally move this utility to a test/debug-only
module if desired).
- Around line 190-208: The ATTACH/USE SQL builds raw SQL from user inputs
(catalogName, rosettaConnection.getDucklakeDataPath()) in DuckLakeGenerator
which allows SQL injection; validate and sanitize these inputs before use:
enforce a strict regex (e.g. only [A-Za-z0-9_]+) for catalogName and reject or
normalize any value that doesn't match, and validate ducklakeDataPath to ensure
it is a safe filesystem path (no semicolons, quotes, or SQL metacharacters) or
escape/quote it correctly; then update the attachSql construction and the
useStmt.execute call (the String.format building attachSql and the "USE " +
catalogName call) to use the validated/escaped values only, failing fast with a
clear exception if validation fails.
- Around line 353-356: The SQL built in importCsvToDuckLake interpolates
tableName and csvFilePath directly causing SQL injection; update
importCsvToDuckLake to validate and safely quote/escape inputs: validate
tableName against a strict identifier regex (e.g. [A-Za-z_][A-Za-z0-9_]* ) and
reject or sanitize otherwise, quote the identifier properly (escape any internal
quotes) rather than raw string concat, and treat csvFilePath as a data parameter
or at minimum escape/quote it (escape single quotes) before inserting into the
SQL string; ensure similar protection for catalogName if it can be controlled
externally and use prepared statements or the DB driver's identifier-quoting
helper where available.
🧹 Nitpick comments (4)
source/src/main/java/com/adataptivescale/rosetta/source/core/DuckLakeGenerator.java (4)

60-65: Unchecked casts could mask type mismatches.

The casts on lines 60 and 65 assume that tableExtractor.extract() and viewExtractor.extract() return Collection<Table> and Collection<View> respectively. While this aligns with the expected interface contracts, consider using generics on the TableExtractor and ViewExtractor interfaces to make this type-safe, or add explicit type checking.
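The generics suggestion could look roughly like this. The Extractor interface and the class bodies are simplified assumptions for illustration, not the project's actual types:

```java
import java.util.Collection;
import java.util.List;

// Sketch of type-safe extractor interfaces: each extractor declares the
// element type it returns, so callers need no unchecked cast.
public class TypedExtractorSketch {
    static class Table { final String name; Table(String n) { name = n; } }
    static class View { final String name; View(String n) { name = n; } }

    // Generic contract shared by all extractors.
    interface Extractor<T> {
        Collection<T> extract();
    }

    static class TableExtractor implements Extractor<Table> {
        public Collection<Table> extract() { return List.of(new Table("orders")); }
    }

    static class ViewExtractor implements Extractor<View> {
        public Collection<View> extract() { return List.of(new View("orders_v")); }
    }

    public static void main(String[] args) {
        Collection<Table> tables = new TableExtractor().extract(); // no cast needed
        Collection<View> views = new ViewExtractor().extract();
        System.out.println(tables.size() + " table(s), " + views.size() + " view(s)");
    }
}
```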


104-123: Consider pattern-based filtering for metadata tables.

The hardcoded list of DuckLake metadata tables may become stale as DuckLake evolves. Consider using a prefix-based filter (e.g., tables starting with ducklake_) which would be more maintainable.

♻️ Alternative: Prefix-based filtering
     private Collection<Table> filterDuckLakeMetadataTables(Collection<Table> allTables) {
-        Set<String> metadataTableNames = Set.of(
-            "ducklake_column", "ducklake_column_tag", ...
-        );
-
         Collection<Table> userTables = new ArrayList<>();
         for (Table table : allTables) {
-            if (!metadataTableNames.contains(table.getName())) {
+            if (!table.getName().startsWith("ducklake_")) {
                 userTables.add(table);
             }
         }
         return userTables;
     }

43-77: Consider try-with-resources for cleaner resource management.

The current try-finally pattern works but could be simplified using try-with-resources for java.sql.Connection.

♻️ Refactor to try-with-resources
         Driver driver = driverProvider.getDriver(tempConnection);
         Properties properties = JDBCUtils.setJDBCAuth(tempConnection);
-        java.sql.Connection connect = driver.connect(duckdbUrl, properties);
-
-        try {
+        try (java.sql.Connection connect = driver.connect(duckdbUrl, properties)) {
             String attachedCatalogAlias = setupDuckLake(connect, connection);
             // ... rest of method ...
             return database;
-        } finally {
-            connect.close();
         }

80-102: Code duplication with generate() method.

The validate() method duplicates the connection setup and DuckLake initialization logic from generate(). Consider extracting the common setup into a private helper method to improve maintainability.

♻️ Extract common setup logic
private java.sql.Connection createDuckLakeConnection(Connection connection) throws Exception {
    if (connection.getDucklakeDataPath() == null || connection.getDucklakeDataPath().trim().isEmpty()) {
        throw new IllegalArgumentException("ducklakeDataPath is required for DuckLake connections");
    }
    
    String duckdbUrl = buildDuckDbUrl(connection);
    Connection tempConnection = new Connection();
    tempConnection.setUrl(duckdbUrl);
    tempConnection.setDbType("duckdb");
    Driver driver = driverProvider.getDriver(tempConnection);
    Properties properties = JDBCUtils.setJDBCAuth(tempConnection);
    return driver.connect(duckdbUrl, properties);
}

Then use in both generate() and validate().

Comment on lines +190 to +208
        String attachSql = String.format(
            "ATTACH 'ducklake:%s' AS %s (DATA_PATH '%s');",
            metadataDb, catalogName, rosettaConnection.getDucklakeDataPath()
        );

        try {
            log.info("Attaching DuckLake catalog: {}", attachSql);
            stmt.execute(attachSql);
        } catch (SQLException e) {
            if (e.getMessage() != null && e.getMessage().contains("already exists")) {
                log.info("Catalog '{}' is already attached", catalogName);
            } else {
                throw e;
            }
        }

        try (Statement useStmt = connection.createStatement()) {
            useStmt.execute("USE " + catalogName + ";");
        }
⚠️ Potential issue | 🔴 Critical

SQL injection vulnerability in DuckLake setup.

The catalogName and ducklakeDataPath values are interpolated directly into SQL statements via String.format() without sanitization. These values originate from user configuration and could contain malicious SQL.

For example, a malicious databaseName like foo; DROP TABLE users; -- would be directly embedded in the ATTACH and USE statements.

Consider validating that catalogName contains only alphanumeric characters and underscores, or use identifier quoting appropriate for DuckDB.

🛡️ Proposed fix: Add input validation for catalog name
+    private static final java.util.regex.Pattern VALID_IDENTIFIER = 
+        java.util.regex.Pattern.compile("^[a-zA-Z_][a-zA-Z0-9_]*$");
+
     private String setupDuckLake(java.sql.Connection connection, Connection rosettaConnection) throws SQLException {
         Statement stmt = connection.createStatement();
         try {
             // ... existing code ...
             
             String catalogName = rosettaConnection.getDatabaseName();
             if (catalogName == null || catalogName.trim().isEmpty()) {
                 throw new IllegalArgumentException("databaseName is required for DuckLake connections");
             }
+            if (!VALID_IDENTIFIER.matcher(catalogName).matches()) {
+                throw new IllegalArgumentException("databaseName must be a valid SQL identifier (alphanumeric and underscores only)");
+            }
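As an alternative to the identifier allow-list above, DuckDB-style identifier quoting (wrap in double quotes, double any embedded quotes) can be sketched as a small helper; the method name is an assumption:

```java
// Sketch of SQL identifier quoting: wrap in double quotes and double any
// embedded double quotes, so the value is always treated as an identifier.
public class IdentifierQuoting {
    static String quoteIdentifier(String raw) {
        return "\"" + raw.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(quoteIdentifier("my_catalog"));
        System.out.println(quoteIdentifier("weird\"name"));
    }
}
```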

Comment on lines +243 to +248
                try {
                    String createTableSql = String.format(
                        "CREATE TABLE IF NOT EXISTS %s AS SELECT * FROM read_parquet('%s');",
                        tableName, parquetFile.getAbsolutePath()
                    );
                    createStmt.execute(createTableSql);
⚠️ Potential issue | 🟠 Major

File path injection in SQL query.

The absolute file path is interpolated directly into the SQL query without escaping. A specially crafted file path containing single quotes could break the SQL syntax or enable injection.

While less exploitable than the catalog name issue (requires file system access to create malicious filenames), consider escaping single quotes in the path.

🛡️ Proposed fix: Escape single quotes in file path
                 try {
+                    String escapedPath = parquetFile.getAbsolutePath().replace("'", "''");
                     String createTableSql = String.format(
                         "CREATE TABLE IF NOT EXISTS %s AS SELECT * FROM read_parquet('%s');",
-                        tableName, parquetFile.getAbsolutePath()
+                        tableName, escapedPath
                     );

Comment on lines +300 to +344
    // Helper method to execute SQL commands
    public static void executeDebugSQL(Connection connection, String sql) throws Exception {
        if (connection.getDucklakeDataPath() == null || connection.getDucklakeDataPath().trim().isEmpty()) {
            throw new IllegalArgumentException("ducklakeDataPath is required for DuckLake connections");
        }

        String duckdbUrl = connection.getDuckdbDatabasePath() != null && !connection.getDuckdbDatabasePath().trim().isEmpty()
                ? "jdbc:duckdb:" + connection.getDuckdbDatabasePath()
                : "jdbc:duckdb:";

        Connection tempConnection = new Connection();
        tempConnection.setUrl(duckdbUrl);
        tempConnection.setDbType("duckdb");
        Driver driver = new DriverManagerDriverProvider().getDriver(tempConnection);
        Properties properties = JDBCUtils.setJDBCAuth(tempConnection);
        java.sql.Connection connect = driver.connect(duckdbUrl, properties);

        try {
            DuckLakeGenerator generator = new DuckLakeGenerator(new DriverManagerDriverProvider());
            generator.setupDuckLake(connect, connection);

            try (Statement stmt = connect.createStatement()) {
                log.info("Executing SQL: {}", sql);
                boolean hasResults = stmt.execute(sql);
                if (hasResults) {
                    try (ResultSet rs = stmt.getResultSet()) {
                        log.info("Query returned results:");
                        int colCount = rs.getMetaData().getColumnCount();
                        while (rs.next()) {
                            StringBuilder row = new StringBuilder("  ");
                            for (int i = 1; i <= colCount; i++) {
                                if (i > 1) row.append(" | ");
                                row.append(rs.getString(i));
                            }
                            log.info(row.toString());
                        }
                    }
                } else {
                    log.info("SQL executed successfully. Rows affected: {}", stmt.getUpdateCount());
                }
            }
        } finally {
            connect.close();
        }
    }
⚠️ Potential issue | 🟠 Major

Debug utility logs potentially sensitive data.

The executeDebugSQL method logs:

  • The full SQL query (line 322)
  • Complete query results row by row (line 334)

In production, this could inadvertently expose sensitive data (credentials, PII) to logs. Consider:

  1. Restricting log level to TRACE/DEBUG
  2. Adding a flag to disable result logging
  3. Moving these utilities to a separate debug/test module if not needed in production
🛡️ Proposed fix: Use TRACE level for sensitive output
             try (Statement stmt = connect.createStatement()) {
-                log.info("Executing SQL: {}", sql);
+                log.debug("Executing SQL: {}", sql);
                 boolean hasResults = stmt.execute(sql);
                 if (hasResults) {
                     try (ResultSet rs = stmt.getResultSet()) {
-                        log.info("Query returned results:");
+                        log.trace("Query returned results:");
                         int colCount = rs.getMetaData().getColumnCount();
                         while (rs.next()) {
                             StringBuilder row = new StringBuilder("  ");
                             for (int i = 1; i <= colCount; i++) {
                                 if (i > 1) row.append(" | ");
                                 row.append(rs.getString(i));
                             }
-                            log.info(row.toString());
+                            log.trace(row.toString());
                         }
                     }
                 } else {
-                    log.info("SQL executed successfully. Rows affected: {}", stmt.getUpdateCount());
+                    log.debug("SQL executed successfully. Rows affected: {}", stmt.getUpdateCount());
                 }

Comment on lines +353 to +356
        String sql = String.format(
            "USE %s; CREATE TABLE %s AS SELECT * FROM read_csv_auto('%s');",
            catalogName, tableName, csvFilePath
        );
⚠️ Potential issue | 🔴 Critical

SQL injection in importCsvToDuckLake.

The tableName and csvFilePath parameters are interpolated directly into SQL without validation or escaping. This has the same injection risks as noted in setupDuckLake.

🛡️ Proposed fix: Validate table name and escape file path
+    private static final java.util.regex.Pattern VALID_IDENTIFIER = 
+        java.util.regex.Pattern.compile("^[a-zA-Z_][a-zA-Z0-9_]*$");
+
     public static void importCsvToDuckLake(Connection connection, String csvFilePath, String tableName) throws Exception {
         String catalogName = connection.getDatabaseName();
         if (catalogName == null || catalogName.trim().isEmpty()) {
             throw new IllegalArgumentException("databaseName must be set to the DuckLake catalog name");
         }
+        if (!VALID_IDENTIFIER.matcher(tableName).matches()) {
+            throw new IllegalArgumentException("tableName must be a valid SQL identifier");
+        }
+        if (!VALID_IDENTIFIER.matcher(catalogName).matches()) {
+            throw new IllegalArgumentException("catalogName must be a valid SQL identifier");
+        }
+        String escapedCsvPath = csvFilePath.replace("'", "''");

         String sql = String.format(
             "USE %s; CREATE TABLE %s AS SELECT * FROM read_csv_auto('%s');",
-            catalogName, tableName, csvFilePath
+            catalogName, tableName, escapedCsvPath
         );
