
Conversation

@AndreaBozzo (Contributor):

This PR refactors the schedule file parsing in the sqllogictest crate to use serde-derived structs instead of manual TOML parsing, as requested in #1952.

Changes

  • New structs with serde derives:

    • ScheduleConfig - top-level configuration parsed from TOML
    • EngineConfig - per-engine configuration with #[serde(flatten)] for extensibility
    • EngineType - enum of supported engine types
  • Refactored parsing flow:

    • Schedule::from_file() now uses toml::from_str() directly
    • Added instantiate_engines() to separate parsing from engine creation
    • Removed manual parse_engines() and parse_steps() functions
  • Forward-compatibility:

    • Unknown keys are captured into EngineConfig.extra via #[serde(flatten)], so future fields (e.g. the catalog configuration for #1943) can be parsed without code changes (see the struct sketch below)

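For orientation, a minimal sketch of what these serde-derived structs might look like. The type names and the flattened extra map come from this PR's diff; the engine_type field name, the single Datafusion variant, and the omission of the steps section are simplifying assumptions.

use std::collections::HashMap;

use serde::Deserialize;

/// Top-level configuration parsed from the schedule TOML file
/// (the steps section is omitted in this sketch).
#[derive(Debug, Deserialize)]
pub struct ScheduleConfig {
    /// The `[engines]` table, keyed by engine name.
    pub engines: HashMap<String, EngineConfig>,
}

/// Supported engine types.
#[derive(Debug, Clone, Deserialize)]
#[serde(rename_all = "lowercase")]
pub enum EngineType {
    Datafusion,
}

/// Per-engine configuration.
#[derive(Debug, Clone, Deserialize)]
pub struct EngineConfig {
    /// Engine kind, e.g. type = "datafusion".
    #[serde(rename = "type")]
    pub engine_type: EngineType,
    /// Additional configuration fields for extensibility: unknown keys
    /// (e.g. catalog_type) land here instead of failing deserialization.
    #[serde(flatten)]
    pub extra: HashMap<String, toml::Value>,
}
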
Relation to #1943

This PR was suggested by @liurenjie1024 as a prerequisite to #1943 (dynamic catalog configuration). The #[serde(flatten)] approach allows #1943 to simply extract the catalog configuration from EngineConfig.extra without modifying the parsing logic.
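
Building on the sketch above, a purely illustrative helper shows the kind of lookup #1943 could do; the catalog_type key and the helper itself are not part of this PR.

/// Illustrative helper (not part of this PR): look up an optional catalog type
/// in the flattened extra fields of an engine's configuration.
fn catalog_type(config: &EngineConfig) -> Option<&str> {
    config.extra.get("catalog_type").and_then(|v| v.as_str())
}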

Testing

  • All existing tests pass
  • Added new unit tests for deserialization behavior
  • Integration test with df_test.toml passes unchanged

Closes #1952

Replace manual TOML parsing with serde-derived structs to improve
maintainability and separate parsing from engine instantiation.

Changes:
- Add ScheduleConfig, EngineConfig, and EngineType structs with serde derives
- Use #[serde(flatten)] for forward-compatibility with future config fields
- Refactor Schedule::from_file() to use toml::from_str() directly
- Add instantiate_engines() to separate parsing from engine creation
- Remove manual parse_engines() and parse_steps() functions
- Update tests to verify deserialization behavior

Closes apache#1952
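
A minimal sketch of that parsing flow, reusing the ScheduleConfig and EngineConfig sketch above together with a hypothetical stand-in Engine type; the real crate's signatures, error type, and engine construction differ.

use std::collections::HashMap;
use std::path::Path;

use anyhow::Result;

/// Hypothetical stand-in for the real engine instances.
pub struct Engine {
    pub name: String,
}

pub struct Schedule {
    pub engines: HashMap<String, Engine>,
}

impl Schedule {
    /// Parse the schedule file with serde, then build engines in a separate step.
    pub fn from_file(path: impl AsRef<Path>) -> Result<Self> {
        let content = std::fs::read_to_string(path)?;
        // Parsing is now a single serde call instead of manual TOML traversal.
        let config: ScheduleConfig = toml::from_str(&content)?;
        let engines = instantiate_engines(&config.engines);
        Ok(Self { engines })
    }
}

/// Turn parsed per-engine configs into engine instances.
fn instantiate_engines(configs: &HashMap<String, EngineConfig>) -> HashMap<String, Engine> {
    configs
        .iter()
        // In the real crate this is where the engine constructors would be called.
        .map(|(name, _config)| (name.clone(), Engine { name: name.clone() }))
        .collect()
}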

impl DataFusionEngine {
-    pub async fn new(config: TomlTable) -> Result<Self> {
+    pub async fn new(_name: &str, config: &EngineConfig) -> Result<Self> {
Collaborator:

Could you help me understand why we add the name here? I can't think of any case where the name would matter.

Contributor Author (@AndreaBozzo):

I thought that if there are multiple instances, the name could help debug/log which one fails. As this is currently unclear, I'll remove it; we can always re-add it later.

/// Additional configuration fields for extensibility
/// This allows forward-compatibility with future fields like catalog_type, catalog_properties
#[serde(flatten)]
pub extra: HashMap<String, toml::Value>,
@CTTY (Collaborator) commented on Dec 22, 2025:

This is an interesting direction, but I feel that we will need another serializable config like DatafusionCatalogConfig in the future, rather than using toml::Value directly.

@liurenjie1024 (Contributor) left a comment:

Thanks @AndreaBozzo, this is much better now. I have a complete example of the engine config below; please help to fix it.


/// Configuration for a single engine instance
#[derive(Debug, Clone, Deserialize)]
pub struct EngineConfig {
Contributor:

Suggested change:

-pub struct EngineConfig {
+pub enum EngineConfig {
+    DataFusion {
+        catalog: DatafusionCatalogConfig,
+    },
+}
+
+pub struct DatafusionCatalogConfig {
+    typ: String,
+    properties: HashMap<String, String>,
+}
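
For the TOML type = "datafusion" key to select the DataFusion variant, the suggested enum would also need a serde tag attribute. A sketch under that assumption, using props for the properties map to match the TOML example later in this review; the serde attributes and the type rename are not part of the suggestion itself.

use std::collections::HashMap;

use serde::Deserialize;

/// Engine configuration as a tagged enum: the TOML `type` key picks the variant.
#[derive(Debug, Clone, Deserialize)]
#[serde(tag = "type", rename_all = "lowercase")]
pub enum EngineConfig {
    /// Selected by type = "datafusion".
    DataFusion { catalog: DatafusionCatalogConfig },
}

/// Catalog settings nested under catalog = { ... }.
#[derive(Debug, Clone, Deserialize)]
pub struct DatafusionCatalogConfig {
    /// Catalog kind, e.g. "rest" or "memory".
    #[serde(rename = "type")]
    typ: String,
    /// Free-form catalog properties (url, credentials, ...).
    #[serde(default)]
    props: HashMap<String, String>,
}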

/// Additional configuration fields for extensibility
/// This allows forward-compatibility with future fields like catalog_type, catalog_properties
#[serde(flatten)]
pub extra: HashMap<String, toml::Value>,
Contributor:
+1, see the suggestion above.

// Test forward-compatibility with extra fields (for PR #1943)
let input = r#"
[engines]
df = { type = "datafusion", catalog_type = "rest", some_future_field = "value" }
Contributor:

Suggested change:

-df = { type = "datafusion", catalog_type = "rest", some_future_field = "value" }
+# props is a hashmap of catalog properties.
+df = { type = "datafusion", catalog = { type = "rest", props = { url = "http://localhost:9000", use_x = "true" } } }

This is a complete example; we could then use a catalog loader to load the catalog.
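
A sketch of a unit test for this shape, reusing the EngineConfig enum sketched above together with a minimal local stand-in for the top-level schedule config; the module, test name, and assertions are illustrative only.

#[cfg(test)]
mod schedule_config_tests {
    use std::collections::HashMap;

    use serde::Deserialize;

    // The EngineConfig enum from the sketch above is assumed to live in the parent module.
    use super::EngineConfig;

    /// Minimal stand-in for the top-level schedule config: just the `[engines]` table.
    #[derive(Debug, Deserialize)]
    struct ScheduleConfig {
        engines: HashMap<String, EngineConfig>,
    }

    #[test]
    fn parses_datafusion_engine_with_rest_catalog() {
        let input = r#"
            [engines]
            df = { type = "datafusion", catalog = { type = "rest", props = { url = "http://localhost:9000", use_x = "true" } } }
        "#;

        let config: ScheduleConfig = toml::from_str(input).expect("schedule should parse");
        match &config.engines["df"] {
            EngineConfig::DataFusion { catalog } => {
                assert_eq!(catalog.typ, "rest");
                assert_eq!(catalog.props["url"], "http://localhost:9000");
            }
        }
    }
}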

…Config

- Replace EngineConfig struct with tagged enum for better type safety
- Add CatalogConfig struct with catalog_type and props fields
- Remove unused _name parameter from DataFusionEngine::new()
- Prepare structure for future catalog loader integration
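
A hedged sketch of the shape these bullets describe: the CatalogConfig name and its catalog_type and props fields come from the commit message, while the serde attributes, the optional catalog field, and the exact TOML key mapping are assumptions (per the discussion below, the memory catalog remains the default when no catalog is configured).

use std::collections::HashMap;

use serde::Deserialize;

/// Tagged enum replacing the old EngineConfig struct.
#[derive(Debug, Clone, Deserialize)]
#[serde(tag = "type", rename_all = "lowercase")]
pub enum EngineConfig {
    DataFusion {
        /// Optional catalog section; when absent, the memory catalog remains the default.
        catalog: Option<CatalogConfig>,
    },
}

/// Catalog configuration with a type and free-form properties.
#[derive(Debug, Clone, Deserialize)]
pub struct CatalogConfig {
    pub catalog_type: String,
    #[serde(default)]
    pub props: HashMap<String, String>,
}
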
@AndreaBozzo force-pushed the feat/serde-schedule-parsing branch from e8867c8 to f09e5e7 on December 23, 2025 at 09:19.
@AndreaBozzo (Contributor, Author) commented on Dec 23, 2025:

Thanks for the detailed feedback @liurenjie1024 and @CTTY! I really appreciate your time and patience on this.

I've addressed the review comments in commit f09e5e7; the previous one lacked fmt.

The memory catalog remains the default for now. Full catalog loader integration (REST, Glue, HMS, etc.) would have added many new dependencies, and I wanted to keep it simple to review and to modify later.

Thanks again.

Merging this pull request may close: Use derived parser to parse integration tests schedule file rather than maintaining manual parsing code.