Linear probing hash join clean dev#5

Draft
gropaul wants to merge 368 commits into my-feature from linear-probing-hash-join-clean-dev

@gropaul gropaul commented Apr 16, 2024

No description provided.

Mytherin and others added 30 commits April 4, 2024 14:56
Note that this does not currently work for extensions that are loaded twice
This might happen (for example in tpch) if the _init function calls Load explicitly
This cleans up CMake syntax and fixes a problem where empty strings would be conflated with no argument
Mytherin and others added 24 commits April 14, 2024 10:21
fix(jdbc): support non-string parameter types
We have our own signing mechanism, and they conflict making the Apple
signature invalid
Co-authored-by: Carlo Piovesan <piovesan.carlo@gmail.com>
…ions

Avoid performing Apple codesign on extensions
Filter out single relation predicates before join ordering
…value

Fix `last_value` in the `duckdb_sequences` metadata function
Limit batch insert threads based on available memory, similar to Parquet write
[Vacuum] Fix serialization and Copy of the VacuumStatement
@gropaul gropaul changed the base branch from main to my-feature April 16, 2024 08:35
@gropaul gropaul changed the base branch from my-feature to main May 24, 2024 12:29
@gropaul gropaul changed the base branch from main to my-feature May 24, 2024 12:29
gropaul pushed a commit that referenced this pull request Feb 18, 2025
We had two users crash with the following backtrace:

```
    frame #0: 0x0000ffffab2571ec
    frame #1: 0x0000aaaaac00c5fc duckling`duckdb::InternalException::InternalException(this=<unavailable>, msg=<unavailable>) at exception.cpp:328:2
    frame #2: 0x0000aaaaac1ee418 duckling`duckdb::optional_ptr<duckdb::OptimisticDataWriter, true>::CheckValid(this=<unavailable>) const at optional_ptr.hpp:34:11
    frame #3: 0x0000aaaaac1eea8c duckling`duckdb::MergeCollectionTask::Execute(duckdb::PhysicalBatchInsert const&, duckdb::ClientContext&, duckdb::GlobalSinkState&, duckdb::LocalSinkState&) [inlined] duckdb::optional_ptr<duckdb::OptimisticDataWriter, true>::operator*(this=<unavailable>) at optional_ptr.hpp:43:3
    frame #4: 0x0000aaaaac1eea84 duckling`duckdb::MergeCollectionTask::Execute(this=0x0000aaaaf1b06150, op=<unavailable>, context=0x0000aaaba820d8d0, gstate_p=0x0000aaab06880f00, lstate_p=<unavailable>) at physical_batch_insert.cpp:219:90
    frame #5: 0x0000aaaaac1d2e10 duckling`duckdb::PhysicalBatchInsert::Sink(duckdb::ExecutionContext&, duckdb::DataChunk&, duckdb::OperatorSinkInput&) const [inlined] duckdb::PhysicalBatchInsert::ExecuteTask(this=0x0000aaaafa62ab40, context=<unavailable>, gstate_p=0x0000aaab06880f00, lstate_p=0x0000aab12d442960) const at physical_batch_insert.cpp:425:8
    frame #6: 0x0000aaaaac1d2dd8 duckling`duckdb::PhysicalBatchInsert::Sink(duckdb::ExecutionContext&, duckdb::DataChunk&, duckdb::OperatorSinkInput&) const [inlined] duckdb::PhysicalBatchInsert::ExecuteTasks(this=0x0000aaaafa62ab40, context=<unavailable>, gstate_p=0x0000aaab06880f00, lstate_p=0x0000aab12d442960) const at physical_batch_insert.cpp:431:9
    frame #7: 0x0000aaaaac1d2dd8 duckling`duckdb::PhysicalBatchInsert::Sink(this=0x0000aaaafa62ab40, context=0x0000aab2fffd7cb0, chunk=<unavailable>, input=<unavailable>) const at physical_batch_insert.cpp:494:4
    frame #8: 0x0000aaaaac353158 duckling`duckdb::PipelineExecutor::ExecutePushInternal(duckdb::DataChunk&, duckdb::ExecutionBudget&, unsigned long) [inlined] duckdb::PipelineExecutor::Sink(this=0x0000aab2fffd7c00, chunk=0x0000aab2fffd7d30, input=0x0000fffec0aba8d8) at pipeline_executor.cpp:521:24
    frame #9: 0x0000aaaaac353130 duckling`duckdb::PipelineExecutor::ExecutePushInternal(this=0x0000aab2fffd7c00, input=0x0000aab2fffd7d30, chunk_budget=0x0000fffec0aba980, initial_idx=0) at pipeline_executor.cpp:332:23
    frame #10: 0x0000aaaaac34f7b4 duckling`duckdb::PipelineExecutor::Execute(this=0x0000aab2fffd7c00, max_chunks=<unavailable>) at pipeline_executor.cpp:201:13
    frame #11: 0x0000aaaaac34f258 duckling`duckdb::PipelineTask::ExecuteTask(duckdb::TaskExecutionMode) [inlined] duckdb::PipelineExecutor::Execute(this=<unavailable>) at pipeline_executor.cpp:278:9
    frame #12: 0x0000aaaaac34f250 duckling`duckdb::PipelineTask::ExecuteTask(this=0x0000aab16dafd630, mode=<unavailable>) at pipeline.cpp:51:33
    frame #13: 0x0000aaaaac348298 duckling`duckdb::ExecutorTask::Execute(this=0x0000aab16dafd630, mode=<unavailable>) at executor_task.cpp:49:11
    frame #14: 0x0000aaaaac356600 duckling`duckdb::TaskScheduler::ExecuteForever(this=0x0000aaaaf0105560, marker=0x0000aaaaf00ee578) at task_scheduler.cpp:189:32
    frame #15: 0x0000ffffab0a31fc
    frame #16: 0x0000ffffab2ad5c8
```

Core dump analysis showed that the assertion `D_ASSERT(lstate.writer);`
in `MergeCollectionTask::Execute` was not satisfied when
`PhysicalBatchInsert::Sink` was processing merge tasks from other
pipeline executors. In other words, the crash occurs because
`lstate.writer` is a null pointer at that point.

My suspicion is that this is only likely to happen for heavily
concurrent workloads, which matches the workloads of the two users who
hit the crash. The patch submitted as part of this PR has addressed the
issue for these users.
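
The failure mode in frames #2 to #4 can be illustrated with a minimal sketch. `CheckedPtr`, `Writer`, `LocalState` and `TryFlush` below are hypothetical names invented for this illustration; this is neither DuckDB's `optional_ptr` nor the patch itself. It only assumes what the backtrace shows: a checked pointer that throws on dereference when it is unset.

```cpp
#include <cassert>
#include <stdexcept>

// Simplified stand-in for a checked, non-owning pointer wrapper.
// Illustrative only: this is not DuckDB's optional_ptr.
template <class T>
class CheckedPtr {
public:
    CheckedPtr() : ptr_(nullptr) {}
    explicit CheckedPtr(T *ptr) : ptr_(ptr) {}

    // Throws instead of dereferencing a null pointer.
    T &operator*() const {
        if (!ptr_) {
            throw std::runtime_error("attempted to dereference a null CheckedPtr");
        }
        return *ptr_;
    }

    explicit operator bool() const { return ptr_ != nullptr; }

private:
    T *ptr_;
};

// Hypothetical writer and per-thread state; the writer may not be set yet.
struct Writer {
    int flushed = 0;
    void Flush() { ++flushed; }
};

struct LocalState {
    CheckedPtr<Writer> writer;
};

// Returns false when the writer is missing instead of dereferencing it.
bool TryFlush(LocalState &lstate) {
    if (!lstate.writer) {
        return false;
    }
    (*lstate.writer).Flush();
    return true;
}
```

A task that dereferences `lstate.writer` without first checking it behaves like the crashing frame: the dereference throws as soon as the local state reaches the task before its writer has been created.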
gropaul pushed a commit that referenced this pull request May 15, 2025
gropaul pushed a commit that referenced this pull request Dec 1, 2025
…uckdb#19680) (duckdb#19811)

Fixes duckdb#19680

This fixes a bug where queries using `NOT EXISTS` with `IS DISTINCT
FROM` returned incorrect results due to improper handling of NULL
semantics in the optimizer.

The issue was that the optimizer's deliminator incorrectly treated
`DISTINCT FROM` variants the same as regular equality/inequality
comparisons, which have different NULL handling:
  - `IS DISTINCT FROM`: NULL-aware (`NULL IS DISTINCT FROM NULL` = FALSE)
  - `!=` or `=`: NULL-unaware (`NULL != NULL` = NULL, so rows with NULLs are filtered out)
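
The difference between the two comparison families can be sketched in a few lines of C++, with `std::optional` standing in for SQL NULL. The function names below are assumptions made for this illustration and do not correspond to DuckDB internals.

```cpp
#include <cassert>
#include <optional>

using NullableInt = std::optional<int>;   // std::nullopt models SQL NULL
using NullableBool = std::optional<bool>; // std::nullopt models an unknown result

// a != b: the result is NULL (unknown) if either side is NULL.
NullableBool NotEquals(const NullableInt &a, const NullableInt &b) {
    if (!a || !b) {
        return std::nullopt;
    }
    return *a != *b;
}

// a IS DISTINCT FROM b: always TRUE or FALSE, never NULL.
bool IsDistinctFrom(const NullableInt &a, const NullableInt &b) {
    if (!a && !b) {
        return false; // two NULLs are not distinct
    }
    if (!a || !b) {
        return true; // NULL is distinct from any non-NULL value
    }
    return *a != *b;
}

// A filter or join condition keeps a row only when the predicate is TRUE.
bool PassesFilter(const NullableBool &predicate) {
    return predicate.has_value() && *predicate;
}
```

With these definitions, `NotEquals(std::nullopt, 1)` is unknown and does not pass a filter, while `IsDistinctFrom(std::nullopt, 1)` is true. Rewriting one into the other therefore changes which rows match.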


### Incorrect Query Plan

```
┌───────────────────────────┐
│         PROJECTION        │
│    ────────────────────   │
│             c2            │
│                           │
│          ~0 rows          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│             #5            │
│__internal_decompress_integ│
│     ral_integer(#3, 1)    │
│             #1            │
│                           │
│          ~0 rows          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      NESTED_LOOP_JOIN     │
│    ────────────────────   │
│      Join Type: ANTI      │
│    Conditions: c2 != c2   ├──────────────┐
│                           │              │
│          ~0 rows          │              │
└─────────────┬─────────────┘              │
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         PROJECTION        ││         PROJECTION        │
│    ────────────────────   ││    ────────────────────   │
│            NULL           ││            NULL           │
│             #2            ││             #2            │
│            NULL           ││            NULL           │
│             #1            ││             #1            │
│            NULL           ││            NULL           │
│             #0            ││             #0            │
│            NULL           ││            NULL           │
│                           ││                           │
│          ~2 rows          ││           ~1 row          │
└─────────────┬─────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         PROJECTION        ││         PROJECTION        │
│    ────────────────────   ││    ────────────────────   │
│             #0            ││             #0            │
│__internal_compress_integra││__internal_compress_integra│
│     l_utinyint(#1, 1)     ││     l_utinyint(#1, 1)     │
│             #2            ││             #2            │
│                           ││                           │
│          ~2 rows          ││           ~1 row          │
└─────────────┬─────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         PROJECTION        ││         PROJECTION        │
│    ────────────────────   ││    ────────────────────   │
│            NULL           ││            NULL           │
│             #0            ││             #0            │
│            NULL           ││            NULL           │
│                           ││                           │
│          ~2 rows          ││           ~1 row          │
└─────────────┬─────────────┘└─────────────┬─────────────┘
┌─────────────┴─────────────┐┌─────────────┴─────────────┐
│         SEQ_SCAN          ││           FILTER          │
│    ────────────────────   ││    ────────────────────   │
│         Table: t0         ││     (col0 IS NOT NULL)    │
│   Type: Sequential Scan   ││                           │
│      Projections: c2      ││                           │
│                           ││                           │
│          ~2 rows          ││           ~1 row          │
└───────────────────────────┘└─────────────┬─────────────┘
                             ┌─────────────┴─────────────┐
                             │         SEQ_SCAN          │
                             │    ────────────────────   │
                             │         Table: t0         │
                             │   Type: Sequential Scan   │
                             │      Projections: c2      │
                             │                           │
                             │          ~2 rows          │
                             └───────────────────────────┘
```

The buggy plan shows two critical issues:
```
  ┌─────────────┴─────────────┐
  │      NESTED_LOOP_JOIN     │
  │      Join Type: ANTI      │
  │    Conditions: c2 != c2   │  ← ❌ Wrong (the join condition should be c2 IS DISTINCT FROM c2)
  │          ~0 rows          │
  └─────────────┬─────────────┘
                │
                └─────────────┐
                             ┌┴─────────────┐
                             │   FILTER     │
                             │ (col0 IS NOT │  ← ❌ Wrong (the filter should be removed)
                             │    NULL)     │
                             └──────────────┘
```

### Solution

This PR adds proper support for DISTINCT FROM operators throughout the
optimization pipeline:

1. Preserve DISTINCT FROM semantics in join
conversion (`src/optimizer/deliminator.cpp`).
```
// NOTE: We should NOT convert DISTINCT FROM to != in general
// Only convert if the ORIGINAL join had != or = (not DISTINCT FROM variants)
if (delim_join.join_type != JoinType::MARK &&
    original_join_comparison != ExpressionType::COMPARE_DISTINCT_FROM &&
    original_join_comparison != ExpressionType::COMPARE_NOT_DISTINCT_FROM) {
    // Safe to convert
}
```
2. Skip NULL filters for DISTINCT FROM
variants (`src/optimizer/deliminator.cpp`).
```
// Only add IS NOT NULL filter for regular equality/inequality comparisons
// Do NOT add for DISTINCT FROM variants, as they handle NULL correctly
if (cond.comparison != ExpressionType::COMPARE_NOT_DISTINCT_FROM &&
    cond.comparison != ExpressionType::COMPARE_DISTINCT_FROM) {
    // Add IS NOT NULL filter
}
```
3. Add negation support for `COMPARE_DISTINCT_FROM` and
`COMPARE_NOT_DISTINCT_FROM` in expression type handling
(`src/common/enums/expression_type.cpp`).
4. Update the parser to properly negate `IS DISTINCT FROM` expressions when
wrapped with `NOT`
(`src/parser/transform/expression/transform_bool_expr.cpp`).
5. Add a regression test in
`test/sql/subquery/exists/test_correlated_exists_with_derived_table.test`.
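
The negation support described in step 3 can be pictured as a mapping between comparison types. The sketch below uses a local, hypothetical `ComparisonType` enum and `Negate` function rather than DuckDB's `ExpressionType`, and only covers the four comparisons discussed here.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical enum for illustration; not DuckDB's ExpressionType.
enum class ComparisonType { EQUAL, NOT_EQUAL, DISTINCT_FROM, NOT_DISTINCT_FROM };

// Maps each comparison to its logical negation.
// The DISTINCT FROM variants negate into each other and are never
// rewritten into = or !=, which would change their NULL handling.
ComparisonType Negate(ComparisonType type) {
    switch (type) {
    case ComparisonType::EQUAL:
        return ComparisonType::NOT_EQUAL;
    case ComparisonType::NOT_EQUAL:
        return ComparisonType::EQUAL;
    case ComparisonType::DISTINCT_FROM:
        return ComparisonType::NOT_DISTINCT_FROM;
    case ComparisonType::NOT_DISTINCT_FROM:
        return ComparisonType::DISTINCT_FROM;
    }
    throw std::invalid_argument("unsupported comparison type");
}
```

Under this mapping, `NOT (a IS DISTINCT FROM b)` becomes `a IS NOT DISTINCT FROM b`, so the NULL-aware behaviour survives the negation.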