Concat datasets are slower than they need to be. #15

@HesitantlyHuman

When you concat datasets repeatedly, each concat wraps the previous result, so you end up with a deeply nested structure.
Something like this:

concat_dataset/
├─ concat_dataset/
│  ├─ concat_dataset/
│  │  ├─ concat_dataset/
│  │  │  ├─ dataset_0
│  │  │  ├─ dataset_1
│  │  ├─ dataset_2
│  ├─ dataset_3
├─ dataset_4

This is inefficient, since every lookup has to pass through each level of nesting. Instead, we could flatten nested concat datasets into:

concat_dataset/
├─ dataset_0
├─ dataset_1
├─ dataset_2
├─ dataset_3
├─ dataset_4

I don't have numbers on the actual performance impact, but it is likely to become significant when a user does many splits and concats.
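
A minimal sketch of the flattening idea, assuming a concat dataset just holds a list of children and maps an index to a child by cumulative length. `ConcatDataset`, `datasets`, and `offsets` are hypothetical names for illustration and may not match this library's actual API:

```python
from bisect import bisect_right
from itertools import accumulate


class ConcatDataset:
    """Hypothetical concat dataset that flattens nested concats on creation."""

    def __init__(self, datasets):
        flat = []
        for dataset in datasets:
            if isinstance(dataset, ConcatDataset):
                # A nested concat is already flat, so splice in its children.
                flat.extend(dataset.datasets)
            else:
                flat.append(dataset)
        self.datasets = flat
        # Cumulative lengths: child lengths [2, 1, 2] give offsets [2, 3, 5].
        self.offsets = list(accumulate(len(d) for d in flat))

    def __len__(self):
        return self.offsets[-1] if self.offsets else 0

    def __getitem__(self, index):
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError(index)
        # Binary search for the child that contains this index.
        child = bisect_right(self.offsets, index)
        start = self.offsets[child - 1] if child else 0
        return self.datasets[child][index - start]


# Plain lists stand in for real datasets here.
inner = ConcatDataset([[0, 1], [2]])
middle = ConcatDataset([inner, [3, 4]])
outer = ConcatDataset([middle, [5]])
```

With this approach `outer.datasets` holds the four leaf datasets directly, so a lookup is a single binary search over the offsets rather than one hop per level of nesting.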

Labels: enhancement
