
Add usage of validation split to cleaning script #5

@Lena-Jurkschat

Description


The Python cleaning script data-preparation/preprocessing/training/01a_catalogue_cleaning_and_filtering/clean.py
currently uses only the train split. It needs to iterate over all splits and apply the filters to each of them.

Hint: simply dropping the split selection, i.e. changing load_from_disk(dataset_path)['train'] to load_from_disk(dataset_path), is not enough, because the call then returns a DatasetDict object instead of a Dataset. As a consequence, dataset.select() is not available, since that method exists only on the Dataset type.
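
One possible way to handle this is to loop over the splits of the DatasetDict and run the existing per-split logic on each Dataset. The following is a minimal sketch, not the actual contents of clean.py: the names apply_to_all_splits, clean_dataset_dict and keep_indices_fn are hypothetical and stand in for whatever filter logic the script already applies to the train split.

```python
from typing import Callable, Dict, Mapping, TypeVar

T = TypeVar("T")


def apply_to_all_splits(splits: Mapping[str, T], per_split: Callable[[T], T]) -> Dict[str, T]:
    """Run `per_split` on every split and return the results keyed by split name."""
    return {name: per_split(split) for name, split in splits.items()}


def clean_dataset_dict(dataset_path: str, keep_indices_fn):
    """Hypothetical wrapper: apply an index-based filter to every split on disk.

    `keep_indices_fn` is an assumed callable that takes a Dataset and returns
    the row indices to keep; it is a placeholder for the script's real filters.
    """
    # Imported lazily so the helper above works without `datasets` installed.
    from datasets import DatasetDict, load_from_disk

    # Without the ['train'] subscript this is a DatasetDict (split name -> Dataset).
    dataset_dict = load_from_disk(dataset_path)

    cleaned = apply_to_all_splits(
        dataset_dict,
        # Each value is a Dataset, so .select() is available at this level.
        lambda ds: ds.select(keep_indices_fn(ds)),
    )
    return DatasetDict(cleaned)
```

Keeping the loop in a small generic helper means the per-split filter code can stay as it is; only the call site changes from a single Dataset to a mapping of split names to Datasets.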
