@nschan
Collaborator

@nschan nschan commented Jul 9, 2025

As suggested here, this is a full refactor of genomeassembler to support sample-level parameterisation of everything.
Currently, this PR contains the full pipeline, tested with stub runs of heterogeneous samples in a single sample sheet.

Why?
Often when doing genome assembly, we do not know in advance which settings work best. With this change, the pipeline can run different settings on the same set of reads so that the resulting assemblies can be compared. Samples that share the same value in group will be combined during reporting to facilitate comparisons of strategies on the same input(s). The report process / script will be updated to fit this new design (ongoing).

Details
This was a bit more tricky than I had initially hoped. Essentially, all params are stuffed into a main channel, which contains a map. I think a map is the only way to handle this channel safely, since sometimes entries are replaced and I am afraid that positional indexing would be too confusing (for me).

This works fine, but channels containing maps cannot be joined. For this reason, a pattern that looks like:

map_channel_1
    // Convert to list for join
    .map { it -> it.collect { entry -> [ entry.value, entry ] } }
    .join(
        map_channel_2
            // Convert to list for join
            .map { it -> it.collect { entry -> [ entry.value, entry ] } }
    )
    // After joining, re-create the map from the stored entries
    .map { it -> it.collect { _entry, map -> [ (map.key): map.value ] }.collectEntries() }

is used throughout to join map channels and recover the map after joining.
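To make the round trip easier to follow, here is a minimal plain-Groovy sketch of the same idea without any Nextflow channels. The two sample maps, their keys (`meta`, `ontreads`, `assembly`) and the hand-written "join" step are assumptions for illustration only; the sketch just emulates two items whose first elements match, it does not reproduce how `join()` matches keys internally.

```groovy
// Plain-Groovy sketch of the map -> list -> map round trip (no Nextflow needed).
// The sample maps and their keys are hypothetical.
def left  = [meta: [id: 'Sample_flye'], ontreads: 'ONT.fastq.gz']
def right = [meta: [id: 'Sample_flye'], assembly: 'assembly.fa.gz']

// Step 1: turn each map into a list of [value, entry] pairs,
// mirroring the first .map { ... } in the pattern above.
def toPairs = { Map m -> m.collect { entry -> [entry.value, entry] } }
def leftList  = toPairs(left)
def rightList = toPairs(right)

// Step 2: emulate the joined item by hand: the shared first element once,
// followed by the remaining elements of both lists.
assert leftList[0][0] == rightList[0][0]
def joined = [leftList[0]] + leftList.tail() + rightList.tail()

// Step 3: rebuild a single map from the stored entries,
// mirroring the final .map { ... } in the pattern above.
def merged = joined.collect { _value, entry -> [(entry.key): entry.value] }.collectEntries()

assert merged == [meta: [id: 'Sample_flye'], ontreads: 'ONT.fastq.gz', assembly: 'assembly.fa.gz']
println merged
```

Because each list element carries the original map entry alongside its value, no positional bookkeeping is needed to rebuild the map after the join.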
Generally, mix() is used to keep items moving asynchronously, and join() is (hopefully) used sparingly to avoid blocking while waiting for processes.

The overall sample-wise parameterisation is offloaded to subworkflows/local/utils_nfcore_genomeassembler_pipeline/main.nf; currently this does not produce errors (even though it should). This should be a minor fix.
This also does some validation and resolves conflicts that may arise from params that are incompatible with certain samples, e.g. medaka cannot be used if there are no ONT reads.
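As a rough illustration of what such a per-sample check could look like, here is a plain-Groovy sketch. It is not taken from the PR: the closure name `resolveConflicts`, the `polish_medaka` field, and the choice to disable the option with a warning (rather than fail) are all assumptions.

```groovy
// Hypothetical per-sample conflict check; field names and behaviour are
// assumptions for illustration, not the pipeline's actual implementation.
def resolveConflicts = { Map sample ->
    def resolved = new LinkedHashMap(sample)   // copy, leave the input untouched
    def warnings = []
    // medaka polishes with ONT reads, so it cannot run without them
    if (resolved.polish_medaka && !resolved.ontreads) {
        warnings << "${resolved.sample}: no ONT reads, disabling medaka".toString()
        resolved.polish_medaka = false
    }
    [resolved, warnings]
}

def input = [sample: 'Sample_hifiasm', ontreads: '', hifireads: 'hifi_reads.fastq.gz', polish_medaka: true]
def (checked, warnings) = resolveConflicts(input)

assert checked.polish_medaka == false
assert warnings.size() == 1
assert input.polish_medaka == true   // original map is unchanged
```

Returning the warnings next to the resolved map keeps the check side-effect free, so the caller can decide whether to log them or turn them into errors.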
Currently, there are no tests included since I would like to get some feedback on whether this is at all reasonable, or if there would be better ways to do things.
I have tested this with a samplesheet that looks like:

sample,ontreads,hifireads,ref_fasta,ref_gff,shortread_F,shortread_R,paired,strategy,assembler
Sample_flye,ONT.fastq.gz,,ref.fasta,ref.gff3,shortread_F.fastq.gz,shortread_R.fastq.gz,true,single,flye
Sample_hifiasm,,hifi_reads.fastq.gz,ref.fasta,ref.gff3,shortread_F.fastq.gz,shortread_R.fastq.gz,true,single,hifiasm
Sample_hifiasm_ont,ONT.fastq.gz,,ref.fasta,ref.gff3,shortread_F.fastq.gz,shortread_R.fastq.gz,true,single,hifiasm
Sample_hifiasm_ul,ONT.fastq.gz,hifi_reads.fastq.gz,ref.fasta,ref.gff3,shortread_F.fastq.gz,shortread_R.fastq.gz,true,hybrid,hifiasm
Sample_flye_hifiasm_scaff,ONT.fastq.gz,hifi_reads.fastq.gz,ref.fasta,ref.gff3,shortread_F.fastq.gz,shortread_R.fastq.gz,true,scaffold,flye_hifiasm
Sample_hifiasm_scaff,ONT.fastq.gz,hifi_reads.fastq.gz,ref.fasta,ref.gff3,shortread_F.fastq.gz,shortread_R.fastq.gz,true,scaffold,hifiasm_hifiasm

I ran this in combination with different params (e.g. --polish_pilon, --scaffold_longstitch, etc.) in stub runs, so I think the overall logic is fine.

nschan and others added 25 commits May 22, 2025 15:49
* add prefix to singularity container for report

* add files exist check for references, closes nf-core#165
* Template update for nf-core/tools version 3.2.1

* Template update for nf-core/tools version 3.3.1

* merge template 3.3.1 - fix linting

* update pre-commit

* merge template 3.3.1 - fix linting

* pre-commit config?

* pre-commit config?

* reinstall links

* try larger runner

* smaller run, disable bloom filter for hifiasm test

* updated test snapshot

* updated test snapshot

* update nftignore

* update nftignore

* update nftignore

* update nftignore

* update nftignore

* update nftignore

* update nftignore

* update nftignore

* update nftignore

* Update .github/actions/nf-test/action.yml

Co-authored-by: Matthias Hörtenhuber <mashehu@users.noreply.github.com>

* Update docs/output.md

Co-authored-by: Matthias Hörtenhuber <mashehu@users.noreply.github.com>

* remove .nf-test.log

---------

Co-authored-by: Niklas Schandry <niklas@bio.lmu.de>
Co-authored-by: Matthias Hörtenhuber <mashehu@users.noreply.github.com>
@nf-core-bot
Member

nf-core-bot commented Jul 9, 2025

Warning

Newer version of the nf-core template is available.

Your pipeline is using an old version of the nf-core template: 3.3.2.
Please update your pipeline to the latest version.

For more documentation on how to update your pipeline, please see the nf-core documentation and Synchronisation documentation.

nschan and others added 4 commits July 10, 2025 14:57
* update awk regex

* update snapshot
* add prefix to singularity container for report

* include gawk in gfa2fa env

* include gawk in gfa2fa env
* add prefix to singularity container for report

* include gawk in gfa2fa env

* include gawk in gfa2fa env

* mawk version

* mawk version in stub

* update CHANGELOG

* Update CHANGELOG.md

Co-authored-by: Simon Pearce <24893913+SPPearce@users.noreply.github.com>

* [automated] Fix code linting

---------

Co-authored-by: Simon Pearce <24893913+SPPearce@users.noreply.github.com>
Co-authored-by: nf-core-bot <core@nf-co.re>