Skip to content

Fix for NFCORE_MAG:MAG:CAT_SUMMARY input file name collision#489

Merged
jfy133 merged 38 commits intonf-core:devfrom
maxibor:custom
Nov 3, 2023
Merged

Fix for NFCORE_MAG:MAG:CAT_SUMMARY input file name collision#489
jfy133 merged 38 commits intonf-core:devfrom
maxibor:custom

Conversation

@maxibor
Copy link
Copy Markdown
Member

@maxibor maxibor commented Aug 8, 2023

This PR brings a fix to what was first described in #433
The issue happened when using --postbinning_input both in combination with domain annotation, bin_refinement and CAT profiling.
This is because bins annotated as eukaryotic are sent to the refined bins channels as is, bypassing dastool, and hence keeping the same name.
This fix adds a "refined" entry to the meta, and includes it in the new prefix for the CAT process.

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs- [ ] If necessary, also make a PR on the nf-core/mag branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@maxibor maxibor changed the title Fix for NFCORE_MAG:MAG:CAT_SUMMARY input file name collision` Fix for NFCORE_MAG:MAG:CAT_SUMMARY input file name collision Aug 8, 2023
@maxibor maxibor requested review from jfy133 and tillenglert August 8, 2023 10:06
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Aug 8, 2023

nf-core lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 1c41f35

+| ✅ 158 tests passed       |+
#| ❔   1 tests were ignored |#
!| ❗   3 tests had warnings |!
Details

❗ Test warnings:

  • pipeline_todos - TODO string in main.nf: Remove this line if you don't need a FASTA file [TODO: try and test using for --host_fasta and --host_genome]
  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline
  • pipeline_todos - TODO string in WorkflowMag.groovy: Optionally add in-text citation tools to this list.

❔ Tests ignored:

  • files_unchanged - File ignored due to lint config: lib/NfcoreTemplate.groovy

✅ Tests passed:

Run details

  • nf-core/tools version 2.10
  • Run at 2023-11-02 13:27:11

@jfy133 jfy133 requested review from CarsonJM and prototaxites August 8, 2023 12:11
Copy link
Copy Markdown
Member

@jfy133 jfy133 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor comments, but a more general one, won't most of the downstream steps that can take in output from bin refinement also need update to their prefixes (E.g. in modules.conf)?

Otherwise LGTM :)

Comment thread subworkflows/local/binning_refinement.nf Outdated
Comment thread workflows/mag.nf Outdated
Copy link
Copy Markdown
Contributor

@prototaxites prototaxites left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

One minor thought - I added params.cat_official_taxonomy before I really got to grips with how $args works. Now that CAT has a $args option, might it make sense to move that into there inside the config?

A few minor comments, but a more general one, won't most of the downstream steps that can take in output from bin refinement also need update to their prefixes (E.g. in modules.conf)?

I am not sure that there are clashes in my experience, but I don't think more detailed labelling of files hurts (except in user readability down the line). This might be something to think about more deeply when planning v3.0!

@CarsonJM
Copy link
Copy Markdown
Contributor

CarsonJM commented Aug 8, 2023

@prototaxites I think it would make sense to move params.cat_official_taxonomy into modules.conf!

Also, do you think it would be a good idea to standardize how prefixes are defined (currently some prefixes are defined in within the module's def prefix line and others seem to be defined in modules.conf under ext.prefix)? Perhaps there's a reason for this, but just a thought!

Copy link
Copy Markdown
Contributor

@CarsonJM CarsonJM left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

Copy link
Copy Markdown
Contributor

@tillenglert tillenglert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Had some minor comments/ideas though. Thanks for the implementation and Bugfix!

Comment thread nextflow.config Outdated
Comment thread workflows/mag.nf Outdated
@jfy133
Copy link
Copy Markdown
Member

jfy133 commented Aug 10, 2023

@maxibor (or anyone else) any thoughts on my original feedback:

<...> but a more general one, won't most of the downstream steps that can take in output from bin refinement also need update to their prefixes (E.g. in modules.conf)?

Comment thread CHANGELOG.md Outdated
@maxibor
Copy link
Copy Markdown
Member Author

maxibor commented Aug 24, 2023

Update: this is actually a more generalized issue with processes that perform a collect: CAT, but also Quast, and GTDB-Tk

@jfy133
Copy link
Copy Markdown
Member

jfy133 commented Aug 24, 2023

@maxibor (or anyone else) any thoughts on my original feedback:

<...> but a more general one, won't most of the downstream steps that can take in output from bin refinement also need update to their prefixes (E.g. in modules.conf)?

I guess this was correct then???

I wonder why this issue hasn't come up before...?!

@maxibor
Copy link
Copy Markdown
Member Author

maxibor commented Aug 24, 2023

It's a new issue because samples can now appear multiple times in downstream channels for the following reasons:

  • Split by domain (new, only happens when using domain classification)
  • Split by refined/not refined (only happens when using bin refinement)
  • Split by binned/unbinned (only happens when using --postbinning_input both)

When they're collected at the final steps (Quast, GTDB-Tk, or CAT), bins that were not renamed by the way they were treated now appear multiple times with same name, and triggers the input file name collision.

Which in addition was particularly challenging to debug because of a misreporting of the actual error nextflow-io/nextflow#3828

@prototaxites
Copy link
Copy Markdown
Contributor

prototaxites commented Aug 27, 2023

I’ve been thinking about this and I’m not sure we need the refined/unrefined label to fix this - this might specifically just be a bug in the code that assigns bins post-refinement.

If we think about the original sources of bins, we have bins coming from the three binners - metabat, maxbin, and concoct. Each of these bins should have a unique numbered file name. If refinement is enabled, these bins are fed into DAS tool and a new set of bins are returned - that are not necessarily the same as the input bins. These should also be uniquely named, differently to the original bins because their binner is listed as DAS tool.

If domain classification is enabled with bin refinement, then undefined eukaryotic bins are passed as “refined” at this stage as otherwise they will be lost, unless ‘both’ is set for post-binning input. I think this is the only time that a duplicate bin can slip through. So don’t we just have to fix the code so that eukaryotic bins are only fed to the refined bin channel if ‘both’ is not selected? Otherwise, every bin should be unique and only be able to enter each process downstream exactly once, because they might get split into separate channels, and recombined, but never duplicated?

@maxibor
Copy link
Copy Markdown
Member Author

maxibor commented Aug 28, 2023

A small schematic to illustrate the issue.
The bins do not need to be renamed, only the results of GTDB-Tk, Quast, and CAT. Otherwise the .collect() for the summary process of these three tools will fail with an input file name collision.

mag meta flowthrough_230828_161552

I've now update the meta's accordingly, and passed them through when needed.

@maxibor
Copy link
Copy Markdown
Member Author

maxibor commented Aug 28, 2023

There are now 4 possible values for the refinement meta:

  • unrefined for bins that won't be refined
  • unrefined_unbinned for the contigs that were not binned and will not be refined
  • dastool_refined for bins that were refined with DASTool
  • dastool_refined_unbinned for the unbinned processed with DASTool

Note that the input file name collision was not only be triggered by the lack of refinment tracking, but also by the lack of domain tracking.
So using all the meta to Quast/GTDB-TK/CAT output filenames will avoid this error (while not affecting bin names).

@jfy133
Copy link
Copy Markdown
Member

jfy133 commented Aug 28, 2023

That's a really helpful diagram @maxibor thank you very much!

@jfy133
Copy link
Copy Markdown
Member

jfy133 commented Oct 1, 2023

I am currently downloading the latest GTDB database, then I'll do a manual 'full test' and if all looks good then we can merge!

@jfy133
Copy link
Copy Markdown
Member

jfy133 commented Oct 2, 2023

Self notes:

  • CheckM looks good: summary seems to be raw bins (MEGAHIT-MetaBAT2-test_minigut_sample2.unbinned.1) but per CheckM run looks good1
  • QUAST looks good: summary seems to be raw bins (MEGAHIT-CONCOCT-test_minigut_67.fa) but per QUAST run looks good1
  • CAT looks good: summary seems to be raw bins (MEGAHIT-CONCOCT-test_minigut_sample2_24.fa) but per CAT run output has correct names1
  • GTDBTK looks good: couldn't test due to db issue

This can be merged once I've done one test without domain classsifcation, and I'll follow up with a few other things I noticed with a separate PR

Footnotes

  1. Indeed because they do report on physical binned collection of bin FASTAs rather than the file itself? When looking in the work directory of QUAST_BINS_SUMMARY all the files are there, but the contents of the files have the original bin names 2 3

@jfy133
Copy link
Copy Markdown
Member

jfy133 commented Oct 2, 2023

I propose merge this, patch release, then I'll make issues and a further patch release of the other more minor things I found (once i've had a chance to look into them)

@jfy133
Copy link
Copy Markdown
Member

jfy133 commented Oct 3, 2023

I'm happy with this now after some testing! Just need pull upstream from dev and CHANGELOG update :)

@jfy133
Copy link
Copy Markdown
Member

jfy133 commented Oct 4, 2023

SPeaking privately apparently Maxime found more issues (but is currently on holiday)

@jfy133
Copy link
Copy Markdown
Member

jfy133 commented Oct 30, 2023

Ok, there are multiple issues but not related directly to problem this PR is solving.

We will break this down bit-by-bit so merging this now, and fixes will come in over time 👍

Thanks again @maxibor !

EDIT: once the latest dev branch has been pulled into this PR and changelog updated!

@jfy133
Copy link
Copy Markdown
Member

jfy133 commented Oct 30, 2023

@nf-core-bot fix linting

Comment thread subworkflows/local/depths.nf Outdated
@jfy133
Copy link
Copy Markdown
Member

jfy133 commented Nov 1, 2023

Just need to remove some dumps, but hopefully the .flatten() of the bins going into depth fixes the bug, then will merge

Comment thread subworkflows/local/busco_qc.nf Outdated
Comment thread subworkflows/local/depths.nf Outdated
Comment thread subworkflows/local/depths.nf Outdated
Comment thread subworkflows/local/gtdbtk.nf Outdated
@maxibor
Copy link
Copy Markdown
Member Author

maxibor commented Nov 2, 2023

Long time in the making update:
That'll would be the pattern for better solution regarding the input file name collision

new_channel = channel_name.collectFile(keepHeader: true) {
        meta, f ->
        [f.getBaseName()+'.tsv', f]
    }
WHATEVER_SUMMARY_PROCESS(new_channel.collect())

This pattern takes all the files from the channel_name channel that share the same name and concatenates them into one (and keepHeader makes sure only header is written).
It's the equivalent of

cat a/test.tsv  b/test.tsv > test.tsv

@jfy133
Copy link
Copy Markdown
Member

jfy133 commented Nov 3, 2023

Long time in the making update: That'll would be the pattern for better solution regarding the input file name collision

new_channel = channel_name.collectFile(keepHeader: true) {
        meta, f ->
        [f.getBaseName()+'.tsv', f]
    }
WHATEVER_SUMMARY_PROCESS(new_channel.collect())

This pattern takes all the files from the channel_name channel that share the same name and concatenates them into one (and keepHeader makes sure only header is written). It's the equivalent of

cat a/test.tsv  b/test.tsv > test.tsv

Hm interesting... but do we want to collapse those bins back into a single bin again? I'm pretty sure of the tools rely on the file name to disambiguate between bins so I'm worried we would then not be able to distinguish between eukaryotic vs non-eukaryotic bins in the summary tools.

Furthermore, does it support gzipped files? The original error you reported are gzipped output from CAT and I don't see reference to that in the NXF docs on collectFile

That said, I may be being overly paranoid there. So I suggest that I merge this current PR as a conservative approach (making sure everything is very explicitly independent). But then I will make an issue describing what you're suggesting as an enhancement - as other than my (maybe unfounded) worry above - that would indeed be an elegant solution 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants