Skip to content

tax_classification_taxids is the channel that needs collect for CAT_summary#433

Closed
maxibor wants to merge 0 commit intonf-core:devfrom
maxibor:dev
Closed

tax_classification_taxids is the channel that needs collect for CAT_summary#433
maxibor wants to merge 0 commit intonf-core:devfrom
maxibor:dev

Conversation

@maxibor
Copy link
Copy Markdown
Member

@maxibor maxibor commented May 5, 2023

When running mag with the CAT contig taxonomic annotation tool, I stumbled upon the following error:

 Workflow execution completed unsuccessfully

Error executing process > 'NFCORE_MAG:MAG:CAT_SUMMARY'

Caused by:
  Process `NFCORE_MAG:MAG:CAT_SUMMARY` input file name collision -- There are multiple input files for each of the following file names: MEGAHIT-DASTool-SZG018.ORF2LCA.names.txt.gz, MEGAHIT-DASTool-SZG023.ORF2LCA.names.txt.gz, MEGAHIT-DASTool-SZG023.bin2classification.names.txt.gz, MEGAHIT-DASTool-SZG017.ORF2LCA.names.txt.gz, MEGAHIT-DASTool-SZG024.ORF2LCA.names.txt.gz, MEGAHIT-DASTool-SZG017.bin2classification.names.txt.gz, MEGAHIT-DASTool-SZG022.bin2classification.names.txt.gz, MEGAHIT-DASTool-SZG025.ORF2LCA.names.txt.gz, MEGAHIT-DASTool-SZG020.ORF2LCA.names.txt.gz, MEGAHIT-DASTool-SZG018.bin2classification.names.txt.gz, MEGAHIT-DASTool-SZG025.bin2classification.names.txt.gz, MEGAHIT-DASTool-SZG020.bin2classification.names.txt.gz, MEGAHIT-DASTool-SZG024.bin2classification.names.txt.gz, MEGAHIT-DASTool-SZG022.ORF2LCA.names.txt.gz

Looking more closely at the source of this error, I believe it because the CAT_SUMMARY process only expects .bin2classification.names.txt.gz while there are both .ORF2LCA.names.txt.gz and .bin2classification.names.txt.gz in the tax_classification channel. The tax_classification_taxids is the one containing only the .bin2classification.names.txt.gz files.

PR checklist

  • This comment contains a description of changes (with reason).
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@maxibor maxibor added the bug Something isn't working label May 5, 2023
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 5, 2023

nf-core lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 119b220

+| ✅ 158 tests passed       |+
#| ❔   1 tests were ignored |#
!| ❗   1 tests had warnings |!
Details

❗ Test warnings:

  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your prefered methods description, e.g. add publication citation for this pipeline

❔ Tests ignored:

  • files_unchanged - File ignored due to lint config: lib/NfcoreTemplate.groovy

✅ Tests passed:

Run details

  • nf-core/tools version 2.8
  • Run at 2023-06-19 14:09:15

@maxibor maxibor requested a review from jfy133 May 5, 2023 14:56
@maxibor
Copy link
Copy Markdown
Member Author

maxibor commented May 5, 2023

Copy link
Copy Markdown
Member

@jfy133 jfy133 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I've not touched CAT before. Will give some time for @d4straub or @skrakau to shout if something is wrong before merging :)

@d4straub
Copy link
Copy Markdown
Collaborator

d4straub commented May 9, 2023

Hm, there seems to be a difference between
[assembler]-[binner]-[sample/group].bin2classification.names.txt.gz: Taxonomy classification of the genome bins, with full lineage names
and
[assembler]-[binner]-[sample/group].bin2classification.txt.gz: Taxonomy classification of the genome bins
according to https://nf-co.re/mag/2.3.0/output#taxonomic-classification-of-binned-genomes
I do not remember though whether that is relevant by changing for example the summary files. Did you check?
If it does change resulting files, maybe the two different *names* files could be put into separate output channels?

@jfy133 jfy133 added this to the 2.4.0 milestone May 26, 2023
@maxibor
Copy link
Copy Markdown
Member Author

maxibor commented May 30, 2023

I checked again, and CAT_SUMMARY is expecting the *bin2classification.names.txt.gz files. I've updated the PR accordingly @d4straub .

Copy link
Copy Markdown
Member

@jfy133 jfy133 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Config should be reactivated now (see here), but otherwise LGTM,

I do notice that ORF2LCA and bin2classification is not any more picked up from raw/, - but I assume you've tested this

Comment thread CHANGELOG.md Outdated
@@ -30,6 +30,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- [#439](https://github.com/nf-core/mag/pull/445) - Fix bug in assembly input (by @prototaxites)
- [#447](https://github.com/nf-core/mag/pull/447) - Remove `default: None` from parameter schema (by @drpatelh)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

@tillenglert
Copy link
Copy Markdown
Contributor

tillenglert commented Jul 11, 2023

If I may add:

Looking more closely at the source of this error, I believe it because the CAT_SUMMARY process only expects .bin2classification.names.txt.gz while there are both .ORF2LCA.names.txt.gz and .bin2classification.names.txt.gz in the tax_classification channel. The tax_classification_taxids is the one containing only the .bin2classification.names.txt.gz files.

I tracked the problem not to the duality of the input (.ORF2LCA.names.txt.gz and .bin2classification.names.txt.gz) but instead to the naming of the DASTool output/input for CAT. The unbinned input for CAT resulting from DASTool refinement is named [Assembler]-DASTool-[prefix].*, which is the same name for the binned input. Therefore creating not only the collision for CAT_SUMMARY but also overwriting the output in the results dir.

The log from CAT includes the number of bins (which should be always 1 for DASTool Unbinned, I think?) and one could probably grep this number to rename the output of CAT accordingly.

I hope this helps! :)

@jfy133
Copy link
Copy Markdown
Member

jfy133 commented Jul 11, 2023

DASTool refinement is named [Assembler]-DASTool-[prefix].*, which is the same name for the binned input.

@tillenglert coudl you confirm that the RENAME_POSTDASTOOL module was executed?

If I understand what you identified, this should be addressed by this module...?

@tillenglert
Copy link
Copy Markdown
Contributor

tillenglert commented Jul 11, 2023

Yes I can confirm that this module was executed and the file is named [Assembler]-DASToolUnbinned-[prefix].fa

The problem is not in the input to CAT but the output! I guess: meta.binner is still the same for *Refined and *DASToolUnbinned which then creates the same name in the CAT process.

Sorry I guess I have written this a bit weird in the top comment!

@jfy133
Copy link
Copy Markdown
Member

jfy133 commented Jul 11, 2023

Ahh got you! Ok maybe that indeed makes sense, sorry:


    withName: CAT {
        publishDir = [
            path: { "${params.outdir}/Taxonomy/CAT/${meta.assembler}/${meta.binner}" },
            mode: params.publish_dir_mode,
            saveAs: { filename -> filename.equals('versions.yml') ? null : filename }
        ]
    }
    ```
    
  We would need to investigate if we ebed into the meta whether teh files are unbinned or not I guess, and then can insert that into the file name for CAT... 
  
  I see CAT currently doesn't use a prefix, but rather hard codes the file names in the module (another local module..) so this makes much more sense now :face_exhaling: 

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants