Add taxonomic classification for bins with gtdb-tk #178

Merged
skrakau merged 32 commits into nf-core:dev from skrakau:add_gtdbtk
May 12, 2021
@skrakau skrakau commented Mar 25, 2021

I experimented with GTDB-Tk for taxonomic classification as an alternative to CAT. Advantages:

  • previous GTDB references stay accessible (CAT DBs only upon request currently)
  • it will be clear which GTDB references are compatible with which GTDB-Tk version, i.e. it is guaranteed that every pipeline version (with its specific GTDB-Tk version) can be run at all times

Disadvantages:

  • only uses marker genes, so it will not work for smaller bins

TODOs:

  • update BUSCO and make use of auto lineage functionality
  • use --force to continue processing if an error occurs on a single genome (since *.classify.tree output would be interesting, parallelising over individual bins is likely not a good option)
  • merge summary with Busco and quast summaries

(addresses #173)

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - add to the software_versions process and a regex to scrape_software_versions.py
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/mag branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint .).
  • Ensure the test suite passes (nextflow run . -profile test,docker).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

Comment thread main.nf Outdated
github-actions Bot commented May 3, 2021

nf-core lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 0e4f1b7

  • ✅ 123 tests passed
  • ❔ 5 tests were ignored
  • ❗ 73 tests had warnings

❗ Test warnings:

  • files_exist - File not found: environment.yml
  • files_exist - File not found: Dockerfile
  • nextflow_config - Config variable not found: process.container
  • params_used - Config variable not found in main.nf: params.input
  • params_used - Config variable not found in main.nf: params.single_end
  • params_used - Config variable not found in main.nf: params.save_trimmed_fail
  • params_used - Config variable not found in main.nf: params.mean_quality
  • params_used - Config variable not found in main.nf: params.trimming_quality
  • params_used - Config variable not found in main.nf: params.keep_phix
  • params_used - Config variable not found in main.nf: params.phix_reference
  • params_used - Config variable not found in main.nf: params.host_fasta
  • params_used - Config variable not found in main.nf: params.host_genome
  • params_used - Config variable not found in main.nf: params.host_removal_verysensitive
  • params_used - Config variable not found in main.nf: params.host_removal_save_ids
  • params_used - Config variable not found in main.nf: params.binning_map_mode
  • params_used - Config variable not found in main.nf: params.skip_binning
  • params_used - Config variable not found in main.nf: params.min_contig_size
  • params_used - Config variable not found in main.nf: params.min_length_unbinned_contigs
  • params_used - Config variable not found in main.nf: params.max_unbinned_contigs
  • params_used - Config variable not found in main.nf: params.coassemble_group
  • params_used - Config variable not found in main.nf: params.spades_options
  • params_used - Config variable not found in main.nf: params.megahit_options
  • params_used - Config variable not found in main.nf: params.skip_spades
  • params_used - Config variable not found in main.nf: params.skip_spadeshybrid
  • params_used - Config variable not found in main.nf: params.skip_megahit
  • params_used - Config variable not found in main.nf: params.skip_quast
  • params_used - Config variable not found in main.nf: params.centrifuge_db
  • params_used - Config variable not found in main.nf: params.kraken2_db
  • params_used - Config variable not found in main.nf: params.skip_krona
  • params_used - Config variable not found in main.nf: params.cat_db
  • params_used - Config variable not found in main.nf: params.gtdb
  • params_used - Config variable not found in main.nf: params.gtdbtk_min_completeness
  • params_used - Config variable not found in main.nf: params.gtdbtk_max_contamination
  • params_used - Config variable not found in main.nf: params.gtdbtk_min_perc_aa
  • params_used - Config variable not found in main.nf: params.gtdbtk_min_af
  • params_used - Config variable not found in main.nf: params.gtdbtk_pplacer_cpus
  • params_used - Config variable not found in main.nf: params.gtdbtk_pplacer_scratch
  • params_used - Config variable not found in main.nf: params.skip_adapter_trimming
  • params_used - Config variable not found in main.nf: params.keep_lambda
  • params_used - Config variable not found in main.nf: params.longreads_min_length
  • params_used - Config variable not found in main.nf: params.longreads_keep_percent
  • params_used - Config variable not found in main.nf: params.longreads_length_weight
  • params_used - Config variable not found in main.nf: params.lambda_reference
  • params_used - Config variable not found in main.nf: params.skip_busco
  • params_used - Config variable not found in main.nf: params.busco_reference
  • params_used - Config variable not found in main.nf: params.busco_download_path
  • params_used - Config variable not found in main.nf: params.busco_auto_lineage_prok
  • params_used - Config variable not found in main.nf: params.save_busco_reference
  • params_used - Config variable not found in main.nf: params.megahit_fix_cpu_1
  • params_used - Config variable not found in main.nf: params.spades_fix_cpus
  • params_used - Config variable not found in main.nf: params.spadeshybrid_fix_cpus
  • params_used - Config variable not found in main.nf: params.metabat_rng_seed
  • params_used - Config variable not found in main.nf: params.multiqc_config
  • params_used - Config variable not found in main.nf: params.multiqc_title
  • params_used - Config variable not found in main.nf: params.max_multiqc_email_size
  • params_used - Config variable not found in main.nf: params.skip_multiqc
  • params_used - Config variable not found in main.nf: params.outdir
  • params_used - Config variable not found in main.nf: params.tracedir
  • params_used - Config variable not found in main.nf: params.publish_dir_mode
  • params_used - Config variable not found in main.nf: params.email
  • params_used - Config variable not found in main.nf: params.email_on_fail
  • params_used - Config variable not found in main.nf: params.plaintext_email
  • params_used - Config variable not found in main.nf: params.enable_conda
  • params_used - Config variable not found in main.nf: params.singularity_pull_docker_container
  • params_used - Config variable not found in main.nf: params.validate_params
  • params_used - Config variable not found in main.nf: params.hostnames
  • params_used - Config variable not found in main.nf: params.config_profile_description
  • params_used - Config variable not found in main.nf: params.config_profile_contact
  • params_used - Config variable not found in main.nf: params.config_profile_url
  • params_used - Config variable not found in main.nf: params.max_memory
  • params_used - Config variable not found in main.nf: params.max_cpus
  • params_used - Config variable not found in main.nf: params.max_time
  • schema_description - No description provided in schema for parameter: skip_multiqc

❔ Tests ignored:

  • files_unchanged - File ignored due to lint config: lib/NfcoreSchema.groovy
  • files_unchanged - File does not exist: .github/workflows/push_dockerhub_dev.yml
  • files_unchanged - File does not exist: .github/workflows/push_dockerhub_release.yml
  • conda_env_yaml - No environment.yml file found - skipping conda_env_yaml test
  • conda_dockerfile - No environment.yml / Dockerfile file found - skipping conda_dockerfile test

✅ Tests passed:

Run details

  • nf-core/tools version 1.14
  • Run at 2021-05-11 15:31:16

skrakau commented May 5, 2021

Here is another try:

  • update to GTDB-Tk v1.5.0
    • contains an update regarding the handling of failed genomes
  • created subworkflow for bin filtering, GTDB-Tk classification and summary generation
  • added a bin filtering step based on BUSCO QC results, mandatory currently (completeness > 0.0)!
  • did not use the GTDB-Tk parameter --force in the end
    • GTDB-Tk can handle failed genomes (without --force) in certain cases, e.g. if prodigal finds no genes or if the fraction of amino acids in the MSA is too low. These bins will be reported accordingly.
    • if it fails for other reasons that GTDB-Tk currently does not handle, e.g. because no marker gene could be detected, these bins will currently not be reported accordingly
    • to keep control, I assume that only bins with a certain completeness reported by BUSCO will be used, which should avoid this problem. If a GTDB-Tk error occurs anyway, the mag pipeline fails with an error
  • changed the process merge_quast_and_busco to merge_quast_busco_gtdbtk:
    • combine the BUSCO, QUAST and GTDB-Tk summaries if at least two of them are available
    • -> results/GenomeBinning/bin_summary.tsv
    • moved generation of quast_summary.tsv to separate process QUAST_BINS_SUMMARY
  • add process GTDBTK_SUMMARY to create gtdbtk_summary.tsv, containing all bins (also bins filtered out based on BUSCO QC and by GTDB-Tk)
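
The merging rule for bin_summary.tsv described in the list above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the pipeline's actual script: the helper names `read_tsv` and `merge_summaries`, the key column name `bin`, and the example columns and values are all hypothetical. Only the rule "combine the summaries if at least two are available" comes from the description.

```python
import csv
import io


def read_tsv(text):
    """Parse TSV text into a dict mapping bin name -> row (dict).

    Assumes a key column called 'bin' (hypothetical name).
    """
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["bin"]: row for row in reader}


def merge_summaries(*summaries):
    """Outer-join per-bin summaries on the assumed key column 'bin'.

    Each argument is a dict as returned by read_tsv, or None when the
    corresponding tool was skipped. Returns None unless at least two
    summaries are available.
    """
    available = [s for s in summaries if s is not None]
    if len(available) < 2:
        return None

    # Union of all columns, keeping first-seen order.
    columns = ["bin"]
    for summary in available:
        for row in summary.values():
            for col in row:
                if col not in columns:
                    columns.append(col)

    # Union of bin names; a bin missing from one summary gets empty fields.
    bins = sorted({name for summary in available for name in summary})
    merged = []
    for name in bins:
        out = {col: "" for col in columns}
        out["bin"] = name
        for summary in available:
            out.update(summary.get(name, {}))
        merged.append(out)
    return merged


# Made-up example values, for illustration only.
busco = read_tsv("bin\tcompleteness\nbin.1.fa\t95.2\nbin.2.fa\t12.0\n")
gtdbtk = read_tsv("bin\tclassification\nbin.1.fa\td__Bacteria\n")
rows = merge_summaries(busco, None, gtdbtk)  # QUAST skipped in this example
```

An outer join is used here so that bins filtered out before GTDB-Tk still appear in the merged table with empty classification fields. Whether the real process behaves this way is an assumption.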

A remaining question is which GTDB-Tk results exactly to report in the summary file. Currently:

user_genome     classification  fastani_reference       fastani_ani     fastani_af      closest_placement_reference     closest_placement_ani   closest_placement_af    classification_method   msa_percent     red_value       warnings

(see https://ecogenomics.github.io/GTDBTk/files/summary.tsv.html#summary-tsv)
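
As a rough illustration of how a summary restricted to these columns, but covering all bins, might be assembled: the sketch below is hypothetical and not the script from this PR. The function name `build_gtdbtk_summary`, its arguments, and the choice to leave all fields empty for bins without a GTDB-Tk result are assumptions. It also assumes that the bin names passed in match the `user_genome` values exactly.

```python
import csv
import io

# Columns currently reported in the summary, as listed above.
COLUMNS = [
    "user_genome", "classification", "fastani_reference", "fastani_ani",
    "fastani_af", "closest_placement_reference", "closest_placement_ani",
    "closest_placement_af", "classification_method", "msa_percent",
    "red_value", "warnings",
]


def build_gtdbtk_summary(gtdbtk_tsv, all_bins):
    """Return TSV text with one row per bin in `all_bins`.

    `gtdbtk_tsv` is the text of a GTDB-Tk summary table. Columns not in
    COLUMNS are dropped. Bins without a GTDB-Tk row (e.g. filtered out
    beforehand) keep only their name; all other fields stay empty.
    """
    classified = {
        row["user_genome"]: row
        for row in csv.DictReader(io.StringIO(gtdbtk_tsv), delimiter="\t")
    }
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=COLUMNS,
        delimiter="\t",
        extrasaction="ignore",  # silently drop columns not in COLUMNS
        restval="",             # fill missing columns with empty strings
        lineterminator="\n",
    )
    writer.writeheader()
    for name in all_bins:
        writer.writerow(classified.get(name, {"user_genome": name}))
    return out.getvalue()
```

Called with the text of a GTDB-Tk summary and the full list of bin names, this returns a table in which filtered bins are still listed, which matches the stated goal of a summary "containing all bins".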

And I need to reduce the memory consumption and test out a few parameters for this ...

d4straub commented May 5, 2021

> A remaining question is which GTDB-Tk results exactly to report in the summary file. Currently:

That is fine in my opinion. Better a few too many columns than too little information.

@skrakau skrakau marked this pull request as ready for review May 7, 2021 17:22

@d4straub d4straub left a comment

That's a great addition to the pipeline!

Comment thread bin/summary_gtdbtk.py Outdated
skrakau commented May 11, 2021

Great, thanks a lot @d4straub for the feedback and review!

@skrakau skrakau merged commit 22f6dbd into nf-core:dev May 12, 2021
@skrakau skrakau deleted the add_gtdbtk branch July 22, 2022 09:21