
Fix split_fasta.py bug and improve runtime #175

Merged
skrakau merged 3 commits into nf-core:dev from skrakau:improve_splitting_unbinned
Mar 24, 2021

Member

@skrakau skrakau commented Mar 24, 2021

There were two problems with the split_fasta.py script:

Bug:

  • sort_values() was called on df, but its result was never stored, so df stayed unsorted. I added inplace=True.
  • Even after sorting, the index still refers to each row's position before sorting. Currently the script stops writing sequences to individual files after max_sequences, although the first max_sequences were not the longest and do not necessarily have a length >= length_threshold. Sequences are therefore missed. To address this, I reset the index.
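
To illustrate the two points above, here is a minimal pandas sketch. It is not the code from split_fasta.py: the toy frame and its column names (id, length) are assumptions made up for this example. It only shows that sort_values() returns a sorted copy unless inplace=True is passed (or the result is assigned), and that the index keeps the pre-sort labels until reset_index() is called.

```python
import pandas as pd

# Hypothetical toy table; the column names are assumptions for illustration.
df = pd.DataFrame({"id": ["a", "b", "c", "d"],
                   "length": [500, 3000, 1200, 8000]})

# Without assignment or inplace=True, the sorted copy is simply discarded.
df.sort_values(by="length", ascending=False)
unsorted_lengths = df["length"].tolist()  # still in input order

# Sort in place: rows are reordered, but the index keeps the old labels.
df.sort_values(by="length", ascending=False, inplace=True)
stale_index = df.index.tolist()

# Renumber the rows so that index i is the rank by length (0 = longest).
df.reset_index(drop=True, inplace=True)
fresh_index = df.index.tolist()
sorted_ids = df["id"].tolist()
```

Any logic that compares the index against max_sequences only behaves as a rank cutoff once the index has been reset as in the last step.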

Runtime (see #166):

  • sort_values() is applied to all sequences (O(n log n)). This is not necessary, since one can first separate out all sequences below the length_threshold. One then sorts only the sequences >= length_threshold, takes the longest max_sequences, and adds the remaining ones to the pooled list.
  • The file with the unbinned sequences, which previously required > 300 h, now runs through in a couple of minutes.
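
The filter-then-sort strategy described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the script's implementation: the function name partition_sequences and the (name, sequence) tuple layout are hypothetical, while length_threshold and max_sequences are the parameter names used in this description.

```python
def partition_sequences(records, length_threshold, max_sequences):
    """Split (name, sequence) tuples into individual and pooled lists.

    Hypothetical helper for illustration only.
    """
    long_enough = []
    pooled = []
    # One linear pass: short sequences go straight to the pool, unsorted.
    for record in records:
        if len(record[1]) >= length_threshold:
            long_enough.append(record)
        else:
            pooled.append(record)

    # Sort only the candidates that passed the threshold, longest first.
    long_enough.sort(key=lambda record: len(record[1]), reverse=True)

    # The longest max_sequences get their own files; the rest are pooled.
    individual = long_enough[:max_sequences]
    pooled.extend(long_enough[max_sequences:])
    return individual, pooled


records = [("s1", "A" * 5), ("s2", "A" * 50), ("s3", "A" * 20),
           ("s4", "A" * 30), ("s5", "A" * 2)]
individual, pooled = partition_sequences(records, length_threshold=10,
                                         max_sequences=2)
individual_names = [name for name, _ in individual]
pooled_names = [name for name, _ in pooled]
```

The sort cost then depends only on the number of sequences at or above length_threshold rather than on the total number of sequences.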

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - add to the software_versions process and a regex to scrape_software_versions.py
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/mag branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint .).
  • Ensure the test suite passes (nextflow run . -profile test,docker).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@skrakau skrakau requested a review from d4straub March 24, 2021 10:02

github-actions Bot commented Mar 24, 2021

nf-core lint overall result: Passed ✅ ⚠️

Posted for pipeline commit e43c974

  • ✅ 117 tests passed
  • ❔ 4 tests were ignored
  • ❗ 4 tests had warnings

❗ Test warnings:

❔ Tests ignored:

  • files_unchanged - File does not exist: .github/workflows/push_dockerhub_dev.yml
  • files_unchanged - File does not exist: .github/workflows/push_dockerhub_release.yml
  • conda_env_yaml - No environment.yml file found - skipping conda_env_yaml test
  • conda_dockerfile - No environment.yml / Dockerfile file found - skipping conda_dockerfile test

✅ Tests passed:

Run details

  • nf-core/tools version 1.13.2
  • Run at 2021-03-24 10:17:05

Collaborator

@d4straub d4straub left a comment


LGTM

@skrakau skrakau merged commit 7cfc745 into nf-core:dev Mar 24, 2021
@skrakau skrakau deleted the improve_splitting_unbinned branch May 31, 2021 13:40
2 participants