Keep large unbinned contigs for downstream steps #29
Conversation
Checks seem to fail because of internet connectivity? Weird.
Ok, BUSCO is obviously now available in v4beta (we are using v3) and the location of the database changed. Hope I fixed this now. The default database is just 8 MB; maybe we could either upgrade to v4 or store the v3 default database in nf-core/test-datasets so that we do not lose it again... edit: seriously, the file seems to have changed as well? Wow.
Do you mind if I review/merge this after 1.0.0? I'd like the first release to be done before Christmas 😄 Concerning BUSCO, I changed the URL in dev; I can make a PR to update to v4 (after 1.0.0! 😁)
Fine for me, but I think that's a major flaw. The solution proposed here works fine with test data, but a pooled bin that is produced right now and forwarded to downstream processes could be too big with real data and need a change. If you manage to get your release out before I optimize this step (on holiday right now, not working on it atm), go ahead ;)
A pooled bin being too big seems like a separate issue? Since you are not forwarding the unbinned contigs to downstream processes? (Happy holidays!)
D4straub v0.2
Still linting errors, but coming closer ...

No errors any more; review please.
```python
out_base = (os.path.splitext(input_file)[0])

# Read file
fasta_sequences = SeqIO.parse(open(input_file),'fasta')
```
Files should be opened with the `with` context manager. That way the file handles will close automatically at the end of the scope.
i.e.

```python
with open(input_file) as f:
    fasta_sequences = SeqIO.parse(f, 'fasta')
    # rest of the code that needs the file to be open
```
```diff
  jgi_summarize_bam_contig_depths --outputDepth depth.txt ${bam}
- metabat2 -t "${task.cpus}" -i "${assembly}" -a depth.txt -o "MetaBAT2/${name}" -m ${min_size}
+ metabat2 -t "${task.cpus}" -i "${assembly}" -a depth.txt -o "MetaBAT2/${name}" -m ${min_size} --seed 1 --unbinned
```
Shouldn't there rather be no seed by default, but an option to fix the seed?
My aim was to make sure that every run of the workflow creates the same results, so I fixed the seed here. But later I realized that at least SPAdes does not produce identical results; obviously there is also some randomness in there, and I haven't seen an option to fix the seed for it. So I no longer feel that it is necessary at all.
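For illustration only (plain Python, not MetaBAT2 or SPAdes): fixing a seed makes a randomized step reproducible across runs, which is what `--seed 1` aims for; with no seed, the order varies between runs. `shuffle_contigs` is a hypothetical stand-in for any randomized binning step.

```python
import random

def shuffle_contigs(contigs, seed=None):
    """Shuffle contig names; a fixed seed gives an identical order on every run."""
    rng = random.Random(seed)  # seed=None falls back to system entropy
    shuffled = list(contigs)
    rng.shuffle(shuffled)
    return shuffled
```

An optional parameter defaulting to `None`, as sketched here, matches the reviewer's suggestion: unseeded by default, reproducible on demand.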
This solves #27