[WIP] Prototype: count k-mers in banded mode in a single pass#1818
[WIP] Prototype: count k-mers in banded mode in a single pass#1818
Conversation
The procedure above will result in the same output as the Python commands below, which do 4 passes over the reads. for i in range(4):
counts = khmer.Counttable(25, 1e7 / 4, 4)
counts.consume_seqfile_banding('reads.fq.gz', 4, i)
of = 'bandtest{}.ct'.format(i)
counts.save(of) |
|
For my education: why split into bands but then create N counttables? I understand "let's make N bands because we have far too much data anyway" but that doesn't fit with then making N bands. Tim is confused.com |
|
We are investigating some use cases where we compute only on a single band, but there is at least one use case (motivated by kevlar and variant discovery) where we want to compute on each band independently. In this case, the banding provides us a way of randomly sharding the k-mers into N disjoint sets, each of which only requires 1/N of the memory required to store the entire data set. This PR is looking for an efficient way to compute counts for all bands in a single first pass, rather than N passes. |
This PR is a Python prototype of the idea described in #1800. In brief, the idea is to count k-mers in the input reads and populate N counttables (each with ≈1/N of the k-mers) in a single pass.
This prototype implements a k-mer buffer that, when "full", writes to disk and flushes/resets. Once all the input reads are processed, the temp files are read back from disk 1-by-1 to populate a Counttable for each band.
The prototype works as expected, but the performance is painfully slow on large data sets. It would need a Cython or C++ implementation with threading support to be viable methinks. I would welcome any thoughts. cc @camillescott @betatim @ctb @luizirber
make testDid it pass the tests?make clean diff-coverIf it introduces new functionality inscripts/is it tested?make format diff_pylint_report cppcheck doc pydocstyleIs it wellformatted?
additions are allowed without a major version increment. Changing file
formats also requires a major version number increment.
documented in
CHANGELOG.md? See keepachangelogfor more details.
changes were made?
tested for streaming IO?)