i have the following problem: i read keys from a (level)DB; for each key i see, i want to open another readstream in which i look for more data related to that first key, and write that data back into the stream. my data collection is rather big, so there are certainly many chunks of raw bytes read and written before one run finishes. this tends to hide the obnoxious fact that, sometimes, some pieces of data appear to get lost in the process.
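for concreteness, here is a minimal synchronous sketch of the lookup pattern i mean; two in-memory maps stand in for the two levelDB stores, and all the names and data are made up:

```javascript
// hypothetical stand-ins: in the real setup each of these would be
// a levelDB createReadStream() over a key range
const primary = ['k1', 'k2'];
const related = new Map([
  ['k1', ['k1-a', 'k1-b']],
  ['k2', ['k2-a']],
]);

const out = [];
for (const key of primary) {
  // here i would really open a second readstream keyed on `key` and
  // write whatever it yields back into the pipeline
  for (const record of related.get(key) || []) out.push(record);
}
// out is now ['k1-a', 'k1-b', 'k2-a']
```

with plain loops this is trivially correct; the trouble only starts once each lookup is an asynchronous stream of its own.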
i've written a library called Pipedreams to help me abstract away the dirty details of stream handling; it mainly relies on `through` to get its stuff done. i've been building and using that library for a couple of months now, and since i knew even less about streams back when i started, it surely contains bugs (when you compare the code of individual functions in that library, you can see the different strata of comprehension and skill i had at my disposal when writing each one; time for an overhaul, i guess).
i have illustrated my specific use case over at https://gist.github.com/loveencounterflow/65fd8ec711cf78950aa0; i'm using short files there instead of levelDB, but the problem remains the same as far as i can see.
as i see it, the problem is that when i open a readstream and pipe it through transformers, everything goes fine as long as the transformers are all synchronous; but when one transformer has to do something asynchronous (like opening a file to read lines), then its results may arrive too late at the pipe's destination stream. it is as if the `end` or `close` event of the source stream were passed straight through to the destination, which then closes too, leaving some data stranded. the gist's output clearly shows that all processing is indeed performed, but it never reaches its destination. i suspect that if the sample files were much longer (so they'd cause multiple chunks to be emitted), the earlier processing results would indeed show up at the destination.
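to make that suspicion concrete, here is a tiny self-contained sketch; no real streams, just a hand-rolled task queue standing in for the event loop, showing how an 'end' that is forwarded synchronously can beat results that only materialize on a later tick:

```javascript
// a hand-rolled task queue stands in for the event loop; `later` plays
// the role of setImmediate, `drain` the role of node running the loop
const tasks = [];
const later = (fn) => tasks.push(fn);
const drain = () => { while (tasks.length) tasks.shift()(); };

// an asynchronous transformer: its result is only ready on a later tick
const asyncTransform = (chunk, push) => later(() => push(chunk.toUpperCase()));

// a naive pipe: it considers the stream 'ended' as soon as the source
// is exhausted, without waiting for outstanding asynchronous results
const received = [];
['a', 'b'].forEach((c) => asyncTransform(c, (r) => received.push(r)));
const itemsAtEnd = received.length; // 0: 'end' beat every async result

drain(); // the results do arrive, but only after 'end' has already fired
// received is now ['A', 'B'], yet itemsAtEnd was 0
```

this is exactly the shape of my symptom: all the processing happens, but the destination has already been told the stream is over.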
we all know that Node won't exit by itself as long as there's still something on the event loop, so you can use `setTimeout( fn, 1e6 )` to force the process not to terminate. i also tried `setImmediate` to make data passing 'more asynchronous' (you can see i'm guessing here: what does it really mean that `through` is 'synchronous'? does that mean it is inherently unsuited for asynchronous tasks? do i have to use `ES.readable` instead?), and indeed with that trick i managed to get some (but crucially not all) data items through.
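for reference, the keep-alive trick i mean is just this (the `hasRef()` call is only there to show that the timer really does pin the loop):

```javascript
// a long-running timer keeps the event loop non-empty, so node will not
// exit even when all stream processing has (seemingly) finished
const keepAlive = setTimeout(() => {}, 1e6);
const pinsTheLoop = keepAlive.hasRef(); // true while the timer is active

// once the real work is known to be done, release the loop again so the
// process can terminate normally
clearTimeout(keepAlive);
```

of course this only masks the symptom: the process stays alive, but it doesn't make late data reach a destination that has already closed.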
from my intense dabbling with NodeJS streams for the better part of this year, i'd say that when i open a readstream and call `s.pause()` on it (say, from within a piped transformer), that should make NodeJS hang indefinitely until i call `s.resume()`; otherwise i haven't paused at all. that seemingly does not work; also, you have to be careful where to put the `pause` call to avoid getting an `unable to get back to old streams mode at this point` error.
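to spell out the contract i expect from pause/resume, here is a toy source (not a real node stream, just made-up helpers) that behaves the way i think a paused stream should: nothing at all is delivered while paused, and delivery continues exactly where it stopped after resume:

```javascript
// a toy push source with the pause/resume semantics i expect
function makeSource(items) {
  let i = 0;
  let paused = false;
  let onData = null;
  const flow = () => {
    // deliver items only while not paused; stop mid-list if pause() is
    // called from inside the data callback
    while (!paused && i < items.length) onData(items[i++]);
  };
  return {
    each(fn) { onData = fn; flow(); },
    pause() { paused = true; },
    resume() { paused = false; flow(); },
  };
}

const seen = [];
const src = makeSource([1, 2, 3]);
src.each((x) => { seen.push(x); if (x === 1) src.pause(); });
const afterPause = seen.slice(); // [1]: nothing more arrives while paused
src.resume();                    // delivery continues where it stopped
// seen is now [1, 2, 3]
```

my reading is that a real readstream's `pause()` should give the same guarantee; if data still trickles through (or the stream ends) while paused, then as far as i'm concerned it wasn't paused.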
i should be very grateful if some knowledgeable people could fill me in on my doubts, misunderstandings, and lacunae. streams are great!