There's the "one big step" multithreading branch, and it is a pain to keep it updated with changes from master.
While thinking about the issues there (ordering, race conditions, crypto), the idea of "sequential threading" came to mind: stages connected by queue.Queue. It intentionally does not use parallelism within the same phase of processing, so there is only 1 thread per stage:
finder -q- reader -q- id-hasher -q- compressor -q- encryptor -q- writer
finder: just discovers pathnames to back up (obeying includes, excludes, --one-file-system, etc.)
reader: reads and chunks a file
hasher: computes id-hash of a chunk so we can check whether we already have it
compressor: compresses a chunk
encryptor: encrypts a chunk
writer: writes stuff to the repo
A side effect of this staged, one-worker-per-stage approach is that the code gets untangled: the stages are clearly separated and communicate only via well-defined data structures passed over the queues.
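To make the idea concrete, here is a minimal sketch of such a pipeline, not actual project code. All names (`run_stage`, `run_pipeline`, `DONE`, `hasher`, `compressor`) are made up for illustration, plain SHA-256 and zlib are only stand-ins for the real id-hash and compression, and error handling is left out (an exception inside a stage would stall this sketch). Because each stage has exactly one thread and the queues are FIFO, items leave the pipeline in the order they entered it.

```python
import hashlib
import queue
import threading
import zlib

DONE = object()  # sentinel object marking the end of the input stream


def run_stage(func, inq, outq):
    """Start the single worker thread of one stage and return it."""
    def worker():
        while True:
            item = inq.get()
            if item is DONE:
                outq.put(DONE)  # pass the shutdown signal downstream
                return
            outq.put(func(item))

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


def run_pipeline(stage_funcs, items, maxsize=8):
    """Push items through all stages and return the results in order."""
    # One queue in front of each stage, plus one behind the last stage.
    queues = [queue.Queue(maxsize) for _ in range(len(stage_funcs) + 1)]
    threads = [run_stage(func, queues[i], queues[i + 1])
               for i, func in enumerate(stage_funcs)]

    def feed():
        for item in items:
            queues[0].put(item)  # blocks while the first queue is full
        queues[0].put(DONE)

    # Feed from a separate thread so the caller can drain the last queue
    # at the same time; otherwise the bounded queues could deadlock.
    threading.Thread(target=feed, daemon=True).start()

    results = []
    while True:
        item = queues[-1].get()
        if item is DONE:
            break
        results.append(item)
    for thread in threads:
        thread.join()
    return results


def hasher(chunk):
    """Stand-in id-hash: returns (chunk_id, chunk)."""
    return hashlib.sha256(chunk).digest(), chunk


def compressor(item):
    """Stand-in compression: returns (chunk_id, compressed_chunk)."""
    chunk_id, data = item
    return chunk_id, zlib.compress(data)
```

For example, `run_pipeline([hasher, compressor], chunks)` returns one `(chunk_id, compressed_chunk)` pair per input chunk. The bounded queues (`maxsize`) provide backpressure, so a fast stage can not run arbitrarily far ahead of a slow one.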
The full implementation does not need to be done in one go; we can start with fewer stages, e.g.:
finder/reader -q- hasher/compressor/encryptor -q- writer
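One way to start with fewer stages is to keep each step as a separate function and merely run several of them in the same worker thread. The sketch below illustrates that under some assumptions: `merge_stages`, `worker`, `process`, `hasher` and `compressor` are hypothetical names, and SHA-256 and zlib are again only stand-ins. Splitting a merged stage later then just means giving each function its own thread and queue.

```python
import hashlib
import queue
import threading
import zlib

DONE = object()  # sentinel object marking the end of the input stream


def merge_stages(*funcs):
    """Combine several per-item functions into one function for one worker."""
    def merged(item):
        for func in funcs:
            item = func(item)
        return item
    return merged


def hasher(chunk):
    """Stand-in id-hash: returns (chunk_id, chunk)."""
    return hashlib.sha256(chunk).digest(), chunk


def compressor(item):
    """Stand-in compression: returns (chunk_id, compressed_chunk)."""
    chunk_id, data = item
    return chunk_id, zlib.compress(data)


def worker(func, inq, outq):
    """Body of a stage thread: apply func to items until DONE arrives."""
    while True:
        item = inq.get()
        if item is DONE:
            outq.put(DONE)
            return
        outq.put(func(item))


def process(chunks):
    """Run hasher and compressor as ONE merged stage in one thread."""
    inq, outq = queue.Queue(), queue.Queue()  # unbounded, for simplicity
    thread = threading.Thread(
        target=worker, args=(merge_stages(hasher, compressor), inq, outq))
    thread.start()
    for chunk in chunks:
        inq.put(chunk)
    inq.put(DONE)
    thread.join()

    results = []
    while True:
        item = outq.get()
        if item is DONE:
            return results
        results.append(item)
```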
This can solve: the CPU sitting more or less idle while waiting for I/O to complete (read/seek time, write/sync time), and I/O sitting idle while waiting for CPU-bound work to complete.
This cannot (and should not) solve: very slow compression algorithms that need same-stage parallelism.