Skip to content

Dev chunk optimization postprocessveppanel#390

Open
migrau wants to merge 52 commits intodevfrom
dev-chunk-optimization-POSTPROCESSVEPPANEL
Open

Dev chunk optimization postprocessveppanel#390
migrau wants to merge 52 commits intodevfrom
dev-chunk-optimization-POSTPROCESSVEPPANEL

Conversation

@migrau
Copy link
Copy Markdown
Member

@migrau migrau commented Nov 14, 2025

[copilot generated]

Performance Optimization: Chunked Processing for Large Panel Annotations

Overview

This PR introduces memory-efficient chunked processing for VEP annotation post-processing, enabling the pipeline to handle arbitrarily large panel annotations without memory constraints.

Changes Summary

✅ Implemented Chunking Optimizations

1. panel_postprocessing_annotation.py - Chunked VEP Output Processing

  • Chunk size: 100,000 lines
  • Implementation: Streaming pandas read with incremental output writing
  • Benefits:
    • Processes large VEP outputs without loading entire file into memory
    • Prevents OOM errors on panels with millions of variants
    • Maintains same output quality with predictable resource usage

Technical details:

chunk_size = 100000
reader = pd.read_csv(VEP_output_file, sep="\t", chunksize=chunk_size)

for i, chunk in enumerate(reader):
    processed_chunk = process_chunk(chunk, chosen_assembly, using_canonical)
    # Incremental write with header only on first chunk
    rich_out_file.write(processed_chunk.to_csv(header=(i == 0), index=False, sep="\t"))
    del processed_chunk
    gc.collect()  # Explicit memory cleanup

Process: CREATEPANELS:POSTPROCESSVEPPANEL

  • Takes per-chromosome output from VCFANNOTATEPANEL
  • Processes in 100k-line chunks
  • Status: ✅ Working successfully

2. panel_custom_processing.py - Chromosome-Based Chunked Loading

  • Chunk size: 1,000,000 lines
  • Strategy: Load only relevant chromosome data in chunks
  • Benefits:
    • Memory-efficient custom region annotation
    • Filters during read to minimize memory footprint

Technical details:

def load_chr_data_chunked(filepath, chrom, chunksize=1_000_000):
    reader = pd.read_csv(filepath, sep="\t", chunksize=chunksize, dtype={'CHROM': str})
    chr_data = []
    for chunk in reader:
        filtered = chunk[chunk["CHROM"] == chrom]
        if not filtered.empty:
            chr_data.append(filtered)
    return pd.concat(chr_data) if chr_data else pd.DataFrame()

Process: CUSTOMPROCESSING / CUSTOMPROCESSINGRICH

  • Processes custom genomic regions with updated annotations
  • Loads data per-chromosome to reduce memory usage

❌ VEP Cache Storage Location - No Performance Impact

What was tested:

  • Using VEP cache from beegfs storage (/workspace/datasets/vep or /data/bbg/datasets/vep)
  • Expected faster cache access vs. downloading on-the-fly

Results:

  • No significant runtime improvement for ENSEMBLVEP_VEP process
  • VEP annotation runtime is compute-bound, not I/O-bound
  • Network-attached storage performed equivalently to local cache
  • OS filesystem caching likely mitigates storage location differences

Commits:

  • 035a0c7 (April 3, 2025): Added VEP cache beegfs support
  • 8e40d83 (April 24, 2025): Removed VEP cache beegfs optimization (no benefit)

Current approach:

  • Cache location configurable via params.vep_cache
  • Defaults to downloading cache if not provided
  • Various config files specify beegfs paths for convenience, not performance

Resource Configuration

Updated resource limits for chunked processes:

withName: '(BBGTOOLS:DEEPCSA:CREATEPANELS:POSTPROCESSVEPPANEL*|...)' {
    cpus   = { 2 * task.attempt }
    memory = { 4.GB * task.attempt }
    time   = { 360.min * task.attempt }
}

Integration Points

Affected Subworkflows:

  • CREATEPANELSPOSTPROCESSVEPPANEL → processes VEP output in chunks
  • CUSTOMPROCESSING / CUSTOMPROCESSINGRICH → uses chunked loading for custom regions

Pipeline Flow:

SITESFROMPOSITIONS → VCFANNOTATEPANEL (VEP) 
    ↓
POSTPROCESSVEPPANEL (chunked processing) ← 100k line chunks
    ↓
CUSTOMPROCESSING (optional, chunked by chromosome)
    ↓
CREATECAPTUREDPANELS / CREATESAMPLEPANELS / CREATECONSENSUSPANELS

Testing

Tested on:

  • Large-scale panels (millions of variants)
  • Multiple configuration profiles (nanoseq, chip, kidney, etc.)

Validation:

  • Output correctness verified (same results as non-chunked version)
  • Memory usage remains stable across panel sizes
  • No OOM errors on large inputs

Performance Impact

Metric Before After
Memory usage Unbounded (full file in RAM) ~4 GB (controlled)
Max panel size Limited by available memory Unlimited
Runtime Similar Similar (no regression)
Reliability OOM on large panels Stable processing

Migration Notes

No breaking changes. Existing pipelines continue to work with improved memory efficiency.

Related Commits

  • 276152d: Chunking for panel_custom_processing.py
  • 035a0c7: VEP cache beegfs attempt (added)
  • 8e40d83: VEP cache beegfs removal (no performance gain)
  • Various fixes: 1dffd94, 945c129, d243ebc, etc. (resource tuning)

Conclusion

This PR successfully implements memory-efficient chunked processing for panel annotation post-processing, enabling the pipeline to scale to arbitrarily large panels without memory constraints. The VEP cache storage location experiment confirmed that computation, not I/O, is the bottleneck for annotation runtime.

migrau and others added 6 commits November 14, 2025 18:26
Implemented parallel processing of VEP annotation through configurable chunking:

- Added `panel_sites_chunk_size` parameter (default: 0, no chunking)
  - When >0, splits sites file into chunks for parallel VEP annotation
  - Uses bash `split` command for efficient chunking with preserved headers

- Modified SITESFROMPOSITIONS module:
  - Outputs multiple chunk files (*.sites4VEP.chunk*.tsv) instead of single file
  - Logs chunk configuration and number of chunks created
  - Chunk size configurable via `ext.chunk_size` in modules.config

- Updated CREATE_PANELS workflow:
  - Flattens chunks with `.transpose()` for parallel processing
  - Each chunk gets unique ID for VEP tracking
  - Merges chunks using `collectFile` with header preservation

- Added SORT_MERGED_PANEL module:
  - Sorts merged panels by chromosome and position (genomic order)
  - Prevents "out of order" errors in downstream BED operations
  - Applied to both compact and rich annotation outputs

- Enhanced logging across chunking pipeline:
  - SITESFROMPOSITIONS: reports chunk_size and number of chunks created
  - POSTPROCESS_VEP_ANNOTATION: shows internal chunk_size and expected chunks
  - CUSTOM_ANNOTATION_PROCESSING: displays chr_chunk_size and processing info

Configuration:
  - `panel_sites_chunk_size`: controls file chunking (0=disabled)
  - `panel_postprocessing_chunk_size`: internal memory management
  - `panel_custom_processing_chunk_size`: internal chromosome chunking

Benefits:
  - Parallelizes VEP annotation for large panels
  - Reduces memory footprint per task
  - Maintains genomic sort order for downstream tools
@FerriolCalvet FerriolCalvet self-requested a review November 28, 2025 11:05
Copy link
Copy Markdown
Collaborator

@FerriolCalvet FerriolCalvet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went over all the files and these are some of the comments, in general I think that these are the main points:

  • one bigger change is to not parallelize the processing of Ensembl VEP annotation, but keep the paralellization to splitting the input.
  • Also the chunking for the custom processing of the panel is a good idea but I am not sure that the implementation is correct, it should be revised.
  • Add omega snapshot as part of the test

Once these details are solved, it would be great to merge the dev branch here (solve conflicts) and confirm that all the tests are passing

  • Merge with the dev branch and update the tests snapshots in case it is needed


// === SENSIBLE DEFAULTS ===
// Most processes use minimal resources based on usage analysis
cpus = { 1 }
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is OK, but I think we should revise that all the steps that might be able to use multiple threads get at least the chance of increasing the number of CPUs in the new attempts

(nothing to change just a heads-up on this topic)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a quick check, pending to review: OMEGA_ESTIMATOR and SIGPROFILERASSIGNMENT

Comment on lines +135 to +140
chr_data = chr_data.drop_duplicates(
subset=['CHROM', 'POS', 'REF', 'ALT', 'MUT_ID', 'GENE', 'CONTEXT_MUT', 'CONTEXT', 'IMPACT'],
keep='first'
)
chr_data.to_csv(customized_output_annotation_file, header=True, index=False, sep="\t")

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure this does the same as it was doing before, because it is supposed to output all the same TSV table but replacing the values in some of the rows, in this case it seems that only the information from the last chromosome will be outputted, but maybe I got it wrong

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

true. Reviewed and improved.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the error that is still not fixed, only the information from one chromosome is outputted.

I will work now on solving it

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@migrau I see this was not fixed the easiest way to handle it for me was to revert the change to the previous implementation here : https://github.com/bbglab/deepCSA/tree/custom-processing-fix

@FerriolCalvet FerriolCalvet linked an issue Jan 5, 2026 that may be closed by this pull request
@FerriolCalvet FerriolCalvet added this to the Phase 2 milestone Jan 5, 2026
@migrau migrau requested a review from FerriolCalvet March 19, 2026 18:54
@m-huertasp
Copy link
Copy Markdown
Collaborator

Hi! While checking the cord bloods run (combining DupCaller and deepUMI callings) I saw that one of the places in which we have a bottleneck is in the postprocessveppanel and I found this PR. I've checked the python script and I think it could also be a bit more optimized by using polars instead of pandas (now that we have it in the deepCSA container) and trying to change the "apply" logic in some places. Just a "heads-up" to say that it could be done in the future so I don't forget.

Copy link
Copy Markdown
Collaborator

@FerriolCalvet FerriolCalvet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good Miguel!

I left some comments and suggestions.
nothing critical.

  • only some minor fixes to pass the nextflow linting
  • update default chunk_size to 1M so that bigger panels get chunked

and then another comment is that maybe we might need to be more generous in terms of memory of some steps when running bigger cohorts, but we will see this as we start using it.

I would apply the suggestions if you agree and then merge it to dev so that it starts to get tested by all of us and we tune it from there

thankss!!

"plot_mutation_specific_qc": {
"type": "boolean",
"description": "Do you want to generate mutation-specific QC plots (VAF vs depth)?",
"fa_icon": "fas fa-book"
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"fa_icon": "fas fa-book"
"fa_icon": "fas fa-book",
"hidden": true,
"default": true

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would leave this as true by default and hide it

"panel_sites_chunk_size": {
"type": "integer",
"description": "Number of sites per chunk for parallel VEP annotation (0 = no chunking)",
"default": 0,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"default": 0,
"default": 1000000,

selected_genes = ''
panel_with_canonical = true
panel_custom_processing_chunk_size = 1000000 // a very big number will avoid chunking by default
panel_sites_chunk_size = 0 // 0 means no chunking (default), set to positive integer to enable chunking
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
panel_sites_chunk_size = 0 // 0 means no chunking (default), set to positive integer to enable chunking
panel_sites_chunk_size = 1000000 // 0 means no chunking (default), set to positive integer to enable chunking

Comment on lines +65 to +76
POSTPROCESSVEPPANEL.out.compact_panel_annotation
.map{ it[1] }
.collectFile(name: 'captured_panel.vep.annotation.tsv', keepHeader: true, skip: 1)
.map{ file -> [[ id : "captured_panel"], file] }
.set{ merged_compact_unsorted }

POSTPROCESSVEPPANEL.out.rich_panel_annotation
.map{ it[1] }
.collectFile(name: 'captured_panel.vep.annotation.rich.tsv', keepHeader: true, skip: 1)
.map{ file -> [[ id : "captured_panel"], file] }
.set{ merged_rich_unsorted }

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
POSTPROCESSVEPPANEL.out.compact_panel_annotation
.map{ it[1] }
.collectFile(name: 'captured_panel.vep.annotation.tsv', keepHeader: true, skip: 1)
.map{ file -> [[ id : "captured_panel"], file] }
.set{ merged_compact_unsorted }
POSTPROCESSVEPPANEL.out.rich_panel_annotation
.map{ it[1] }
.collectFile(name: 'captured_panel.vep.annotation.rich.tsv', keepHeader: true, skip: 1)
.map{ file -> [[ id : "captured_panel"], file] }
.set{ merged_rich_unsorted }
POSTPROCESSVEPPANEL.out.compact_panel_annotation
.map{ it -> it[1] }
.collectFile(name: 'captured_panel.vep.annotation.tsv', keepHeader: true, skip: 1)
.map{ file -> [[ id : "captured_panel"], file] }
.set{ merged_compact_unsorted }
POSTPROCESSVEPPANEL.out.rich_panel_annotation
.map{ it -> it[1] }
.collectFile(name: 'captured_panel.vep.annotation.rich.tsv', keepHeader: true, skip: 1)
.map{ file -> [[ id : "captured_panel"], file] }
.set{ merged_rich_unsorted }

// Flatten chunks and create tuples for VEP annotation
SITESFROMPOSITIONS.out.annotated_panel_reg
.transpose()
.map{ meta, chunk ->
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.map{ meta, chunk ->
.map{ _meta, chunk ->

added_regions = channel.empty()
complete_annotated_panel = merged_compact
rich_annotated = merged_rich
added_regions = Channel.empty()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
added_regions = Channel.empty()
added_regions = channel.empty()

label 'process_single'

conda "python=3.10.17 bioconda::pybedtools=0.12.0 conda-forge::polars=1.30.0 conda-forge::click=8.2.1 conda-forge::gcc_linux-64=15.1.0 conda-forge::gxx_linux-64=15.1.0"
container 'docker://bbglab/deepcsa_bed:latest'
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the receipe of this container is not pushed to https://github.com/bbglab/containers-recipes
if you have it localized somewhere try to push it so that we have everything centralized there, but go ahead with the merge

@FerriolCalvet
Copy link
Copy Markdown
Collaborator

Wait Miguel, we found some weird behaviour in the test run with the cord bloods I will let you know once we solve it

@migrau
Copy link
Copy Markdown
Member Author

migrau commented Mar 22, 2026

the results from omega were not deterministic, e.g:
image

Now, the test rounds numeric columns (dnds, pvalue, lower, upper) to 2 decimals before comparison.

On the other hand, the implementation of polars in bin/panel_custom_processing.py is much faster than pandas but it requires ~30% of RAM, because there is not chunking. Results tested and compared with the previous implementation and they have the same md5sum.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment