Conversation
…. Upgrade pybedtools. Added wave
…annotating with no flags."
…eated in create_consensus_panel.py
Implemented parallel processing of VEP annotation through configurable chunking:
- Added `panel_sites_chunk_size` parameter (default: 0, no chunking)
- When >0, splits the sites file into chunks for parallel VEP annotation
- Uses bash `split` command for efficient chunking with preserved headers
- Modified SITESFROMPOSITIONS module:
  - Outputs multiple chunk files (*.sites4VEP.chunk*.tsv) instead of a single file
  - Logs chunk configuration and number of chunks created
  - Chunk size configurable via `ext.chunk_size` in modules.config
- Updated CREATE_PANELS workflow:
  - Flattens chunks with `.transpose()` for parallel processing
  - Each chunk gets a unique ID for VEP tracking
  - Merges chunks using `collectFile` with header preservation
- Added SORT_MERGED_PANEL module:
  - Sorts merged panels by chromosome and position (genomic order)
  - Prevents "out of order" errors in downstream BED operations
  - Applied to both compact and rich annotation outputs
- Enhanced logging across the chunking pipeline:
  - SITESFROMPOSITIONS: reports chunk_size and number of chunks created
  - POSTPROCESS_VEP_ANNOTATION: shows internal chunk_size and expected chunks
  - CUSTOM_ANNOTATION_PROCESSING: displays chr_chunk_size and processing info

Configuration:
- `panel_sites_chunk_size`: controls file chunking (0 = disabled)
- `panel_postprocessing_chunk_size`: internal memory management
- `panel_custom_processing_chunk_size`: internal chromosome chunking

Benefits:
- Parallelizes VEP annotation for large panels
- Reduces memory footprint per task
- Maintains genomic sort order for downstream tools
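The module itself does the splitting with the bash `split` command; purely as an illustration of the header-preserving chunking described above, a Python sketch could look like this (the function name and file-name pattern are hypothetical, not the pipeline's actual code):

```python
from pathlib import Path


def split_sites_file(sites_tsv: str, chunk_size: int,
                     out_prefix: str = "sites4VEP.chunk"):
    """Split a headered TSV into chunks of at most `chunk_size` data rows,
    copying the header line into every chunk so each chunk is a valid
    standalone input for VEP annotation."""
    chunk_paths = []
    with open(sites_tsv) as fh:
        header = fh.readline()
        rows, idx = [], 0
        for line in fh:
            rows.append(line)
            if len(rows) == chunk_size:
                path = Path(f"{out_prefix}{idx}.tsv")
                path.write_text(header + "".join(rows))
                chunk_paths.append(path)
                rows, idx = [], idx + 1
        if rows:  # trailing partial chunk
            path = Path(f"{out_prefix}{idx}.tsv")
            path.write_text(header + "".join(rows))
            chunk_paths.append(path)
    return chunk_paths
```

With `chunk_size = 0` the real module skips splitting entirely; the sketch assumes a positive chunk size.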
I went over all the files and these are some of the comments, in general I think that these are the main points:
- one bigger change is to not parallelize the processing of the Ensembl VEP annotation itself, but keep the parallelization to splitting the input.
- Also, the chunking for the custom processing of the panel is a good idea, but I am not sure the implementation is correct; it should be revised.
- Add omega snapshot as part of the test
Once these details are solved, it would be great to merge the dev branch here (solve conflicts) and confirm that all the tests are passing
- Merge with the dev branch and update the tests snapshots in case it is needed
```groovy
// === SENSIBLE DEFAULTS ===
// Most processes use minimal resources based on usage analysis
cpus = { 1 }
```
I think this is OK, but we should check that every step that can use multiple threads at least gets the chance to increase its CPU count on retry attempts.
(nothing to change, just a heads-up on this topic)
From a quick check, still pending review: OMEGA_ESTIMATOR and SIGPROFILERASSIGNMENT
bin/panel_custom_processing.py
```python
chr_data = chr_data.drop_duplicates(
    subset=['CHROM', 'POS', 'REF', 'ALT', 'MUT_ID', 'GENE', 'CONTEXT_MUT', 'CONTEXT', 'IMPACT'],
    keep='first'
)
chr_data.to_csv(customized_output_annotation_file, header=True, index=False, sep="\t")
```
I am not sure this does the same as before: it is supposed to output the whole TSV table with the values replaced in some of the rows, but here it looks like only the information from the last chromosome will be written. Maybe I got it wrong.
True. Reviewed and improved.
This is the error that is still not fixed: only the information from one chromosome is outputted.
I will work on solving it now.
@migrau I see this was not fixed. The easiest way to handle it for me was to revert the change to the previous implementation here: https://github.com/bbglab/deepCSA/tree/custom-processing-fix
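For reference, the overwrite bug discussed in this thread (each chromosome's `to_csv` call replacing the previous output) can be avoided by writing the header once and appending after the first chunk. A minimal sketch under that assumption, not the actual fix in the branch:

```python
import pandas as pd

DEDUP_COLS = ['CHROM', 'POS', 'REF', 'ALT', 'MUT_ID', 'GENE',
              'CONTEXT_MUT', 'CONTEXT', 'IMPACT']


def write_deduplicated_by_chrom(annotation: pd.DataFrame, out_file: str):
    """Process the table one chromosome at a time (to bound memory) while
    still emitting every chromosome: append after the first write instead
    of overwriting the output file on each iteration."""
    first = True
    for _, chr_data in annotation.groupby('CHROM', sort=False):
        chr_data = chr_data.drop_duplicates(subset=DEDUP_COLS, keep='first')
        chr_data.to_csv(out_file, sep="\t", index=False,
                        header=first, mode="w" if first else "a")
        first = False
```

The key detail is `mode="a"` with `header=False` from the second chromosome onward; using `mode="w"` in the loop is exactly what truncates the file to the last chromosome.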
…olving conflicts.
…ostprocessing_annotation.py. panel_postprocessing_chunk_size deleted.
…nd within-chrom chunking optional
Hi! While checking the cord bloods run (combining DupCaller and deepUMI callings) I saw that one of the places in which we have a bottleneck is in the …
Looks good Miguel!
I left some comments and suggestions.
nothing critical.
- only some minor fixes to pass the nextflow linting
- update default chunk_size to 1M so that bigger panels get chunked
One other comment: we may need to be more generous with the memory of some steps when running bigger cohorts, but we will see as we start using it.
I would apply the suggestions if you agree and then merge it to dev so that it starts to get tested by all of us and we tune it from there
thanks!!
| "plot_mutation_specific_qc": { | ||
| "type": "boolean", | ||
| "description": "Do you want to generate mutation-specific QC plots (VAF vs depth)?", | ||
| "fa_icon": "fas fa-book" |
There was a problem hiding this comment.
| "fa_icon": "fas fa-book" | |
| "fa_icon": "fas fa-book", | |
| "hidden": true, | |
| "default": true |
There was a problem hiding this comment.
I would leave this as true by default and hide it
| "panel_sites_chunk_size": { | ||
| "type": "integer", | ||
| "description": "Number of sites per chunk for parallel VEP annotation (0 = no chunking)", | ||
| "default": 0, |
There was a problem hiding this comment.
| "default": 0, | |
| "default": 1000000, |
```groovy
selected_genes = ''
panel_with_canonical = true
panel_custom_processing_chunk_size = 1000000 // a very big number will avoid chunking by default
panel_sites_chunk_size = 0 // 0 means no chunking (default), set to positive integer to enable chunking
```

Suggested change:

```groovy
panel_sites_chunk_size = 1000000 // 0 disables chunking; default is 1000000 so that bigger panels get chunked
```
```groovy
POSTPROCESSVEPPANEL.out.compact_panel_annotation
    .map{ it[1] }
    .collectFile(name: 'captured_panel.vep.annotation.tsv', keepHeader: true, skip: 1)
    .map{ file -> [[ id : "captured_panel"], file] }
    .set{ merged_compact_unsorted }

POSTPROCESSVEPPANEL.out.rich_panel_annotation
    .map{ it[1] }
    .collectFile(name: 'captured_panel.vep.annotation.rich.tsv', keepHeader: true, skip: 1)
    .map{ file -> [[ id : "captured_panel"], file] }
    .set{ merged_rich_unsorted }
```

Suggested change (explicit closure parameter, for the Nextflow linting):

```groovy
POSTPROCESSVEPPANEL.out.compact_panel_annotation
    .map{ it -> it[1] }
    .collectFile(name: 'captured_panel.vep.annotation.tsv', keepHeader: true, skip: 1)
    .map{ file -> [[ id : "captured_panel"], file] }
    .set{ merged_compact_unsorted }

POSTPROCESSVEPPANEL.out.rich_panel_annotation
    .map{ it -> it[1] }
    .collectFile(name: 'captured_panel.vep.annotation.rich.tsv', keepHeader: true, skip: 1)
    .map{ file -> [[ id : "captured_panel"], file] }
    .set{ merged_rich_unsorted }
```
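The `collectFile(keepHeader: true, skip: 1)` calls above concatenate the chunk outputs while keeping only the first chunk's header line and skipping it in the rest. As an illustrative sketch only, the equivalent merge in plain Python would be:

```python
def merge_chunks_keep_header(chunk_paths, merged_path):
    """Concatenate chunked TSVs into one file, keeping the header line of
    the first chunk and skipping it in every subsequent chunk (what
    collectFile(keepHeader: true, skip: 1) does in the workflow)."""
    with open(merged_path, "w") as out:
        for i, path in enumerate(chunk_paths):
            with open(path) as fh:
                if i > 0:
                    fh.readline()  # skip this chunk's header
                out.writelines(fh)
```

Note that, as in the workflow, the merged file is still unsorted at this point; the genomic sorting happens in a separate step afterwards.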
```groovy
// Flatten chunks and create tuples for VEP annotation
SITESFROMPOSITIONS.out.annotated_panel_reg
    .transpose()
    .map{ meta, chunk ->
```

Suggested change:

```groovy
    .map{ _meta, chunk ->
```
```groovy
added_regions = channel.empty()
complete_annotated_panel = merged_compact
rich_annotated = merged_rich
added_regions = Channel.empty()
```

Suggested change:

```groovy
added_regions = channel.empty()
```
```groovy
label 'process_single'

conda "python=3.10.17 bioconda::pybedtools=0.12.0 conda-forge::polars=1.30.0 conda-forge::click=8.2.1 conda-forge::gcc_linux-64=15.1.0 conda-forge::gxx_linux-64=15.1.0"
container 'docker://bbglab/deepcsa_bed:latest'
```
I think the recipe for this container is not pushed to https://github.com/bbglab/containers-recipes
If you have it somewhere locally, try to push it so that we have everything centralized there, but go ahead with the merge
Wait Miguel, we found some weird behaviour in the test run with the cord bloods. I will let you know once we solve it.
…ove chunking logic
…tion. Update omega snapshot with 2 decimals
…oved processing. Resources adjusted

[copilot generated]

# Performance Optimization: Chunked Processing for Large Panel Annotations

## Overview

This PR introduces memory-efficient chunked processing for VEP annotation post-processing, enabling the pipeline to handle arbitrarily large panel annotations without memory constraints.

## Changes Summary

### ✅ Implemented Chunking Optimizations

1. `panel_postprocessing_annotation.py` — Chunked VEP Output Processing
   - Technical details:
   - Process: `CREATEPANELS:POSTPROCESSVEPPANEL`, `VCFANNOTATEPANEL`
2. `panel_custom_processing.py` — Chromosome-Based Chunked Loading
   - Technical details:
   - Process: `CUSTOMPROCESSING` / `CUSTOMPROCESSINGRICH`

### ❌ VEP Cache Storage Location — No Performance Impact

What was tested:
- VEP cache location (`/workspace/datasets/vep` or `/data/bbg/datasets/vep`)

Results:
- `ENSEMBLVEP_VEP` process

Commits:
- `035a0c7` (April 3, 2025): Added VEP cache beegfs support
- `8e40d83` (April 24, 2025): Removed VEP cache beegfs optimization (no benefit)

Current approach:
- `params.vep_cache`

## Resource Configuration

Updated resource limits for chunked processes:

## Integration Points

Affected subworkflows:
- `CREATEPANELS` → `POSTPROCESSVEPPANEL` → processes VEP output in chunks
- `CUSTOMPROCESSING` / `CUSTOMPROCESSINGRICH` → uses chunked loading for custom regions

Pipeline Flow:

## Testing

Tested on:

Validation:

## Performance Impact

## Migration Notes

No breaking changes. Existing pipelines continue to work with improved memory efficiency.

## Related Commits

- `276152d`: Chunking for `panel_custom_processing.py`
- `035a0c7`: VEP cache beegfs attempt (added)
- `8e40d83`: VEP cache beegfs removal (no performance gain)
- `1dffd94`, `945c129`, `d243ebc`, etc. (resource tuning)

## Conclusion

This PR successfully implements memory-efficient chunked processing for panel annotation post-processing, enabling the pipeline to scale to arbitrarily large panels without memory constraints. The VEP cache storage location experiment confirmed that computation, not I/O, is the bottleneck for annotation runtime.