A pipeline to phase DNA methylation from HiFi reads in a human pedigree (including, as a special case, a trio) to the haplotypes of the pedigree's founders.
We assume the following command-line tools are in the user's PATH:
bgzip,tabix,bcftoolsgtg-ped-map,gtg-concordance(https://github.com/Platinum-Pedigree-Consortium/Platinum-Pedigree-Inheritance/blob/e12aca6b49ee7208952467db4a2a9e2f79b98efb/HAPLOTYPING.md)hiphase(https://github.com/PacificBiosciences/HiPhase)aligned_bam_to_cpg_scores(https://github.com/PacificBiosciences/pb-CpG-tools)
Install the python dependencies:
/path/to/python3.11 -m venv .venv
source .venv/bin/activate
# https://github.com/brentp/cyvcf2?tab=readme-ov-file#github-building-htslib-and-cyvcf2-from-source
cd $HOME
git clone --recursive https://github.com/brentp/cyvcf2
cd cyvcf2
pip install -r requirements.txt
CYVCF2_HTSLIB_MODE=EXTERNAL python setup.py install
cd /path/to/tapestry
pip install -r requirements.txt
Install bedGraphToBigWig into the virtual environment's bin directory:
# macOS:
curl -O https://hgdownload.cse.ucsc.edu/admin/exe/macOSX.x86_64/bedGraphToBigWig
chmod +x bedGraphToBigWig
mv bedGraphToBigWig .venv/bin/
# Linux (the latest UCSC binary requires glibc ≥ 2.34, so use an older version):
cp ~/bin/bedGraphToBigWig .venv/bin/
- Phase variants:
- Phase variants using read-backed phasing with the
run-hiphase.shscript. - Build an inheritance-based haplotype map and inheritance-phase variants using the
build-iht-based-haplotype-map-and-phase-variants.shscript.
- Phase variants using read-backed phasing with the
- Use
aligned_bam_to_cpg_scores.shto generate count- and model-based methylation levels from the haplotagged BAM files produced in step 1. - Phase count- and model-based methylation data to founder haplotypes using the
phase_meth_to_founder_haps.shscript, which uses the data produced in steps 1 and 2. - Use
expand_to_all_cpgs.shscript to generalize tapestry's output to include all CpG sites in the reference and sample genome, and unphased count- and model-based methylation levels, where available. This step also uses heterozygous SNVs to flag CpG sites that are "allele-specific", i.e., exist in only one of the haplotypes in the sample in question. Uses output of steps 2 and 3.
The phase_meth_to_founder_haps.sh script calls a CL tool called src/phase_meth_to_founder_haps.py:
vcf_read_phased="${read_phased_dir}/${uid}.GRCh38.deepvariant.glnexus.phased.vcf.gz" # single-sample vcf from hiphase
tsv_read_phase_blocks="${read_phased_dir}/${uid}.GRCh38.hiphase.blocks.tsv" # single-sample tsv from hiphase
vcf_iht_phased="${iht_phased_dir}/CEPH1463.GRCh38.pass.sorted.vcf.gz" # joint-called multi-sample vcf from gtg-ped-map/gtg-concordance
txt_iht_blocks="${iht_phased_dir}/CEPH1463.GRCh38.iht.sorted.txt" # multi-sample iht blocks file from gtg-ped-map/gtg-concordance
bed_meth_count_hap1="${meth_count_read_phased_dir}/${uid}.GRCh38.haplotagged.hap1.bed.gz" # bed file of count-based methylation from aligned_bam_to_cpg_scores for hap1
bed_meth_count_hap2="${meth_count_read_phased_dir}/${uid}.GRCh38.haplotagged.hap2.bed.gz" # bed file of count-based methylation from aligned_bam_to_cpg_scores for hap2
bed_meth_model_hap1="${meth_model_read_phased_dir}/${uid}.GRCh38.haplotagged.hap1.bed.gz" # bed file of model-based methylation from aligned_bam_to_cpg_scores for hap1
bed_meth_model_hap2="${meth_model_read_phased_dir}/${uid}.GRCh38.haplotagged.hap2.bed.gz" # bed file of model-based methylation from aligned_bam_to_cpg_scores for hap2
nohup python src/phase_meth_to_founder_haps.py \
--uid ${uid} \
--vcf_read_phased ${vcf_read_phased} \
--tsv_read_phase_blocks ${tsv_read_phase_blocks} \
--vcf_iht_phased ${vcf_iht_phased} \
--txt_iht_blocks ${txt_iht_blocks} \
--bed_meth_count_hap1 ${bed_meth_count_hap1} \
--bed_meth_count_hap2 ${bed_meth_count_hap2} \
--bed_meth_model_hap1 ${bed_meth_model_hap1} \
--bed_meth_model_hap2 ${bed_meth_model_hap2} \
--output_dir ${output_dir} \
> ${output_dir}/${uid}.log 2>&1 &
This will produce a log file that looks like:
2025-10-06 14:41:36 - INFO - Got read-based phasing data: 2445352 rows, 8 columns
2025-10-06 14:41:37 - INFO - Got read-based phase blocks: 10017 rows, 7 columns
2025-10-06 14:44:45 - INFO - Got inheritance-based phasing data: 2192365 rows, 7 columns
2025-10-06 14:44:45 - INFO - Got inheritance-based phase blocks: 1329 rows, 5 columns
2025-10-06 14:44:52 - INFO - Got all phasing data: 2189885 rows, 15 columns
2025-10-06 14:44:59 - INFO - Got hap map: 8904 rows, 7 columns
2025-10-06 14:44:59 - INFO - Got sites: 2189885 rows, 5 columns
2025-10-06 14:44:59 - INFO - Got sites where read-based and inheritance-based bit vectors don't match: 33390 rows, 5 columns
2025-10-06 14:45:00 - INFO - Wrote paternal and maternal hap-map blocks for IGV visualization
2025-10-06 14:45:00 - INFO - Wrote hap-map blocks
2025-10-06 14:45:00 - INFO - Wrote sites of bit-vectors, and sites where bit vectors are mismatched, for IGV visualization
2025-10-06 14:45:31 - INFO - Got read-based phasing of count-based methylation levels: 26729958 rows, 7 columns
2025-10-06 14:45:58 - INFO - Got read-based phasing of model-based methylation levels: 26729958 rows, 7 columns
2025-10-06 14:47:43 - INFO - Phased count-based methylation levels to founder haplotypes: 26729958 rows, 14 columns
2025-10-06 14:49:26 - INFO - Phased model-based methylation levels to founder haplotypes: 26729958 rows, 14 columns
2025-10-06 14:49:52 - INFO - Combined count- and model-based methylation levels: 26729958 rows, 16 columns
2025-10-06 14:49:52 - INFO - Percentage of CpG sites that are phased to founder haplotypes: 93%
2025-10-06 14:49:52 - INFO - Percentage of CpG sites that are within 50bp of a mismatch site: 0.183%
2025-10-06 14:49:57 - INFO - Wrote methylation levels phased to founder haplotypes
2025-10-06 14:52:15 - INFO - Wrote bigwig file for pat count-based methylation levels
2025-10-06 14:54:37 - INFO - Wrote bigwig file for pat model-based methylation levels
2025-10-06 14:56:54 - INFO - Wrote bigwig file for mat count-based methylation levels
2025-10-06 14:59:16 - INFO - Wrote bigwig file for mat model-based methylation levels
2025-10-06 14:59:17 - INFO - Done running /scratch/ucgd/lustre-labs/quinlan/u6018199/tapestry/src/phase_meth_to_founder_haps.py
and phases count- and model-based DNA methylation to founder haplotypes. The tool also creates files that collectively enable visualization of phased DNA methylation in IGV, e.g.,
Step 4 of pedigree workflow: Expanding the dataset to include all CpG sites in the reference and the sample genome, and to include unphased DNA methylation, and to flag CpGs as "allele-specific"
We would like to be aware of the existence of all CpG sites in the reference genome, irrespective of whether phased or unphased methylation levels are reported for those sites. Additionally, DNA methylation levels are computed at all CpG sites in the sample's haplotypes, even if those CpG sites are not in the reference genome (an example of a variant "creating" a CpG site in the sample). Conversely, CpG sites can be destroyed by mutation, and therefore a methylation level is not reported at that site on the corresponding haplotype. We call such CpG sites "allele-specific". Finally, this step of the workflow reports various QC statistics, e.g., the percentage of CpG sites (in reference and sample genomes, and on phasable chromosomes) at which count-based methylation is phased to at least one parental haplotype. See https://github.com/quinlan-lab/tapestry/blob/main/src/expand_to_all_cpgs.200081.ipynb for examples of all these things.
The bed file output by step 4 includes a header containing run metadata and column headings. The ##source= line records the script path followed by with args and a dictionary of the command-line arguments passed to src/expand_to_all_cpgs.py. The final header line lists the column names in the form "#col1 col2 col3...". An example header is shown below:
##source='/scratch/ucgd/lustre-labs/quinlan/u6018199/tapestry/src/expand_to_all_cpgs.py with args {'bed_all_cpgs_in_reference': '/scratch/ucgd/lustre-labs/quinlan/data-shared/dna-methylation/CEPH1463.GRCh38.hifi.founder-phased.all-cpgs.2/all_cpg_sites_in_reference.bed', 'bed_meth_count_unphased': '/scratch/ucgd/lustre-labs/quinlan/data-shared/dna-methylation/CEPH1463.GRCh38.hifi.count.read-backed-phased/200084.GRCh38.haplotagged.combined.bed.gz', 'bed_meth_model_unphased': '/scratch/ucgd/lustre-labs/quinlan/data-shared/dna-methylation/CEPH1463.GRCh38.hifi.model.read-backed-phased/200084.GRCh38.haplotagged.combined.bed.gz', 'bed_meth_founder_phased': '/scratch/ucgd/lustre-labs/quinlan/data-shared/dna-methylation/CEPH1463.GRCh38.hifi.founder-phased/200084.dna-methylation.founder-phased.bed', 'bed_het_site_mismatches': '/scratch/ucgd/lustre-labs/quinlan/data-shared/dna-methylation/CEPH1463.GRCh38.hifi.founder-phased/200084.bit-vector-sites-mismatches.bed', 'bed_meth_founder_phased_all_cpgs': '/scratch/ucgd/lustre-labs/quinlan/data-shared/dna-methylation/CEPH1463.GRCh38.hifi.founder-phased.all-cpgs.2/200084.dna-methylation.founder-phased.all_cpgs.bed', 'uid': '200084', 'vcf_iht_phased': '/scratch/ucgd/lustre-labs/quinlan/data-shared/haplotype-maps/CEPH1463.GRCh38/CEPH1463.GRCh38.pass.sorted.vcf.gz'}'
#chrom start_cpg end_cpg total_read_count methylation_level_count methylation_level_model start_hap_map_block end_hap_map_block haplotype_concordance_in_hap_map_block num_het_SNVs_in_hap_map_block total_read_count_pat total_read_count_mat founder_haplotype_pat founder_haplotype_mat methylation_level_pat_count methylation_level_mat_count methylation_level_pat_model methylation_level_mat_model cpg_is_within_50bp_of_mismatch_site cpg_overlaps_at_least_one_snv snv_genotypes cpg_is_allele_specific
Definitions of column headings:
| Column | Definition |
|---|---|
| chrom | chromosome on which CpG site is found |
| start_cpg | start coordinate of CpG dinucleotide |
| end_cpg | end coordinate of CpG dinucleotide |
| total_read_count | total number of HiFi reads overlapping CpG site |
| methylation_level_count | methylation level: fraction of reads that are methylated (inferred from sequencing kinetics) at given CpG site, c.f., https://github.com/PacificBiosciences/pb-CpG-tools?tab=readme-ov-file#output-modes-and-option-details |
| methylation_level_model | methylation level: neural network maps sequencing kinetics at multiple CpG sites to methylation probability at a central CpG site, c.f., "Methods" section of https://www.pacb.com/wp-content/uploads/poster_saunders.pdf |
| start_hap_map_block | start of hap-map block (a genomic interval in which parental phasing of reads can be performed) |
| end_hap_map_block | end of hap-map block |
| haplotype_concordance_in_hap_map_block | the degree of similarity (on a scale of 0 to 1, with 1 representing perfect similarity) of the bit vectors: sequences of alleles observed at heterozygous sites on haplotypes inferred by read-backed vs inheritance-backed phasing |
| num_het_SNVs_in_hap_map_block | number of heterozygous sites in the given hap-map block |
| total_read_count_pat | number of reads from the (deduced) paternal haplotype |
| total_read_count_mat | number of reads from the (deduced) maternal haplotype |
| founder_haplotype_pat | a label indicating which of the haplotypes in a founder of the pedigree the paternal haplotype in the given individual was descended from |
| founder_haplotype_mat | a similar label corresponding to the maternal haplotype of the sample |
| methylation_level_pat_count | count-based methylation level on the paternal haplotype (fraction of paternal reads that are methylated) |
| methylation_level_mat_count | count-based methylation level on the maternal haplotype (fraction of maternal reads that are methylated) |
| methylation_level_pat_model | model-based methylation level on the paternal haplotype (deep-learning estimated methylation using only the paternal reads) |
| methylation_level_mat_model | model-based methylation level on the maternal haplotype (deep-learning estimated methylation using only the maternal reads) |
| cpg_is_within_50bp_of_mismatch_site | is the given CpG site within 50bp of a heterozygous site at which the read-based and inheritance-based bit vectors (allele sequences corresponding to each haplotype) are mismatched |
| cpg_overlaps_at_least_one_snv | does the CpG dinucleotide overlap an SNV? |
| snv_genotypes | are the SNVs (if any) that overlap CpG dinucleotide hom or het? |
| cpg_is_allele_specific | does the CpG dinucleotide appear in the reads corresponding to one haplotype, but not the reads of the other haplotype? |
- Phase variants using pedMEC-based trio phasing with the
run-whatshap.shscript. - Use
aligned_bam_to_cpg_scores.shto generate count- and model-based methylation levels from the haplotagged BAM files produced in step 1, for each trio member (kid, dad, and mom). - Phase count- and model-based methylation data in kid to parental haplotypes (A/B in dad, C/D in mom) using the
phase_meth_to_parent_haps.shscript, which uses the data produced in steps 1 and 2. - Use
expand_to_all_cpgs.trio.shto generalize tapestry's output to include all CpG sites in the reference and sample genomes, unphased count- and model-based methylation levels (for all three trio members, where available), and unphased bigwig files for IGV visualization. This step also uses heterozygous SNVs to flag CpG sites that are "allele-specific" for each family member individually. Uses output of steps 2 and 3. Seehttps://github.com/quinlan-lab/tapestry/blob/main/src/expand_to_all_cpgs.trio.ipynbfor examples.
Step 3 of trio workflow: Phasing count- and model-based DNA methylation in kid to parental haplotypes
The phase_meth_to_parent_haps.sh script calls src/phase_meth_to_parent_haps.py, which phases methylation levels in the kid to the corresponding haplotypes in the parents (A/B in dad, C/D in mom), and creates files for IGV visualization, e.g.,
Step 4 of trio workflow: Expanding the dataset to include all CpG sites, unphased methylation, and per-member allele-specific CpG labeling
This step mirrors step 4 of the pedigree-wise workflow but is adapted for a trio. The key differences are:
- Unphased methylation for all three members: count- and model-based unphased methylation levels are included for kid, dad, and mom (rather than just a single sample).
- Per-member allele-specific labeling: CpG sites are flagged as allele-specific independently for each family member (kid, dad, and mom), since a variant may be heterozygous in one member but homozygous in another.
- Separate paternal and maternal mismatch proximity: proximity to mismatch sites is computed separately for the paternal and maternal sides.
- Bigwig files: unphased methylation bigwig files are written for each trio member for IGV visualization.
Informative example of allele-specific methylation resulting from a SNP in the trio:
The bed file output by step 4 includes a header containing run metadata and column headings. The ##source= line records the script path followed by with args and a dictionary of the command-line arguments passed to src/expand_to_all_cpgs_trio.py. The final header line lists the column names in the form "#col1 col2 col3...".
| Column | Definition |
|---|---|
| chrom | chromosome on which CpG site is found |
| start_cpg | start coordinate of CpG dinucleotide |
| end_cpg | end coordinate of CpG dinucleotide |
| total_read_count_kid | total number of HiFi reads overlapping CpG site in the kid |
| methylation_level_kid_count | count-based unphased methylation level in the kid |
| methylation_level_kid_model | model-based unphased methylation level in the kid |
| total_read_count_dad | total number of HiFi reads overlapping CpG site in dad |
| methylation_level_dad_count | count-based unphased methylation level in dad |
| methylation_level_dad_model | model-based unphased methylation level in dad |
| total_read_count_mom | total number of HiFi reads overlapping CpG site in mom |
| methylation_level_mom_count | count-based unphased methylation level in mom |
| methylation_level_mom_model | model-based unphased methylation level in mom |
| total_read_count_kid_pat | number of reads from the kid's paternal haplotype |
| methylation_level_kid_pat_count | count-based methylation level on the kid's paternal haplotype |
| total_read_count_kid_mat | number of reads from the kid's maternal haplotype |
| methylation_level_kid_mat_count | count-based methylation level on the kid's maternal haplotype |
| methylation_level_kid_pat_model | model-based methylation level on the kid's paternal haplotype |
| methylation_level_kid_mat_model | model-based methylation level on the kid's maternal haplotype |
| total_read_count_dad_A | number of reads from dad's haplotype A (hap1) |
| methylation_level_dad_A_count | count-based methylation level on dad's haplotype A |
| total_read_count_dad_B | number of reads from dad's haplotype B (hap2) |
| methylation_level_dad_B_count | count-based methylation level on dad's haplotype B |
| methylation_level_dad_A_model | model-based methylation level on dad's haplotype A |
| methylation_level_dad_B_model | model-based methylation level on dad's haplotype B |
| total_read_count_mom_C | number of reads from mom's haplotype C (hap1) |
| methylation_level_mom_C_count | count-based methylation level on mom's haplotype C |
| total_read_count_mom_D | number of reads from mom's haplotype D (hap2) |
| methylation_level_mom_D_count | count-based methylation level on mom's haplotype D |
| methylation_level_mom_C_model | model-based methylation level on mom's haplotype C |
| methylation_level_mom_D_model | model-based methylation level on mom's haplotype D |
| start_hap_map_block_pat | start of the paternal hap-map block |
| end_hap_map_block_pat | end of the paternal hap-map block |
| paternal_haplotype | which of dad's haplotypes (A or B) corresponds to the kid's paternal haplotype in this block |
| paternal_concordance | concordance of read-based vs pedMEC-based bit vectors on the paternal side |
| num_het_SNVs_in_dad | number of heterozygous sites in dad within the paternal hap-map block |
| start_hap_map_block_mat | start of the maternal hap-map block |
| end_hap_map_block_mat | end of the maternal hap-map block |
| maternal_haplotype | which of mom's haplotypes (C or D) corresponds to the kid's maternal haplotype in this block |
| maternal_concordance | concordance of read-based vs pedMEC-based bit vectors on the maternal side |
| num_het_SNVs_in_mom | number of heterozygous sites in mom within the maternal hap-map block |
| cpg_is_within_50bp_of_mismatch_site_pat | is the CpG site within 50bp of a paternal heterozygous site at which the read-based and pedMEC-based bit vectors are mismatched |
| cpg_is_within_50bp_of_mismatch_site_mat | is the CpG site within 50bp of a maternal heterozygous site at which the read-based and pedMEC-based bit vectors are mismatched |
| cpg_overlaps_at_least_one_snv | does the CpG dinucleotide overlap an SNV? |
| snv_genotypes_kid | are the SNVs (if any) that overlap the CpG dinucleotide hom or het in the kid? |
| cpg_is_allele_specific_kid | does the CpG dinucleotide appear in the reads of only one haplotype in the kid? |
| snv_genotypes_dad | are the SNVs (if any) that overlap the CpG dinucleotide hom or het in dad? |
| cpg_is_allele_specific_dad | does the CpG dinucleotide appear in the reads of only one haplotype in dad? |
| snv_genotypes_mom | are the SNVs (if any) that overlap the CpG dinucleotide hom or het in mom? |
| cpg_is_allele_specific_mom | does the CpG dinucleotide appear in the reads of only one haplotype in mom? |
A small development dataset (region chr1:1000000-2000000 for the CEPH1463 trio) can be created and used to run and test the trio workflow locally. The following files support this:
| File | Description |
|---|---|
trio_dev_data_config.sh |
Defines the genomic region (DEV_REGION) used by all other dev data scripts. |
trio_dev_data_create.sh |
Creates the dev dataset by subsetting the production VCF and BAM files to the dev region. Run on a machine with access to the production data: ./trio_dev_data_create.sh trio_dev_data. |
trio_dev_data_get_ref.sh |
Downloads the reference FASTA for the dev region chromosome from UCSC. Only needs to be run once: ./trio_dev_data_get_ref.sh trio_dev_data. |
trio_dev_data_serve.py |
HTTP server with range-request support for serving the trio_dev_data/ directory to IGV. Start with .venv/bin/python trio_dev_data_serve.py (default port 8000). |
trio_dev_data_igv_session.xml |
IGV session file that loads the dev data (haplotagged BAMs, phased VCFs, phase blocks, methylation bigwigs, and mismatch-site VCFs) from the local server. Open in IGV via File > Open Session after starting the server. |
To run the trio workflow on the dev data, pass --dev-dir trio_dev_data to each pipeline script, e.g.:
./run-whatshap.sh --dev-dir trio_dev_data
./aligned_bam_to_cpg_scores.sh --dev-dir trio_dev_data
./phase_meth_to_parent_haps.sh --dev-dir trio_dev_data
./expand_to_all_cpgs.trio.sh --dev-dir trio_dev_data
- Convert manual workflow into a Snakemake workflow (see
Snakefilefor early version of this)



