Commit 90a5414

update liftover
1 parent 256092c commit 90a5414

9 files changed

Lines changed: 1674 additions & 297 deletions

37_liftover/LiftOver.md

Lines changed: 209 additions & 0 deletions
# UCSC LiftOver Tool

UCSC `liftOver` is a command-line tool that converts genomic coordinates between genome assemblies using chain files.

## Online Version

UCSC also provides a **web-based liftOver tool** for quick conversions without installation:

- **Online liftOver tool**: https://genome.ucsc.edu/cgi-bin/hgLiftOver
- Upload a BED file or paste coordinates directly
- Select source and target genome assemblies
- Download results immediately

The online tool is convenient for small-scale conversions; the command-line tool is recommended for batch processing and automation.

## Installation

Download the `liftOver` binary from UCSC:

- **Linux**: http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/liftOver
- **macOS**: http://hgdownload.soe.ucsc.edu/admin/exe/macOSX.x86_64/liftOver

Make it executable:

```bash
chmod +x liftOver
```

## Download Chain Files

Download the chain file for your desired conversion from UCSC (e.g., hg19 → hg38):

```bash
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz
```

Common chain files:

- `hg19ToHg38.over.chain.gz` (hg19 → hg38)
- `hg38ToHg19.over.chain.gz` (hg38 → hg19)
- `hg18ToHg19.over.chain.gz` (hg18 → hg19)

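The file names above follow a `<from>To<To>.over.chain.gz` pattern, stored under the source assembly's `liftOver` directory. As a rough sketch, assuming that pattern holds for the pair you need (not every pair has a chain file, so check the directory listing first), a small helper can build the URL. The function name `chain_url` is hypothetical and not part of any UCSC tooling.

```bash
# Hypothetical helper: build the UCSC download URL for a chain file.
# Assumes the <from>To<To>.over.chain.gz naming pattern used by the files listed above.
chain_url() {
    from=$1
    to=$2
    # Capitalize the first letter of the target assembly (hg38 -> Hg38)
    first=$(printf '%s' "$to" | cut -c1 | tr '[:lower:]' '[:upper:]')
    rest=$(printf '%s' "$to" | cut -c2-)
    printf 'http://hgdownload.soe.ucsc.edu/goldenPath/%s/liftOver/%sTo%s%s.over.chain.gz\n' \
        "$from" "$from" "$first" "$rest"
}

chain_url hg19 hg38
chain_url hg38 hg19
```

The printed URL can then be passed to `wget` as in the download command above.
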
## Basic Usage

```bash
liftOver input.bed chain_file.chain.gz output.bed unmapped.bed
```

**Arguments:**

- `input.bed`: Input file in BED format (0-based, half-open intervals)
- `chain_file.chain.gz`: Chain file for the conversion
- `output.bed`: Successfully lifted coordinates
- `unmapped.bed`: Records that could not be lifted, each preceded by a `#` comment line giving the failure reason

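Before running the command, it can help to check that the input really follows the BED conventions. The sketch below uses a toy file (`check_input.bed`, a made-up name) and flags records with fewer than three columns, a negative start, or an end that is not greater than the start. Zero-length intervals are flagged on purpose here, on the assumption that the input holds single-base positions as in this tutorial.

```bash
# Toy input: the last two records are deliberately malformed
cat > check_input.bed << 'EOF'
chr1 1000 1001 rs123
chr1 2000 2001 rs456
chr2 5001 5001 rs789
chr3 -1 10 rs000
EOF

# Report records that break the rules: fewer than 3 columns,
# a negative start, or an end that is not greater than the start
awk 'NF < 3 || $2 < 0 || $3 <= $2 { print "line " NR ": " $0 }' \
    check_input.bed > bad_records.txt
cat bad_records.txt
```

An empty `bad_records.txt` means no record tripped these particular checks; it does not guarantee that every record will lift.
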
## Input Format (BED)

BED format uses **0-based, half-open intervals**:

```text
chr1 1000 1001 rs123
chr1 2000 2001 rs456
chr2 5000 5001 rs789
```

Columns:

1. Chromosome name
2. Start position (0-based)
3. End position (0-based, exclusive)
4. Name/ID (optional, but useful for tracking)

!!! warning "Coordinate System"
    BED format is **0-based**. If your coordinates are **1-based** (e.g., from VCF or sumstats), convert them first:

    - 1-based position `N` → BED start: `N-1`, BED end: `N`

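The `N` → (`N-1`, `N`) rule can be applied with `awk`. This is a minimal sketch on a toy summary-statistics file; the file name and the `CHR POS SNP` column layout are assumptions for illustration, so adjust the column numbers to your own file.

```bash
# Toy summary statistics with 1-based positions (column layout is an assumption)
cat > sumstats_toy.txt << 'EOF'
CHR POS SNP
1 1000001 rs123
1 2000001 rs456
2 5000001 rs789
EOF

# Skip the header, add the "chr" prefix, and write start = POS-1, end = POS
awk 'NR > 1 { print "chr" $1 "\t" ($2 - 1) "\t" $2 "\t" $3 }' \
    sumstats_toy.txt > sumstats_toy.bed
cat sumstats_toy.bed
```

The `chr` prefix is added because the UCSC chain files use `chr`-prefixed chromosome names, as in the BED examples above.
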
## Common Options

```bash
liftOver -minMatch=0.95 input.bed chain_file.chain.gz output.bed unmapped.bed
```

- `-minMatch=0.95`: Minimum ratio of bases in an interval that must remap (default: 0.95)
- `-multiple`: Allow an interval to map to multiple output regions (by default, such intervals are written to the unmapped file)

!!! example "Example"
    Convert SNP positions from hg19 to hg38:

    ```bash
    # Create input BED file (0-based)
    cat > snps_hg19.bed << EOF
    chr1 1000000 1000001 rs123
    chr1 2000000 2000001 rs456
    chr2 5000000 5000001 rs789
    EOF

    # Run liftover
    liftOver snps_hg19.bed hg19ToHg38.over.chain.gz snps_hg38.bed snps_unmapped.bed

    # Check results
    echo "Successfully lifted:"
    wc -l snps_hg38.bed

    # Each failed record is preceded by a "#reason" line, so count only the records
    echo "Failed:"
    grep -vc '^#' snps_unmapped.bed
    ```

!!! example "Simple Example: Liftover chr1 from BIM File"
    This example extracts chromosome 1 positions from a PLINK BIM file and converts them from hg19 to hg38:

    ```bash
    # Extract chr1 positions from BIM file and convert to BED format
    # BIM format: chr variant_id genetic_distance position(1-based) allele1 allele2
    # BED format: chr start(0-based) end(0-based, exclusive) variant_id

    awk '$1==1 {print "chr1\t" ($4-1) "\t" $4 "\t" $2}' \
        01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.bim \
        > chr1_hg19.bed

    # Run liftover
    liftOver chr1_hg19.bed hg19ToHg38.over.chain.gz chr1_hg38.bed chr1_unmapped.bed

    # Check results
    echo "Total input positions:"
    wc -l chr1_hg19.bed

    echo "Successfully lifted:"
    wc -l chr1_hg38.bed

    # Each failed record is preceded by a "#reason" line, so count only the records
    echo "Failed:"
    grep -vc '^#' chr1_unmapped.bed

    # View first few successfully lifted positions
    echo "First 5 lifted positions:"
    head -5 chr1_hg38.bed
    ```

```text
==========================================
Liftover Example: chr1 from BIM file
==========================================

Step 1: Extracting chr1 positions from BIM file...
Input: ../01_Dataset/1KG.EAS.auto.snp.norm.nodup.split.rare002.common015.missing.bim
Converting 1-based BIM coordinates to 0-based BED format...
Extracted 97655 chr1 positions
Output: chr1_hg19.bed

Step 2: Running liftover (hg19 → hg38)...
Chain file: hg19ToHg38.over.chain.gz
This may take a few minutes...
Reading liftover chains
Mapping coordinates
Liftover completed

Step 3: Results summary
==========================================
Total input positions: 97655
Successfully lifted: 97526
Success rate: 99.87%
Failed: 258

First 5 failed positions:
#Deleted in new
chr1 1590525 1590526 1:1590526:G:C
#Deleted in new
chr1 1590574 1590575 1:1590575:G:A
#Deleted in new
==========================================

Step 4: Example lifted positions (first 5):
Format: chr start(0-based) end(0-based) variant_id
chr1 14929 14930 1:14930:A:G
chr1 15773 15774 1:15774:G:A
chr1 15776 15777 1:15777:A:G
chr1 57291 57292 1:57292:C:T
chr1 77873 77874 1:77874:G:A

Output files:
Input BED (hg19): chr1_hg19.bed
Output BED (hg38): chr1_hg38.bed
Unmapped positions: chr1_unmapped.bed
```

Note that `Failed: 258` is the line count of `chr1_unmapped.bed`, not the number of failed positions. Each failed record is preceded by a `#` reason line, so 258 lines correspond to 129 failed positions (97655 − 97526 = 129).

**Key points:**

- BIM files use **1-based coordinates**, so we subtract 1 to get the 0-based BED start
- The BED end is the original 1-based position, so end = start + 1 for single-base variants
- The variant ID from column 2 is carried into the BED file so each lifted record can be traced back to its variant

See `liftover_chr1_example.sh` for a complete script that performs this conversion.

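Because the variant ID travels with each record, the lifted positions can be joined back onto the BIM file. The sketch below is one possible way to do that with `awk`, using toy files with made-up hg38 coordinates; all file names are hypothetical. It only rewrites the position column and drops variants missing from the lifted BED. It does not handle variants that land on a different chromosome or strand, and it does not re-sort the output.

```bash
# Toy hg19 BIM: chr, variant ID, genetic distance, 1-based position, alleles
printf '1\tvarA\t0\t14930\tA\tG\n1\tvarB\t0\t15774\tG\tA\n1\tvarC\t0\t1590526\tG\tC\n' \
    > toy_hg19.bim
# Toy liftOver output with made-up coordinates; varC is absent, as if unmapped
printf 'chr1\t15000\t15001\tvarA\nchr1\t16000\t16001\tvarB\n' > toy_hg38.bed

# First file: remember the BED end (the 1-based position) for each variant ID.
# Second file: rewrite column 4 for lifted variants and drop the rest.
awk 'BEGIN { OFS = "\t" }
     NR == FNR { pos[$4] = $3; next }
     ($2 in pos) { $4 = pos[$2]; print }' toy_hg38.bed toy_hg19.bim > toy_hg38.bim
cat toy_hg38.bim
```

The matching genotype files would also need to be restricted to the retained variants; that step is outside this sketch.
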
## Output Files

- **`output.bed`**: Contains successfully lifted coordinates in the target assembly
- **`unmapped.bed`**: Contains failed records, each preceded by a `#` line giving the reason (e.g., `#Deleted in new`, `#Partially deleted in new`, `#Split in new`)

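Since the unmapped file mixes reason lines with records, a plain `wc -l` counts roughly twice the number of failures. A small sketch, assuming the layout of one `#` reason line before each record as in the example output earlier in this document (the toy file name is made up):

```bash
# Toy unmapped file: one "#reason" line before each failed record
cat > toy_unmapped.bed << 'EOF'
#Deleted in new
chr1 1590525 1590526 1:1590526:G:C
#Deleted in new
chr1 1590574 1590575 1:1590575:G:A
EOF

# Count failed records (non-comment lines) rather than raw lines
n_failed=$(grep -vc '^#' toy_unmapped.bed)
echo "Failed records: $n_failed"

# Tally failures by reason
grep '^#' toy_unmapped.bed | sort | uniq -c
```

The per-reason tally gives a quick view of whether failures are mostly deletions or something else worth a closer look.
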
## Tips

- Always check the `unmapped.bed` file to see which positions failed and why
- For sumstats, convert 1-based positions to 0-based BED format before liftover
- After liftover, convert back to 1-based if needed for downstream analysis
- Some positions may fail due to assembly differences (centromeres, gaps, duplications), which is expected

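For single-base records, converting back to 1-based is just a matter of taking the BED end column. A minimal sketch on a toy lifted file follows; the coordinates are made up, and the file names and `CHR POS SNP` output layout are assumptions.

```bash
# Toy lifted BED (coordinates are illustrative only)
printf 'chr1\t1111110\t1111111\trs123\nchr2\t2222221\t2222222\trs789\n' > toy_lifted.bed

# The 1-based position of a single-base record equals its BED end column;
# the "chr" prefix is stripped to return to bare chromosome numbers
awk 'BEGIN { OFS = "\t"; print "CHR", "POS", "SNP" }
     { chr = $1; sub(/^chr/, "", chr); print chr, $3, $4 }' \
    toy_lifted.bed > toy_lifted_1based.txt
cat toy_lifted_1based.txt
```
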
## References

- [UCSC LiftOver Tool](https://genome.ucsc.edu/cgi-bin/hgLiftOver): web tool for converting coordinates between genome assemblies
- [UCSC liftOver binaries](http://hgdownload.soe.ucsc.edu/admin/exe/): download and documentation