morphic-bio/STAR-suite

STAR Suite

STAR Suite extends the original STAR aligner with four integrated modules (STAR-perturb, STAR-Flex, STAR-SLAM, and TranscriptVB) to provide complete internal C/C++ pipelines for bulk RNA-seq, scRNA-seq, Perturb-seq, 10x Flex, and SLAM-seq. The integration yields substantial speedups (1.7–2.4x for bulk RNA-seq, 1.47–1.60x for scRNA-seq GEX-only Solo vs CellGENI-style STARsolo, 3.7–6.2x for Perturb-seq, 2.5–28.8x for Flex) and a simpler toolchain that researchers and agents can install from pre-compiled binaries. No new external dependencies are required: the suite is built entirely with the existing STAR toolchain and vendored components, and it is a drop-in replacement for the STAR aligner.

STAR Suite supports partial compilation: build only the module/tool targets you need instead of building the full suite every time.

Agent quickstart: see AGENTS.md for repo-specific guardrails, tests, and recent changes.

Core Additions over STAR 2.7.11b

  • Speedup: Bulk RNA-seq 1.7–2.4x faster than external stepwise pipelines; scRNA-seq GEX-only Solo 1.47–1.60x faster than the CellGENI-style STARsolo parameter surface (UCSF 14K cells 1.60x, MSK 30K cells 1.47x on fresh 7a7fb08 reruns); Perturb-seq 3.7–6.2x faster than Cell Ranger 9; Flex 2.5x faster than Cell Ranger 9 (5.6x in no-align mode, ~12.8–28.8x vs Cell Ranger 7 with BAM) — all with near-identical parity.
  • Batch Mode (--batchMode 1): Processes multiple FASTQs in one STAR invocation while reusing the loaded genome. Removes the need for --genomeLoad keep-in-memory workflows. Single-pass only (no --twopassMode); not supported with Solo (--soloType). Use --outFileNamePrefixAuto 1 for per-sample subdirectories.
  • TranscriptVB Quantification (--quantMode TranscriptVB): Variational Bayes and EM quantification for transcript-level abundance, with parity-oriented behavior against Salmon alignment-mode. Gene-level summarization via --quantVBgenesMode Tximport.
  • Transcriptome Output (--quantTranscriptomeSAMoutput): Replaces the former --quantTranscriptomeBan with more explicit control (e.g., BanSingleEnd_ExtendSoftclip).
  • Reference Automation (--autoIndex Yes): Automated reference download/build with --cellrangerStyleIndex Yes formatting and --genomeGenerateTranscriptome Yes for transcript-level quant workflows.
  • Native Gzip FASTQ Handling: Automatic detection of .gz FASTQ inputs with internal zlib streaming — no --readFilesCommand zcat needed for correctness. Performance tuning of the internal gzip path is currently specific to Flex, where it is faster than external zcat; on the other benchmarked alignment/Solo surfaces below, external zcat remains the faster validated path. Legacy external helper remains available via --readFilesLegacyZcat Yes.
  • Cutadapt-Compatible Trimming (--trimCutadapt Yes): Native cutadapt-style trimming for bulk/PE workflows. Compatibility mode: --trimCutadaptCompat Cutadapt3.
  • Poly-G Trimming (--clip3pPolyG yes|no|auto): Trims poly-G artifacts common on NovaSeq/NextSeq platforms. Default auto activates in CellRanger4 mode. Without this, poly-G reads can inflate specific genes (e.g., LINC00486) and degrade gene-level correlations.
  • Samtools-style BAM Sorting (--outBAMsortMethod samtools): Spill-to-disk sort to reduce peak RAM pressure. Works with all modes including Flex.
  • Y/NoY Separation (--emitNoYBAM yes, --emitYNoYFastq yes): Split BAM and FASTQ outputs by chrY alignment. Works with bulk, single-cell, and Flex.
  • EmptyDrops_CR Integration: CR-compatible EmptyDrops path (including libscrna-backed behavior in scRNA/perturb flows).
  • Solo Features: sF BAM tag for feature type, --soloCBtype String for arbitrary barcode strings, --soloCellReadStats Standard for improved cell filtering.
  • CR-compat GEX (--soloCrGexFeature auto|gene|genefull): Controls which GEX source is merged in CR-compat mode.
  • CB/UB Tag Pairing (--soloCbUbRequireTogether yes|no): Enforce CB/UB tag pairing for tag injection (default yes).
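To make the poly-G bullet above concrete, here is a minimal sketch of what 3' poly-G trimming does to a single read. It is an illustration only, not the STAR Suite implementation behind --clip3pPolyG: the function name and the `min_run` threshold are hypothetical, and quality-aware or mismatch-tolerant trimming is not modeled.

```python
def trim_3p_poly_g(seq: str, qual: str, min_run: int = 10) -> tuple[str, str]:
    """Remove a trailing run of G bases from a read and its quality string.

    min_run is a hypothetical threshold: shorter trailing runs are kept,
    since a few terminal G bases are expected in real transcripts.
    """
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings must have equal length")
    # Length of the uninterrupted G run at the 3' end (uppercase bases assumed).
    run = len(seq) - len(seq.rstrip("G"))
    if run < min_run:
        return seq, qual
    keep = len(seq) - run
    return seq[:keep], qual[:keep]


# Toy read: 4 informative bases followed by a 12-base poly-G tail.
read_seq = "ACGT" + "G" * 12
read_qual = "I" * len(read_seq)
print(trim_3p_poly_g(read_seq, read_qual))
```

Reads whose tail is shorter than the threshold are returned unchanged, which is why such a threshold exists at all in this sketch.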

Folder Structure

core/
  legacy/                        # Upstream STAR layout (single source of truth)
  features/                      # Shared overlays and feature tooling
    process_features/            # Perturb feature extraction/calling implementation
    feature_barcodes/            # Standalone barcode tools (assignBarcodes, demux)
    libscrna/                    # EmptyDrops/OrdMag/Occupancy shared library
flex/                    # Flex-specific code + tools
slam/                    # SLAM-seq code + tools
build/                   # Modular make fragments
scripts/                 # Suite-level helper scripts (see scripts/README.md)
docs/                    # Suite-level docs
tests/                   # Suite-level tests (see tests/ARTIFACTS.md for artifact locations)
tools/                   # Suite-level scripts/utilities
mcp_server/              # MCP server for scripted discovery/preflight/run workflows

Modules

  • STAR-core (core/): Legacy STAR (indexing, bulk, Solo) plus shared utilities. Build: make core (binary at core/legacy/source/STAR).
  • STAR-perturb (core/legacy/ + core/features/process_features/): CR-compatible perturb-seq path with integrated feature extraction/calling (process_features + call_features) and crispr_analysis/ outputs in CR-compat mode. Primary run path: STAR --pfMultiConfig ... --defaultCrCompat yes (see STAR-perturb section below).
  • STAR-Flex (flex/): FlexFilter pipeline and Flex-specific integrations. Build tools: make flex or make flex-tools.
  • STAR-SLAM (slam/): SLAM-seq quantification, SNP masking, trimming/QC. Build tools: make slam or make slam-tools.
  • Feature Barcodes (core/features/feature_barcodes/): Standalone barcode tools (assignBarcodes, demux_bam, demux_fastq) for perturb-seq testing. Build tools: make feature-barcodes-tools.
  • Process Features (core/features/process_features/): Full feature extraction/calling pipeline (assignBarcodes, call_features, demux_bam, demux_fastq) and standalone tool (star_feature_call). Build tools: make process-features-tools, make star-feature-call.
  • Shared Feature Toolchains (core/features/): Reusable tool layers used across modules, including vbem (TranscriptVB helpers), yremove_* (Y/noY splitting), bamsort, and libscrna. Build tools: make vbem-tools, make yremove-tools, plus in-core integrations.
  • MCP Server (tooling) (mcp_server/): Agent automation service for dataset/test discovery and controlled execution (list_datasets, list_test_suites, preflight, run_script, collect_outputs). This is repo tooling, not an analysis module.
  • Helper Scripts (scripts/): Standalone Python and Bash tools for FASTQ preflight, QC, parity benchmarking, downstream h5ad processing, and fixture management. These are not compiled into STAR; they run independently. Highlights include preflight_library_pairing.py (chemistry detection and library pairing for mislabeled Perturb-seq), report_additional_parity_metrics.py (STAR vs CR parity), and build_gene_full_velocyto_h5ad.py (Velocyto h5ad packaging). See scripts/README.md for the full catalogue.

Benchmarks

All benchmarks were run on pikachu (i9-13900KF, 126 GB RAM, 32 threads). The table below is limited to the headline results. Publication-facing wrappers live in publications/benchmarks/README.md, archived benchmark artifacts in comparisons/paper_benchmarks_20260318/README.md, and detailed Velocyto bridge results in docs/VELOCYTO_BENCHMARKS.md.

| Workflow | Dataset / surface | Baseline | STAR-suite result | Key parity / note |
|---|---|---|---|---|
| Bulk RNA-seq | JAX PE 6.5M | External stepwise (Trim Galore + STAR + Salmon) | 37 s without Y-removal, 61 s with Y-removal; 2.4x / 2.1x faster | Transcript Pearson 0.995, gene Pearson 0.997 vs Salmon |
| Bulk RNA-seq | PPARG PE 35.1M | External stepwise (Trim Galore + STAR + Salmon) | 9m 35s without Y-removal, 11m 58s with Y-removal; 1.7x / 2.1x faster | Same integrated trim + align + TranscriptVB path |
| scRNA-seq Solo | UCSF EBs2_2 GEX-only | Historical CellGENI-style STARsolo (7a7fb08) | 13.75 min optimized zcat vs 22.1 min historical rerun; 1.60x faster | Fresh historical rerun reproduced 13,847 cells, Jaccard 0.9891, gene Pearson 0.964305 vs CR9; current zcat surface calls 13,723 cells with gene Pearson 0.994885 |
| scRNA-seq Solo | MSK 30polyKO GEX-only | Historical CellGENI-style STARsolo (7a7fb08) | 19.40 min archived modern wall vs 28.6 min historical rerun; 1.47x faster | Fresh historical rerun reproduced 32,304 cells, Jaccard 0.9975, gene Pearson 0.954925 vs CR9; guarded current surface calls 33,092 cells with Jaccard 0.974, gene Pearson 0.994554 |
| Perturb-seq | A375 1k CRISPR 5' GemX | Cell Ranger 9 | 4.0 min; 3.8x faster | Jaccard 0.976, gene Pearson 0.975, CRISPR match 100% |
| Perturb-seq | UCSF EBs2_2 | Cell Ranger 9 | 16.4 min; 3.7x faster | Jaccard 0.976, gene Pearson 0.995, CRISPR match 98.9% |
| Perturb-seq | MSK 30polyKO | Cell Ranger 9 (separate GEX+gRNA and GEX+LARRY runs) | 26.9 min; 6.2x faster | Jaccard 0.9742, gene Pearson 0.994554, CRISPR match 98.0% |
| Flex | JAX SC2300771 4-tag full / no-align | Cell Ranger 9 / 7 | 23m 22s full (2.5x vs CR9), 10m 33s no-align (5.6x vs CR9) | Fresh canonical 4-tag reruns (BC004/BC006/BC007/BC008) reproduce 20,316 cells; mean Jaccard 0.981, cell Pearson 0.99997, gene Pearson 0.99993 vs CR9 |
| SLAM-seq | NW-5-21 ARID1A compat mode | GEDI / GRAND-SLAM family | Integrated single-pass alignment + quantification; no apples-to-apples end-to-end wall-time claim reported | NTR Pearson 0.967-0.978, Spearman 0.985-0.990 vs GEDI |

Perturb-seq is the main performance result: on A375, UCSF, and MSK surfaces, STAR-suite runs 3.7x-6.2x faster than Cell Ranger 9 while maintaining near-identical GEX/cell metrics and 98.0-100% CRISPR call agreement.
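The "Nx faster" factors read as wall-time ratios (baseline time divided by STAR-suite time). The sketch below recomputes them for the two Solo rows, the only rows where both wall times appear in the table. The `speedup` helper is hypothetical, and because the published wall times are themselves rounded, a recomputed ratio can differ from the reported factor in the last digit.

```python
def speedup(baseline_min: float, suite_min: float) -> float:
    """Return how many times faster the suite run is than the baseline run."""
    if baseline_min <= 0 or suite_min <= 0:
        raise ValueError("wall times must be positive")
    return baseline_min / suite_min


# (baseline minutes, STAR-suite minutes), copied from the Solo rows of the table.
solo_rows = {
    "UCSF EBs2_2 GEX-only": (22.1, 13.75),
    "MSK 30polyKO GEX-only": (28.6, 19.40),
}

for name, (baseline, suite) in solo_rows.items():
    print(f"{name}: {speedup(baseline, suite):.2f}x")
```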

For non-Flex Solo, this README summarizes only the historical CellGENI-style baseline versus the current optimized surface. On this host, external zcat remains the fastest validated read path for UCSF/MSK GEX-only and perturb runs; native .gz input works but is not yet the fastest on those surfaces. Fresh 7a7fb08 reruns reproduced the archived CellGENI filtered barcode sets exactly for both UCSF and MSK, so the historical Jaccard values above are validated rather than carried over from stale artifacts.
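The Jaccard values in this section appear to compare sets of called cell barcodes between two runs. A minimal sketch of that metric follows, assuming barcodes are plain strings that have already been normalized (for example, any `-1` suffix handled consistently). It is not the code in scripts/report_additional_parity_metrics.py, and the barcodes shown are toy values.

```python
from typing import Iterable


def jaccard(barcodes_a: Iterable[str], barcodes_b: Iterable[str]) -> float:
    """Jaccard index |A & B| / |A | B| of two barcode collections."""
    set_a, set_b = set(barcodes_a), set(barcodes_b)
    union = set_a | set_b
    if not union:
        # Two empty call sets are treated as identical by convention here.
        return 1.0
    return len(set_a & set_b) / len(union)


# Toy barcode lists (made up for illustration, not benchmark data).
run_a = ["AAAC", "AAAG", "AACT", "AAGT"]
run_b = ["AAAC", "AAAG", "AACT", "ACCT"]
print(f"Jaccard = {jaccard(run_a, run_b):.3f}")
```

A value of 1.0 means both runs call exactly the same barcodes, which is the situation described for the 7a7fb08 reruns against the archived CellGENI sets.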

All perturb parity metrics above were computed with scripts/report_additional_parity_metrics.py --gene-corr-min-counts 20 --gene-corr-min-cells-pct 0.01 per docs/PAPER_BENCHMARK_METHODOLOGY.md. CR9 references use refdata-gex-GRCh38-2024-A unless noted otherwise in the archived benchmark artifacts.

For MSK specifically, the historical raw-matrix EmptyDrops isolation result and the real guarded end-to-end benchmark surface are separated in docs/MSK_BENCHMARK_SURFACE_AUDIT_20260403.md.

Building & Installing

From source

# Core STAR binary
make core

# Module-focused builds
make flex           # core + Flex tools
make slam           # core + SLAM tools

# Individual tool targets
make feature-barcodes-tools    # assignBarcodes/demux (standalone)
make process-features-tools    # full process_features pipeline
make star-feature-call         # standalone feature caller
make vbem-tools                # TranscriptVB helpers
make yremove-tools             # Y/noY splitting tools

# Default build (core + common tools)
make                           # or: make default

# Build everything
make all

Selective filtering:

make default INCLUDE="core slam-tools"
make default EXCLUDE="flex-tools"

Run make help to see the full target list and descriptions.

From release artifacts

# Ubuntu package from a local artifact
sudo apt install ./star-suite_<version>_<arch>.deb

# Installer tarball (auto-detects host glibc level)
tar -xzf STAR-suite-<version>-linux-<arch>-installer.tar.gz
cd STAR-suite-<version>-linux-<arch>-installer
./install.sh

# Manual compatibility tarball
tar -xzf STAR-suite-<version>-linux-<arch>-glibc234.tar.gz
cd STAR-suite-<version>-linux-<arch>-glibc234
./install.sh

Release tarballs are validated in clean Ubuntu 22.04 and 24.04 Docker containers before publication. The installer bundle auto-detects the host glibc level and chooses the right bundled binary.

Packaging/release details and artifact policy:

  • docs/Star-binary-distribution.md
  • docs/Github-actions.md

Compilation details (module-by-module, clean rebuilds, and clean Ubuntu 24.04 validation):

  • docs/compile_instructions.md

Docker

A multi-stage Docker setup (Ubuntu 24.04) provides a clean build environment and separate runtime/test images.

Builder stage: Compiles STAR Suite from source with no host leakage. Validates make core, flex, slam, feature-barcodes-tools, default, and all.

Suite base runtime (suite-base): Minimal executable image with suite binaries (e.g. STAR) and no Python/test-only helpers.

Test images (built from suite-base):

  • test-tier-a: self-contained smoke helpers.
  • test-tier-b: fixture-backed helper stack (e.g. python3, bc, samtools).

Quickstart

# Build suite base image (default tag: biodepot/star-suite:latest)
./scripts/docker/build_image.sh

# Override tag or parallel jobs
IMAGE_TAG=myorg/star-suite:v1 MAKE_JOBS=8 ./scripts/docker/build_image.sh

# Reproducibility check: force a clean rebuild (no cache)
docker build --no-cache --target suite-base -f docker/Dockerfile -t biodepot/star-suite:latest --build-arg MAKE_JOBS=8 .

# Run STAR from suite base image
docker run --rm biodepot/star-suite:latest

# Run Tier A smoke tests (builds/uses test-tier-a image)
./scripts/docker/run_smokes_tier_a.sh

# Run Tier B smoke tests (builds/uses test-tier-b image; requires fixtures)
./scripts/docker/run_smokes_tier_b.sh

Fixture mount for Tier B

Tier B tests require data under /storage. Mount your fixture root:

docker run --rm -v /path/to/your/data:/storage biodepot/star-suite:test-tier-b bash -c "tests/run_cbub_regression_test.sh"

By default, ./scripts/docker/run_smokes_tier_b.sh uses STORAGE=/storage. To use a different fixture root, set STORAGE=/path; the script passes that value to the -v mount.

Expected layout: /storage/A375, /storage/flex_filtered_reference, etc. See plans/docker_plan.md for full fixture roots.

STAR_BIN override

Smoke tests honor STAR_BIN to decouple from source-relative paths. Docker smoke wrappers set STAR_BIN=/usr/local/bin/STAR automatically.

Validation

See docs/docker_validation.md for the latest portability check results.

Module Reference

This section documents the key features and flags for each module. For standard STAR flags not listed here, see core/legacy/README.md. Core additions are listed above in Core Additions over STAR 2.7.11b.

Flex

See flex/README_flex.md for the full pipeline reference.

STAR-Flex uses a pseudo-chromosome alignment approach: probe sequences are embedded as pseudo-chromosomes in a hybrid reference genome, and STAR's native alignment machinery handles gene assignment. Core features (trimming, spill-to-disk sorting, Y-chromosome splitting, TranscriptVB) all work with Flex.

Key flags:

  • --flex yes: Enable Flex pipeline.
  • --soloFlexExpectedCellsPerTag: Expected cells per sample tag.
  • --soloSampleWhitelist: TSV mapping sample tags to labels.
  • --soloProbeList: Probe gene list (auto-detected from index if omitted).
  • --soloSampleProbes: 10x probe barcode sequences file.

Features:

  • Sample tag detection, 1MM pseudocount correction for CBs, clique-based UMI deduplication, and occupancy filtering.
  • Y-chromosome splitting tested and validated (tests/TEST_REPORT_Y_SPLIT_FLEX.md).

Flex Parity: CR9-Projected Leiden UMAP

Using a fixed CR9 embedding removes the visual ambiguity from independently fit UMAPs. When full-align and no-align are both projected into the same CR9 PCA/UMAP space, they use the same 13 CR9 Leiden clusters and agree almost perfectly on shared cells: projected-label ARI 0.9979, NMI 0.9967 on 20,315 shared cells.

[Figure: three Leiden UMAP panels in the shared CR9 embedding. Panels: CR9 reference; STAR-Flex full projected to CR9; STAR-Flex no-align projected to CR9.]

SLAM

See slam/docs/SLAM_COMPATIBILITY_MODE.md and slam/docs/SLAM_seq.md.

Integrated SLAM-seq quantification with GRAND-SLAM parity.

Key flags:

  • --slamQuantMode 1: Enable SLAM quantification.
  • --slamGrandSlamOut 1: Generate GRAND-SLAM compatible output.
  • --slamCompatMode gedi: Enable GEDI compatibility (intronic classification, lenient overlap, overlap weighting).
  • --slamCompatIntronic, --slamCompatLenientOverlap: Fine-grained compat control.
  • --autoTrim variance: Variance-based detection of artifact-prone read ends.
  • --slamTrim5p, --slamTrim3p: Manual trim guards.
  • --slamErrorRateFromBlank 1: Seed error rate from a blank (e.g. no4sU) sample.
  • --outFileNamePrefixAuto 1: Derive sample name from first FASTQ and route outputs into subdirs.
  • --slamDumpBinary 1 --slamDumpWeights 1: Emit binary dumps for offline re-quantification with slam_requant.

Features:

  • Full gene-level NTR estimation (Binomial/EM models).
  • Auto-trimming: variance-based detection of artifact-prone read ends.
  • QC: comprehensive interactive HTML reports for T->C rates and error modeling.
  • Batch layout organizes outputs into alignments/, counts/, qc/, y_separated/.
  • Binary dump format documented in slam/docs/SLAM_DUMP_FORMAT.md.

STAR-perturb / CR-Compat

See docs/feature_barcodes.md and docs/CRISPR_FEATURE_CALLING_IMPLEMENTATION_SUMMARY.md.

CR-compatible Solo behavior with integrated CRISPR feature calling.

Key flags:

  • --pfMultiConfig: Cell Ranger-style multi processing with feature libraries.
  • --defaultCrCompat yes: Apply the CR-compat perturb defaults bundle.
  • --dynamicThreadInterface 1: Enable STAR/PF permit coordination.
  • --dynamicThreadConstMapPermits 32: Start with full map-side permit budget.
  • --crAssignConsumerThreads 32: Provision PF worker pool to full host budget.
  • --crAssignSearchThreads 1: Per-consumer search-thread mode.
  • --crMinUmi: Minimum UMI threshold for CRISPR feature calling (default 10; lower to 2-3 for lineage barcodes).
  • --soloCrGexFeature: Control merged GEX source (auto, gene, genefull).
  • --soloCrMode CR: Enable CR-compatible single-cell behavior.
  • --crChemistry: Barcode chemistry (auto, NXT, TRU). Default auto enables per-library auto-detection. Mixed NXT/TRU experiments are handled automatically; per-library overrides via the star_chemistry column in --pfMultiConfig.

Recommended execution profile (32-thread host):

--runThreadN 32 --dynamicThreadInterface 1 --dynamicThreadConstMapPermits 32 \
--dynamicThreadTelemetry 1 --crAssignConsumerThreads 32 --crAssignSearchThreads 1
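The profile above is written for a 32-thread host. The sketch below generates the same flag set for a given thread count. The flag names are taken from this section, but the helper `perturb_thread_flags` is hypothetical, and scaling every budget 1:1 with the host thread count on other machines is an assumption that has not been validated here.

```python
import os
from typing import List, Optional


def perturb_thread_flags(n_threads: Optional[int] = None) -> List[str]:
    """Build the thread-related STAR-perturb flags for a host.

    Assumption: the 32-thread profile generalizes by setting the map-side
    permit budget and the PF consumer pool equal to the thread count.
    """
    if n_threads is None:
        n_threads = os.cpu_count() or 1
    if n_threads < 1:
        raise ValueError("n_threads must be at least 1")
    n = str(n_threads)
    return [
        "--runThreadN", n,
        "--dynamicThreadInterface", "1",
        "--dynamicThreadConstMapPermits", n,
        "--dynamicThreadTelemetry", "1",
        "--crAssignConsumerThreads", n,
        "--crAssignSearchThreads", "1",
    ]


# Reproduces the documented 32-thread profile as a single argument string.
print(" ".join(perturb_thread_flags(32)))
```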

Standalone tool (star_feature_call):

  • --compat-perturb: CR9-compatible output layout (crispr_analysis/).
  • --feature-ref, --whitelist, --fastq-dir, --output-dir: FASTQ -> MEX -> calls.
  • --call-only --mex-dir: call_features-only pass on existing MEX.
  • --emptydrops-use-fdr, --min-umi, --ratio-test: calling controls.

QC Outputs

  • SLAM QC (--slamQcReport <prefix>): Interactive HTML report (.html) and JSON metrics (.json) for T->C conversion rates, variance analysis, and trimming overlays.
  • FlexFilter QC (flexfilter_summary.tsv): Cell calling statistics (EmptyDrops/OrdMag), cell counts, UMI thresholds, and filtering rates per sample.

Sample Commands

Core alignment:

core/legacy/source/STAR \
  --runMode alignReads \
  --genomeDir /path/to/genome_index \
  --readFilesIn reads.fq.gz \
  --outFileNamePrefix out/ \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMattributes NH HI AS nM MD

Batch mode (bulk, single-pass, SE):

core/legacy/source/STAR \
  --runMode alignReads \
  --genomeDir /path/to/genome_index \
  --readFilesIn A_R1.fq.gz,B_R1.fq.gz \
  --outFileNamePrefix /path/to/out_root/ \
  --outFileNamePrefixAuto 1 \
  --batchMode 1 \
  --outSAMtype BAM SortedByCoordinate

Batch mode (bulk, single-pass, PE):

core/legacy/source/STAR \
  --runMode alignReads \
  --genomeDir /path/to/genome_index \
  --readFilesIn A_R1.fq.gz,B_R1.fq.gz A_R2.fq.gz,B_R2.fq.gz \
  --outFileNamePrefix /path/to/out_root/ \
  --outFileNamePrefixAuto 1 \
  --batchMode 1 \
  --outSAMtype BAM SortedByCoordinate
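In the two batch commands above, --readFilesIn takes one comma-separated list per mate, with samples in the same order in each list. The sketch below assembles that argument from per-sample FASTQ tuples. The helpers `batch_read_files_in` and `batch_command` are hypothetical and only emit flags already shown in this README; nothing is executed.

```python
import shlex
from typing import List, Sequence, Tuple


def batch_read_files_in(samples: Sequence[Tuple[str, ...]]) -> List[str]:
    """Return one comma-joined FASTQ list per mate (1 for SE, 2 for PE)."""
    if not samples:
        raise ValueError("at least one sample is required")
    n_mates = len(samples[0])
    if n_mates not in (1, 2) or any(len(s) != n_mates for s in samples):
        raise ValueError("every sample needs the same number of mates (1 or 2)")
    # Sample order must match across mate lists so R1/R2 stay paired.
    return [",".join(s[i] for s in samples) for i in range(n_mates)]


def batch_command(star_bin: str, genome_dir: str, out_root: str,
                  samples: Sequence[Tuple[str, ...]]) -> List[str]:
    """Assemble the batch-mode argv shown in the sample commands."""
    return [
        star_bin,
        "--runMode", "alignReads",
        "--genomeDir", genome_dir,
        "--readFilesIn", *batch_read_files_in(samples),
        "--outFileNamePrefix", out_root,
        "--outFileNamePrefixAuto", "1",
        "--batchMode", "1",
        "--outSAMtype", "BAM", "SortedByCoordinate",
    ]


pe_samples = [("A_R1.fq.gz", "A_R2.fq.gz"), ("B_R1.fq.gz", "B_R2.fq.gz")]
cmd = batch_command("core/legacy/source/STAR", "/path/to/genome_index",
                    "/path/to/out_root/", pe_samples)
print(shlex.join(cmd))
```

Building the command as a list keeps each mate list a single argument, so it can be handed to a process launcher without extra shell quoting.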

Flex Mode (10x Fixed RNA Profiling):

core/legacy/source/STAR \
  --runMode alignReads \
  --genomeDir /path/to/flex_index \
  --readFilesIn reads_R2.fq.gz reads_R1.fq.gz \
  --flex yes \
  --soloType CB_UMI_Simple \
  --soloCBwhitelist /path/to/737K-fixed-rna-profiling.txt \
  --soloSampleWhitelist sample_whitelist.tsv \
  --outFileNamePrefix output/

SLAM Mode (Standard):

core/legacy/source/STAR \
  --runMode alignReads \
  --genomeDir /path/to/genome_index \
  --readFilesIn reads.fq.gz \
  --outFileNamePrefix out/ \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMattributes NH HI AS nM MD \
  --slamQuantMode 1 \
  --slamSnpBed /path/to/snps.bed

SLAM Mode (GEDI Compatibility):

core/legacy/source/STAR \
  --runMode alignReads \
  --genomeDir /path/to/genome_index \
  --readFilesIn reads.fq.gz \
  --slamQuantMode 1 \
  --slamCompatMode gedi \
  --autoTrim variance \
  --outFileNamePrefix output/

SLAM Batch Mode (blank-first, SE/PE):

core/legacy/source/STAR \
  --runMode alignReads \
  --genomeDir /path/to/genome_index \
  --readFilesIn blank_R1.fq.gz,0h_R1.fq.gz,6h_R1.fq.gz,24h_R1.fq.gz \
  --outFileNamePrefix /path/to/out_root/ \
  --outFileNamePrefixAuto 1 \
  --slamQuantMode 1 \
  --slamBatchMode 1 \
  --slamErrorRateFromBlank 1 \
  --slamSnpBed /path/to/snps.bed

For paired-end, pass two comma-separated mate lists: --readFilesIn blank_R1.fq.gz,0h_R1.fq.gz,... blank_R2.fq.gz,0h_R2.fq.gz,...

STAR-perturb (integrated CR-compat mode):

core/legacy/source/STAR \
  --runMode alignReads \
  --runThreadN 32 \
  --genomeDir /path/to/index \
  --pfMultiConfig /path/to/multi_config.csv \
  --dynamicThreadInterface 1 \
  --dynamicThreadConstMapPermits 32 \
  --crAssignSearchThreads 1 \
  --defaultCrCompat yes \
  --outFileNamePrefix /path/to/outs/

STAR-perturb (standalone feature pipeline):

core/legacy/source/star_feature_call \
  --compat-perturb \
  --feature-ref /path/to/feature_reference.csv \
  --whitelist /path/to/whitelist.txt \
  --fastq-dir /path/to/feature_fastqs \
  --filtered-barcodes /path/to/filtered_barcodes.tsv \
  --output-dir /path/to/feature_out \
  --emptydrops-use-fdr \
  --min-umi 10

Codespaces Walkthroughs

STAR Suite includes GitHub Codespaces walkthroughs for the main module entry points.
