Purpose: Program entry point and command-line interface Key Functions:
main(): Argument parsing, initialization, and process orchestrationprint_usage(): Comprehensive CLI help display- Multi-process fork/wait pattern for parallel sample processing
Technical Details:
- Uses
getopt_long()for extensive CLI parsing (42+ options) - Initializes global lookup tables:
seq2code,code2seq,diff2Hamming - Creates shared memory for thread counting using
mmap() - Forks separate processes for each sample with configurable concurrency limits
- Memory management through
sample_argsstructure passed to child processes
Key Data Structures:
sample_args: Comprehensive parameter structure for sample processingfastq_files_collection: File organization and sample mapping- Global hash tables for whitelist and feature codes
Purpose: Core barcode assignment and UMI deduplication logic Key Functions:
process_files_in_sample(): Main sample processing orchestratorfind_deduped_counts(): UMI clique analysis and deduplicationadd_deduped_count(): Stringency-based count assignmentprintFeatureCounts(): Matrix Market output generationcheckAndCorrectBarcode(): Barcode error correction with posterior probabilities
Technical Details:
- Producer-consumer threading model with ring buffer I/O
- Connected component analysis for UMI error correction (Hamming distance ≤1)
- Three stringency modes: RNA-seq (0), proportional (1-999), strict (≥1000)
- Memory pool allocation for all dynamic structures
- Interactive histogram and heatmap generation
Key Algorithms:
- Graph-based UMI deduplication using breadth-first search
- Bayesian posterior probability for barcode rescue
- Fuzzy matching with configurable Hamming distance thresholds
- Exhaustive vs. targeted feature search strategies
Purpose: Sequence encoding, feature lookup, and matching utilities Key Functions:
string2code(): Convert DNA sequences to 2-bit packed representationcode2string(): Decode packed sequences back to DNA stringsfeature_lookup_kmer(): Optimized k-mer lookup with direct/hash modescheck_sequence(): Validate DNA sequences (ACGT only)
Technical Details:
- 2-bit DNA encoding (A=0, C=1, G=2, T=3) for memory efficiency
- Direct 64-bit comparison for small feature sets (≤128 variants)
- Hash table fallback using GLib GBytes for larger feature sets
- Precomputed lookup tables for sequence conversion and Hamming distances
Optimization Features:
- SIMD-friendly bit operations for sequence comparison
- Cache-efficient direct array lookup for small probe sets
- Lazy initialization of 64-bit probe tables
Purpose: BAM file processing for STAR Solo output Key Functions:
process_bam_single(): Main BAM processing loopextract_kmer_from_bam(): Extract probe k-mers from BAM sequence data- Probe detection with configurable offset and nearby search
Technical Details:
- HTSlib integration for BAM reading and tag extraction
- Extracts CB (cell barcode), UB (UMI), GX/GE (gene) tags
- Handles both forward and reverse-strand reads with complement calculation
- Memory-efficient counting without heavy deduplication tables
- Configurable filtering: MAPQ, duplicates, alignment type
BAM Processing Features:
- Primary alignment filtering with optional secondary/supplementary inclusion
- Duplicate read filtering via BAM_FDUP flag
- Intergenic read handling (GX='-') with optional inclusion
- Probe search with fallback offset testing (±1, ±2)
Purpose: FASTQ-based sample demultiplexing Key Functions:
process_fastq_single(): Main FASTQ processing with producer-consumer modelextract_probe_and_demux(): Probe extraction and sample assignmentget_or_open_sink(): Lazy output file creation and management
Technical Details:
- Multi-threaded producer-consumer with ring buffer
- Per-sample output sinks with mutex-protected concurrent writing
- Direct 64-bit probe comparison vs. hash table lookup
- Automatic file organization with gzip compression
Performance Optimizations:
- Lock-free ring buffer for high-throughput I/O
- Per-sink mutexes to prevent write interleaving
- Configurable buffer sizes and thread counts
Purpose: File I/O operations and data structure initialization Key Functions:
read_features_file(): Parse CSV feature files with flexible column detectionorganize_fastq_files_by_directory(): Automatic FASTQ file discovery and groupingallocate_feature_arrays(): Memory allocation for feature data structuresread_whiteList(): Barcode whitelist loading with validation
Technical Details:
- Flexible CSV parsing with header detection
- Pattern-based FASTQ file matching (R1, R2, R3)
- Memory-efficient feature array allocation with aligned storage
- Whitelist validation with sequence length checking and error correction
File Organization Features:
- Automatic sample detection from directory structure
- Support for explicit file lists vs. directory scanning
- Pattern matching for different sequencing platforms
- Gzip-aware file handling
Purpose: Custom memory pool management system Key Functions:
initialize_memory_pool_collection(): Create pools for all data typesallocate_memory_from_pool(): Fast pool-based allocationexpand_memory_pool(): Dynamic pool expansionfree_memory_pool_collection(): Cleanup all pools
Technical Details:
- Pre-allocated memory blocks for different structure types
- Block-based allocation with automatic expansion
- Zero-copy pool expansion with linked block lists
- Thread-safe allocation within individual pools
Pool Types:
feature_counts_pool: Barcode-feature count matricesfeature_umi_counts_pool: UMI-level count trackingfeature_sequences_pool: Matched sequence storageunmatched_barcodes_features_block_pool: Error correction candidatescb_counts_pool: Cell barcode count arrays
Purpose: QC heatmap generation using Cairo graphics Key Functions:
generate_heatmap(): Feature richness heatmap creationgenerate_deduped_heatmap(): UMI count distribution heatmapdraw_heatmap_core(): Shared rendering engineselect_colormap(): Dynamic colormap selection based on data range
Technical Details:
- Cairo-based PNG generation with anti-aliasing
- Dynamic colormap selection (16/64/256/1024 colors) based on data range
- Integrated bar graphs showing column/row summaries
- Configurable cell sizes and padding for different data scales
Visual Features:
- Plasma colormap for perceptually uniform color scaling
- Feature name labels with automatic text sizing
- Color bars with value annotations
- Bar graphs showing marginal distributions
Purpose: Interactive HTML histogram generation Key Functions:
plot_simple_histogram(): Generate Plotly.js-based interactive plotsgenerate_histogram_html(): HTML template with embedded data- Support for linear/log scale switching
Technical Details:
- Plotly.js integration for interactive visualization
- JSON data embedding with proper escaping
- Responsive design with scale switching controls
- Hover tooltips with detailed information
Purpose: Utility functions and file operations Key Functions:
mkdir_p(): Recursive directory creationfile_exists(): File existence checkingget_basename(): Cross-platform basename extractionhash_int32(),equal_int32(): Hash table utilities for 32-bit keys
Purpose: Thread-safe queue implementation for BFS operations Key Functions:
init_queue(): Initialize queue with dynamic sizingenqueue(),dequeue(): Thread-safe queue operationsis_empty(),clear_queue(): Queue state management
Technical Details:
- Circular buffer implementation with automatic resizing
- Used for connected component analysis in UMI deduplication
- Lock-free operations for high-performance graph traversal
Purpose: Global variable definitions and lookup table initialization Key Variables:
seq2code[],code2seq[][]: DNA sequence encoding/decoding tablesdiff2Hamming[]: Hamming distance lookup for XOR differenceswhitelist,whitelist_hash: Barcode whitelist storagefeature_code_hash: Global feature code lookup table
Purpose: Integration test for assignBarcodes with realistic parameters
Usage: ./test_assign.sh
Features:
- Tests full pipeline with NXT barcode translation
- Uses production-like parameters (limit_search=2, max_barcode_mismatches=1)
- Validates against known reference feature set
Purpose: Comprehensive test suite for demux_fastq Features:
- Tests both hash and direct search modes
- Input/output read count validation
- Per-file-set consistency checking
- Performance timing and comparison
Purpose: Production demux_fastq runner with error checking Features:
- Environment variable configuration
- Input validation and error handling
- Performance benchmarking with timing
- Output verification and consistency checks
Purpose: FASTQ file downsampling for testing and development
Usage: ./downsample_fastq_directory.sh <nReads> [dir]
Features:
- Preserves gzip compression in outputs
- Handles both .fastq and .fastq.gz files
- Creates downsampled/ subdirectory with same file structure
- SIGPIPE-safe gzip handling
Technical Details:
- Uses
head -nfor precise line counting (4 lines per read) - Maintains file extensions and compression
- Robust error handling with
set -euo pipefail
Purpose: Optimize probe detection offset parameters
Usage: python3 offset_sweep.py --test_dir <dir> --offset <center> --window <range>
Features:
- Samples configurable number of reads for analysis
- Tests offset ranges around expected position
- Generates tabular output with match percentages
- Supports all read types (R1, R2, R3)
Algorithm:
- Loads probe k-mers from barcode table
- Extracts sequences from FASTQ files
- Tests exact matches at different offsets
- Reports match rates for optimization
Purpose: Generate C header files for heatmap colormaps Features:
- Creates plasma colormap arrays at different resolutions
- Outputs C-compatible header files
- Supports 16, 64, 256, and 1024 color levels
- Ensures perceptually uniform color scaling
Purpose: STAR Solo alignment wrapper with multiple tool support Features:
- Configurable barcode/UMI lengths via environment variables
- Support for STAR Solo, Alevin-fry, and Kallisto
- Automatic output directory organization
- Comprehensive STAR parameters for single-cell RNA-seq
STAR Configuration:
- CB_UMI_Simple chemistry with configurable lengths
- EmptyDrops cell filtering
- Multi-gene UMI filtering and 1MM_CR deduplication
- Complete SAM attribute output (CB, UB, CR, CY, UR, UY, GX, GN)
Purpose: BAM processing example with production parameters Features:
- Configurable probe offset and thread counts
- MAPQ filtering and verbose output
- Example of typical demux_bam usage patterns
Purpose: Signal-to-noise analysis for feature assignment results Features:
- Processes Matrix Market output files
- Calculates feature-specific signal metrics
- Generates summary statistics for QC assessment
Purpose: Matrix Market file analysis and validation Features:
- Counts unique values in sparse matrices
- Validates Matrix Market format compliance
- Generates distribution statistics for count data
Purpose: Memorial Sloan Kettering gene assignment workflow Features:
- Institution-specific parameter sets
- Optimized for MSK sequencing protocols
- Production-ready parameter validation
- Pool-based allocation: Eliminates malloc/free overhead for frequent allocations
- Block-based expansion: Reduces memory fragmentation
- Thread-local pools: Minimizes contention in multi-threaded contexts
- Aligned structures: Optimizes cache performance
- Process-level: Multiple samples processed simultaneously
- Thread-level: Producer-consumer model within samples
- SIMD optimization: Vectorized sequence operations
- Lock-free data structures: Ring buffers and atomic operations
- Bayesian error correction: Posterior probability-based barcode rescue
- Connected component analysis: UMI error correction through graph algorithms
- Configurable stringency: Multiple deduplication strategies
- Comprehensive validation: Input validation and format checking
- 2-bit sequence encoding: 4x memory reduction for DNA sequences
- Direct vs. hash lookup: Automatic algorithm selection based on data size
- Memory-mapped I/O: Efficient file handling for large datasets
- Lazy initialization: On-demand resource allocation
This architecture provides a scalable, high-performance solution for single-cell feature assignment with extensive configurability and robust error handling.