Top 10 Tips for Accurate Copy-Number Calling with Stitchcnv Library
Accurate copy-number variant (CNV) calling from single-cell or low-input sequencing data is challenging: technical noise, coverage variability, and biological heterogeneity all confound detection. Stitchcnv is a library designed to improve CNV calling by stitching together signals across adjacent bins and cells, applying normalization and denoising strategies tuned for sparse single-cell data. This article gives ten concrete, practical tips to get the most accurate CNV calls from Stitchcnv, covering data preparation, parameter tuning, quality control, and downstream validation.
1 — Start with high-quality input data
Garbage in, garbage out. Stitchcnv’s performance depends heavily on the quality of read alignments and bin counts.
- Use a reliable aligner (BWA-MEM, Bowtie2) and mark duplicates. For single-cell DNA-seq, deduplication can be tricky; follow best practices for your protocol.
- Filter out low-quality reads (e.g., MAPQ < 30) and secondary/supplementary alignments.
- Remove mitochondrial reads and known problematic regions (e.g., centromeres, telomeres, large segmental duplications) that produce artifactual coverage.
- If using scRNA-derived CNV proxies (expression-derived CNV calling), ensure correct gene-to-bin mapping and robust normalization for expression biases.
Concrete checks:
- Per-cell total read counts and fraction of mapped reads.
- Coverage uniformity across the genome (GC bias plots, per-bin mean/variance).
- Library complexity estimates (unique fragments per cell).
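The checks above can be collected into one per-cell summary. The following is a minimal sketch, not part of Stitchcnv: `cell_qc` and its field names are hypothetical, the inputs are assumed to be precomputed from your alignment statistics, and the duplication rate is taken here as one minus the ratio of unique fragments to mapped reads.

```python
from statistics import mean, pvariance

def cell_qc(total_reads, mapped_reads, unique_fragments, bin_counts):
    """Summarize basic QC metrics for one cell (hypothetical helper)."""
    mu = mean(bin_counts)
    return {
        "total_reads": total_reads,
        "mapped_fraction": mapped_reads / total_reads if total_reads else 0.0,
        # Assumption: duplication rate = 1 - unique fragments / mapped reads.
        "duplication_rate": 1 - unique_fragments / mapped_reads if mapped_reads else 0.0,
        "bin_mean": mu,
        # Variance-to-mean ratio of bin counts: close to 1 for ideal Poisson
        # sampling, larger when coverage is uneven.
        "dispersion_index": pvariance(bin_counts) / mu if mu else float("inf"),
    }

# Toy numbers, for illustration only.
qc = cell_qc(total_reads=1000, mapped_reads=900,
             unique_fragments=720, bin_counts=[8, 10, 12, 10])
for key, value in qc.items():
    print(key, value)
```

Plotting these values across all cells usually makes outliers obvious before any CNV calling is attempted.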
2 — Choose an appropriate bin size
Bin size is a crucial tradeoff between resolution and noise. Smaller bins increase resolution but also increase variance; larger bins smooth noise but can miss focal events.
- For low-coverage single-cell DNA-seq: use larger bins (e.g., 500 kb–1 Mb).
- For higher-coverage single-cell or pseudo-bulk data: 100 kb–200 kb bins may be appropriate.
- For scRNA-derived CNV inference, bin by gene windows (e.g., 10–50 genes per bin) rather than fixed genomic length.
Tip: Run Stitchcnv with two or three bin sizes (coarse and fine) and compare the results; calls that agree across scales are more reliable.
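One way to obtain matched coarse and fine bin sets is to count reads once at the finest size and sum neighbouring bins. A minimal sketch, assuming the counts are an ordered list for a single chromosome; `rebin` is a hypothetical helper, not a Stitchcnv function.

```python
def rebin(counts, factor):
    """Sum consecutive groups of `factor` fine bins into coarse bins.

    A trailing partial group is kept as a shorter final bin.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return [sum(counts[i:i + factor]) for i in range(0, len(counts), factor)]

# Toy counts for eleven 200 kb bins on one chromosome.
fine = [3, 5, 4, 6, 2, 7, 5, 5, 4, 6, 9]
coarse = rebin(fine, 5)  # five 200 kb bins per 1 Mb bin
print(coarse)
```

Apply this per chromosome so that no coarse bin spans a chromosome boundary, and treat the shorter final bin with care, since its expected count is lower.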
3 — Apply robust normalization and GC-correction
Systematic biases (GC content, mappability) dominate raw coverage signals.
- Use per-bin GC content to model and correct coverage bias (loess or spline fitting).
- Normalize per-cell coverage to account for total read-depth differences (e.g., divide bin counts by per-cell median or apply median-of-ratios).
- Consider iterative normalization: remove global trends first, detect major CN segments, then re-normalize excluding those segments to avoid bias from large-scale aneuploidy.
Stitchcnv provides hooks for custom normalization; validate the chosen method by inspecting the residual GC trend and per-bin variance after correction.
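To make the two correction steps concrete, here is a minimal sketch. It deliberately swaps the loess or spline fit for a simpler stratified-median correction (bins are grouped by GC content and divided by their group median), followed by per-cell median normalization. The function names and the choice of ten GC strata are assumptions for illustration, not Stitchcnv's API.

```python
from statistics import median

def gc_correct(counts, gc, n_strata=10):
    """Stratified-median GC correction (a simple stand-in for a loess fit).

    Each bin is divided by the median count of bins in the same GC stratum
    and rescaled by the global median so the overall depth is preserved.
    """
    global_med = median(counts)
    strata = [min(int(g * n_strata), n_strata - 1) for g in gc]
    by_stratum = {}
    for s, c in zip(strata, counts):
        by_stratum.setdefault(s, []).append(c)
    stratum_med = {s: median(v) for s, v in by_stratum.items()}
    return [c * global_med / stratum_med[s] if stratum_med[s] > 0 else 0.0
            for s, c in zip(strata, counts)]

def normalize_cell(counts):
    """Divide by the per-cell median so a typical bin has ratio 1.0."""
    med = median(counts)
    if med <= 0:
        raise ValueError("median coverage is zero; cell is too sparse")
    return [c / med for c in counts]

# Toy cell: the GC-rich bins have roughly doubled coverage.
gc_content = [0.35, 0.35, 0.35, 0.55, 0.55, 0.55]
raw = [80, 100, 120, 160, 200, 240]
corrected = gc_correct(raw, gc_content)
ratios = normalize_cell(corrected)
print(corrected)
print(ratios)
```

After correction the two GC groups have the same median, which is the residual-trend check described above in its simplest form.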
4 — Filter low-quality bins and cells
Both noisy bins and low-quality cells will produce false-positive CNV calls.
- Exclude bins with extreme mappability issues, unusually high repeat content, or consistently low coverage across many cells.
- Remove cells with insufficient reads, extremely high variance, or abnormal coverage profiles (e.g., coverage concentrated in a few chromosomes).
- Flag cells with suspected doublets or multiplets; these can mimic complex CN patterns.
Practical thresholds vary by dataset; use exploratory plots (coverage histograms, mean-variance plots, PCA/UMAP of bin counts) to set sensible cutoffs.
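As an illustration of cell filtering, the sketch below keeps cells that pass a read-depth cutoff and a noise cutoff. The noise measure used here is the median absolute difference between adjacent bins, which is largely insensitive to real copy-number steps. The helper names and both thresholds are hypothetical and should be set from your own exploratory plots.

```python
from statistics import median

def adjacent_bin_noise(ratios):
    """Median absolute difference between neighbouring bins."""
    if len(ratios) < 2:
        return float("inf")
    return median(abs(b - a) for a, b in zip(ratios, ratios[1:]))

def filter_cells(cells, min_reads, max_noise):
    """Return the sorted names of cells passing both cutoffs.

    `cells` maps a cell name to (total_reads, normalized_bin_ratios).
    """
    return sorted(
        name for name, (total_reads, ratios) in cells.items()
        if total_reads >= min_reads and adjacent_bin_noise(ratios) <= max_noise
    )

# Toy data: c1 is clean with one real gain, c2 is too shallow, c3 is noisy.
cells = {
    "c1": (500_000, [1.0, 1.05, 0.95, 1.0, 1.5, 1.45, 1.55]),
    "c2": (20_000, [1.0, 1.0, 1.1, 0.9, 1.0, 1.0, 1.0]),
    "c3": (400_000, [0.4, 1.6, 0.7, 1.9, 0.3, 1.4, 0.8]),
}
kept = filter_cells(cells, min_reads=100_000, max_noise=0.3)
print(kept)
```

Note that the single true step in c1 does not push its noise score over the cutoff, because the median ignores one large difference among many small ones.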
5 — Use Stitchcnv’s denoising and smoothing thoughtfully
Stitchcnv’s core idea is to “stitch” adjacent bins and leverage cell populations to reduce noise.
- Adjust smoothing window sizes to match expected CNV lengths. Larger windows increase sensitivity to broad events but reduce focal resolution.
- Use population-guided stitching: combine information from similar cells (clusters) to improve signal-to-noise, but avoid smoothing across distinct subclones.
- Monitor for oversmoothing: artificially long segments or complete flattening of true focal events indicate that smoothing is too aggressive.
Example workflow: cluster cells roughly (by coverage profiles or PCA), perform stitched CN inference per cluster, then refine at single-cell level.
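The window-size trade-off is easy to see on a toy profile. The sketch below uses a plain running median as a stand-in smoother; it is not Stitchcnv's stitching algorithm, and `running_median` is a hypothetical helper.

```python
from statistics import median

def running_median(values, window):
    """Running median with an odd window, truncated at the edges."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(median(values[lo:hi]))
    return smoothed

# Toy profile: a two-bin focal gain in an otherwise flat region.
profile = [1.0] * 5 + [1.5, 1.5] + [1.0] * 5
narrow = running_median(profile, 3)
wide = running_median(profile, 5)
print(max(narrow))  # the focal gain survives a 3-bin window
print(max(wide))    # the focal gain is flattened by a 5-bin window
```

An event shorter than about half the window is erased, which is exactly the oversmoothing symptom to watch for.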
6 — Tune the segmentation parameters for your biology
Segmentation divides the genome into regions of uniform copy number. Parameter choices (penalties, min segment length, significance thresholds) strongly affect results.
- Increase penalty or minimum segment length to reduce fragmentation and false positives when data are noisy.
- Decrease penalty to detect smaller, high-confidence focal events when coverage supports it.
- Use simulated spike-ins or regions with known CNV status to calibrate segmentation hyperparameters.
Document parameter sets used for each analysis and report sensitivity analyses in downstream results.
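To show how the penalty and minimum segment length shape the result, here is a minimal sketch of penalized least-squares segmentation by dynamic programming (optimal partitioning). This is a generic textbook formulation chosen for illustration; it is not claimed to be the segmentation method Stitchcnv uses, and the parameter names are assumptions.

```python
def segment(values, penalty, min_len=1):
    """Penalized least-squares segmentation (optimal partitioning).

    Minimizes the sum of squared deviations from each segment's mean plus
    `penalty` per segment. Returns half-open (start, end) index pairs.
    """
    n = len(values)
    if n == 0:
        return []
    # Prefix sums give each segment's cost in constant time.
    s1, s2 = [0.0], [0.0]
    for v in values:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def cost(i, j):
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    best = [0.0] + [inf] * n   # best[j]: minimum cost of values[:j]
    prev = [0] * (n + 1)       # prev[j]: start of the last segment in that optimum
    for j in range(1, n + 1):
        for i in range(0, j - min_len + 1):
            if best[i] == inf:
                continue
            c = best[i] + cost(i, j) + penalty
            if c < best[j]:
                best[j], prev[j] = c, i
    if best[n] == inf:
        raise ValueError("no segmentation satisfies min_len")
    # Walk back through the stored segment starts.
    segments, j = [], n
    while j > 0:
        segments.append((prev[j], j))
        j = prev[j]
    return segments[::-1]

# Toy log-ratio profile with a four-bin gain in the middle.
profile = [0.0, 0.1, -0.1, 0.0, 1.0, 1.1, 0.9, 1.0, 0.0, 0.1, -0.1, 0.0]
low_penalty = segment(profile, penalty=0.1)
high_penalty = segment(profile, penalty=5.0)
print(low_penalty)
print(high_penalty)
```

With the low penalty the gain is recovered as its own segment; with the high penalty the whole profile collapses into a single segment. Running the same profile across a range of penalties is a simple form of the sensitivity analysis recommended above.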
7 — Leverage joint or hierarchical calling across cells
Many CNVs are cell-population events. Modeling cells jointly increases power.
- Run Stitchcnv in modes that infer consensus breakpoints across cells, then estimate per-cell copy-number states for those breakpoints.
- Hierarchical approaches: first call large-scale aneuploidy across all cells, then detect subclonal structure and refine calls within clusters.
- For tumor or mosaic samples, explicitly model subclonal fractions; per-cell posterior probabilities can help separate true subclonal events from noise.
Joint calling reduces false positives from single-cell noise and improves breakpoint localization.
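The second half of the joint approach, estimating per-cell states once breakpoints are shared, can be sketched as follows. This assumes the consensus breakpoints are already known, that ratios are normalized so 1.0 corresponds to the baseline ploidy, and that rounding the scaled segment median is an acceptable state estimate. `per_cell_states` is a hypothetical helper, not Stitchcnv's API.

```python
from statistics import median

def per_cell_states(ratios, breakpoints, ploidy=2):
    """Integer copy number per segment for one cell, given shared breakpoints.

    `breakpoints` are bin indices strictly inside the profile at which a new
    segment starts.
    """
    n = len(ratios)
    inner = sorted(set(breakpoints))
    if any(b <= 0 or b >= n for b in inner):
        raise ValueError("breakpoints must lie strictly inside the profile")
    edges = [0] + inner + [n]
    states = []
    for start, end in zip(edges, edges[1:]):
        segment_ratio = median(ratios[start:end])
        states.append(max(0, round(segment_ratio * ploidy)))
    return states

# Toy example: two cells share breakpoints at bins 3 and 6.
consensus_breakpoints = [3, 6]
cell_a = [1.0, 0.98, 1.03, 1.52, 1.47, 1.5, 1.0, 1.02]   # gain in the middle segment
cell_b = [1.02, 0.97, 1.0, 0.99, 1.04, 1.0, 0.52, 0.49]  # loss in the last segment
states_a = per_cell_states(cell_a, consensus_breakpoints)
states_b = per_cell_states(cell_b, consensus_breakpoints)
print(states_a)
print(states_b)
```

Because each cell only has to estimate one value per shared segment, a handful of noisy bins rarely changes the call, which is where the gain in power comes from.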
8 — Validate calls with orthogonal data when possible
Never rely solely on a single computational pipeline for important CNV findings.
- Use bulk whole-genome or exome sequencing, array CGH, or FISH to validate recurrent or clinically relevant events.
- For scRNA-derived CNV calls, cross-check with DNA-based single-cell CNV when available, or with expression signatures consistent with deletion/amplification.
- Validate breakpoints for focal events with split-read or read-pair evidence if sequencing depth allows.
Report validation rates and any discordant calls to characterize method performance.
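A validation rate needs a rule for when a call counts as confirmed. One common choice is reciprocal overlap between the call and an orthogonal event; the sketch below assumes that rule with a 50% threshold. The function names and threshold are illustrative, and for brevity the comparison ignores whether the event is a gain or a loss, which a real comparison should also check.

```python
def reciprocal_overlap(a, b):
    """Smaller of the two overlap fractions for intervals (chrom, start, end)."""
    if a[0] != b[0]:
        return 0.0
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[2] - a[1]), overlap / (b[2] - b[1]))

def validation_rate(calls, truth, min_ro=0.5):
    """Return (fraction of calls confirmed, list of discordant calls)."""
    discordant = [
        call for call in calls
        if not any(reciprocal_overlap(call, t) >= min_ro for t in truth)
    ]
    rate = (len(calls) - len(discordant)) / len(calls) if calls else 0.0
    return rate, discordant

# Toy intervals, for illustration only.
calls = [
    ("chr1", 1_000_000, 5_000_000),
    ("chr3", 10_000_000, 12_000_000),
    ("chr8", 0, 2_000_000),
]
orthogonal = [
    ("chr1", 1_200_000, 5_200_000),
    ("chr8", 1_500_000, 9_000_000),
]
rate, discordant = validation_rate(calls, orthogonal)
print(round(rate, 3))
print(discordant)
```

Here only the chr1 call is confirmed: the chr3 call has no counterpart, and the chr8 call overlaps its counterpart by too small a fraction.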
9 — Use quality metrics and post-call filtering
Produce and use quantitative metrics to decide which calls are reliable.
- Per-segment metrics: mean log-ratio, segment length, number of supporting bins, per-cell support fraction, and statistical confidence (e.g., p-values or posterior probabilities).
- Per-cell metrics: fraction of genome altered, number of segments, mean absolute deviation from baseline.
- Apply filters such as minimum log-ratio magnitude, minimum number of supporting bins, and minimum cell fraction for calls considered biologically meaningful.
Provide these metrics in output so downstream analysts can tune stringency for their application.
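Once those metrics are attached to each call, post-call filtering is a few comparisons. A minimal sketch, assuming each call is a dictionary with the hypothetical keys shown; the default thresholds are placeholders, not recommendations.

```python
def filter_calls(calls, min_abs_log2=0.3, min_bins=5, min_cell_fraction=0.05):
    """Keep calls that pass all three thresholds (illustrative defaults)."""
    return [
        call for call in calls
        if abs(call["log2_ratio"]) >= min_abs_log2
        and call["n_bins"] >= min_bins
        and call["cell_fraction"] >= min_cell_fraction
    ]

# Toy calls, for illustration only.
calls = [
    {"id": "gain_chr1q", "log2_ratio": 0.55, "n_bins": 40, "cell_fraction": 0.30},
    {"id": "loss_chr9p", "log2_ratio": -0.12, "n_bins": 25, "cell_fraction": 0.40},
    {"id": "gain_chr17_focal", "log2_ratio": 0.80, "n_bins": 2, "cell_fraction": 0.10},
    {"id": "loss_chr13", "log2_ratio": -0.60, "n_bins": 60, "cell_fraction": 0.02},
    {"id": "loss_chr10q", "log2_ratio": -0.90, "n_bins": 30, "cell_fraction": 0.20},
]
kept_ids = [call["id"] for call in filter_calls(calls)]
print(kept_ids)
```

Using the absolute log-ratio lets the same threshold apply to gains and losses. Keeping the unfiltered table alongside the filtered one lets others re-run the filter at a different stringency.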
10 — Document parameters, versions, and reproducible workflows
Reproducibility is essential for CNV analyses.
- Record Stitchcnv version, all parameter values, bin definitions, and normalization steps.
- Containerize the pipeline (Docker/Singularity) and save random seeds for stochastic steps.
- Share intermediate QC plots (GC bias, per-bin variance, segmentation overlays) and provide summary tables of calls with metrics.
A reproducible record makes it possible to re-evaluate calls as methods improve or new validations appear.
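A small machine-readable manifest written next to the results covers most of this list. The sketch below is one possible layout: the field names, the example parameters, and the version string are placeholders, and the tool version is passed in by hand rather than queried from any Stitchcnv API.

```python
import hashlib
import json
import platform
import random

def build_manifest(tool_version, params, bin_lines, seed):
    """Collect what is needed to re-run an analysis (hypothetical layout)."""
    # Hashing the bin definitions detects silent changes to the bin set.
    bins_digest = hashlib.sha256("\n".join(bin_lines).encode("utf-8")).hexdigest()
    return {
        "tool_version": tool_version,
        "python_version": platform.python_version(),
        "seed": seed,
        "params": params,
        "bins_sha256": bins_digest,
    }

seed = 12345
random.seed(seed)  # seed stochastic steps before running them

manifest = build_manifest(
    tool_version="x.y.z",  # placeholder: record the version you actually ran
    params={"bin_size": 200_000, "smoothing_window": 5, "penalty": 0.1},
    bin_lines=["chr1\t0\t200000", "chr1\t200000\t400000"],
    seed=seed,
)
manifest_text = json.dumps(manifest, indent=2, sort_keys=True)
print(manifest_text)
```

Writing `manifest_text` to a file in the output directory, and committing it with the QC plots, keeps the parameters and the results from drifting apart.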
Example recommended pipeline (concise)
- Align reads (BWA-MEM), mark duplicates, and discard reads with MAPQ < 30.
- Generate bin counts at 200 kb and 1 Mb.
- Remove problematic bins; filter cells by read depth and variance.
- GC-correct and normalize per cell.
- Cluster cells by coverage profile; run Stitchcnv stitching per cluster.
- Jointly segment using consensus breakpoints; estimate per-cell copy states.
- Apply post-call filters (min length, min log-ratio, min cell fraction).
- Validate top calls with bulk data or orthogonal assays.
- Save parameters, QC plots, and call metrics.
Final notes
- There’s no one-size-fits-all configuration: tune bin size, smoothing, and segmentation to your sample type and coverage.
- Combining population-level information with per-cell resolution is the most powerful approach for noisy single-cell CNV data.
- Keep validation and reproducibility central: CNV calls can drive biological or clinical conclusions, so transparency about confidence and methods is crucial.