Cell lines
Cell lines for Puerto Rican individuals HG00731, HG00732 and HG00733 have been previously described19.
HiFi PacBio sequencing
Isolated DNA was prepared for HiFi library preparation as described3. Briefly, DNA was sheared to an average size of about 15 kbp using Covaris gTUBE, and the quantity and size were checked using Qubit (Thermo Fisher) and FEMTO Pulse (Agilent) instruments. Fragments underwent library preparation using the Template Prep Kit v1 (PacBio) and then fractionation on a SageELF (Sage Science) instrument. After evaluating size, fractions averaging 11, 13 or 15 kbp were sequenced on a Sequel II (PacBio) instrument using Sequel II chemistry v1 or v2EA (Early Access beta). After sequencing, raw data were analyzed with SMRT Link 7.1 or 8.0 using the CCS protocol with a cutoff minimum of three passes and estimated accuracy of 0.99. In total, 18 SMRT Cell 8Ms were run for the Puerto Rican trio (HG00731, HG00732 and HG00733) for an average yield per sample of 91 Gbp of HiFi reads (Supplementary Table 7).
Strand-seq data analysis
All Strand-seq data in a FASTQ format were obtained from publicly available sources (‘Data availability’). At every step that requires alignment of short-read Strand-seq data to the squashed or clustered de novo assembly (Fig. 1), we used BWA-MEM (version 0.7.15-r1140) with the default parameters. In the next step, we filtered out all secondary and supplementary alignments using SAMtools (version 1.9). Subsequently, duplicate reads were marked using Sambamba (version 0.6.8). For every Strand-seq data analysis, we filtered out reads with mapping quality less than 10 as well as all duplicate reads.
Squashed genome assembly
Initially, squashed assemblies were constructed to produce a set of unphased contigs. We assembled HiFi reads using the Peregrine assembler.
All Peregrine (v0.1.5.5) assemblies were run using the following command:
pg_run.py asm {reads.fofn} 36 36 36 36 36 36 36 36 36 –with-consensus \
–shimmer-r 3 –best_n_ovlp 8 –output {assembly.dir}
Clustering contigs into chromosomal scaffolds
We used the R package SaaRclust15 to cluster de novo squashed assemblies into chromosomal scaffolds. SaaRclust takes as an input Strand-seq reads aligned to the squashed de novo assembly in a BAM format. Given the parameter settings, we discarded contigs shorter than 100 kbp from further analysis. Remaining contigs were partitioned into variable sized bins of 200,000 Strand-seq mappable positions. The counts of aligned reads per bin, separated by directionality (+/Crick or −/Watson), are used as an input for SaaRclust that divides contigs into a user-defined number of clusters (set to n = 100|150). Contigs genotyped as Watson–Crick (WC) in most cells were discarded. We further removed contigs that could be assigned to multiple clusters with probability P < 0.25 (Supplementary Fig. 23). Subsequently, SaaRclust merges clusters that share the same strand inheritance across multiple Strand-seq libraries. Shared strand inheritance is used to construct a graph of connected components (clusters), and the most connected subgraphs are reported, resulting in approximately 24 clusters—that is, one cluster should ideally be representative of one human chromosome. Next, we defined misoriented contigs within each cluster as those having opposing directionality in every Strand-seq library. We used hierarchical clustering to detect groups of minus-oriented and plus-oriented contigs. To synchronize contig directionality, we switch direction in one group of contigs from plus to minus or vice versa. Contigs synchronized by direction are then subjected to positional ordering within a cluster. We again use contig strand state coinheritance as a proxy to infer physical distance for each contig pair in every Strand-seq library. The resultant coinheritance matrix serves as input for the ‘Traveling Salesman Algorithm’ implemented in R package TSP (version 1.1–7)41 and attempts to order contigs based on strand state coinheritance. As the initial squashed assembly might contain assembly errors, SaaRclust is able to detect and correct such errors as bins of the same contig being assigned to different clusters (‘Chimeric contig’) or bins of the same contig that differ in directionality (‘Misoriented contig’). Lastly, we export clustered, reoriented and ordered contigs into a single FASTA file with a single FASTA record per cluster. A complete list of parameters used to run SaaRclust in this study is reported below:
SaaRclust command:
scaffoldDenovoAssembly(bamfolder = <>, outputfolder = <>, store.data.obj = TRUE, reuse.data.obj = TRUE, pairedEndReads = TRUE, bin.size = 200000, step.size = 200000, prob.th = 0.25, bin.method = ’dynamic’, min.contig.size = 100000, assembly.fasta = assembly.fasta, concat.fasta = TRUE, num.clusters = 100|150, remove.always.WC = TRUE, mask.regions = FALSE)
Variant calling
Clustered assemblies in full chromosomal scaffolds are then used for realignment of long PacBio reads. To call variants in HiFi reads, we use DeepVariant38 v0.9.0, which uses a deep neural network with a pre-trained model (–model_type=PACBIO). For the variant calling, HiFi reads were aligned using pbmm2 v1.1.0 (https://github.com/PacificBiosciences/pbmm2) with settings align –log-level DEBUG –preset CCS –min-length 5000 and filtered with samtools view -F 2308. After variant calling, we select only heterozygous SNVs using BCFtools v1.9.
For both PacBio CLR and ONT reads, we use the LongShot variant caller:
longshot –no_haps –force_overwrite –auto_max_cov
–bam {alignments} –ref {clustered_assm}
–region {contig} –sample_id {individual} –out {output}
Phasing chromosomal scaffolds
To create completely phased chromosomal scaffolds, we used a combination of Strand-seq and long-read phasing18. First, we realigned Strand-seq data on top of the clustered assemblies as stated previously. Only regions that inherit a Watson and Crick template strand from each parent are informative for phasing and are detected using breakpointR42. Haplotype-informative regions are then exported using the breakpointR function called ‘exportRegions’. Using the set of haplotype-informative regions together with positions of heterozygous SNVs, we ran StrandPhaseR18 to phase SNVs into whole-chromosome haplotypes. Such sparse haplotypes are then used as a haplotype backbone for long-read phasing using WhatsHap to increase density of phased SNVs.
breakpointR command (run and export of results):
breakpointr(inputfolder = <>, outputfolder = <>, windowsize = 500000, binMethod = ’size’, pairedEndReads = TRUE, pair2frgm = FALSE, min.mapq = 10, filtAlt = TRUE, background = 0.1, minReads = 50)
exportRegions(datapath = <>, file = <>, collapseInversions = TRUE, collapseRegionSize = 5000000, minRegionSize = 5000000, state = ’wc’)
StrandPhaseR command:
strandPhaseR(inputfolder = <>, positions = <SNVs.vcf>, WCregions = <hap.informtive.regions>, pairedEndReads = TRUE, min.mapq = 10, min.baseq = 20, num.iterations = 2, translateBases = TRUE, splitPhasedReads = TRUE)
WhatsHap command:
whatshap phase –chromosome {chromosome} –reference {reference.fasta} {input.vcf} {input.bam} {input.vcf_sparse_haplotypes}
Haplotagging PacBio reads
Having completely phased chromosomal scaffolds at sufficient SNV density allows us to split long PacBio reads into their respective haplotypes using WhatsHap. This step can be performed in two ways: splitting all reads across all clusters into two bins per haplotype or splitting reads into two bins per cluster and per haplotype. Both strategies consist of the same two steps: 1) labeling all reads with their respective haplotype (‘haplotagging’) and 2) splitting the input reads only by haplotype or by haplotype and cluster (‘haplosplitting’). The WhatsHap commands are identical in both cases except for limiting WhatsHap to a specific cluster during haplotagging and discarding reads from other clusters to separate the reads by haplotype and cluster:
whatshap haplotag [–regions {cluster}] –output {output.bam} –reference {input.fasta} –output-haplotag-list {output.tags}{input.vcf} {input.bam}
whatshap split [–discard-unknown-reads] –pigz –output-h1 {output.hap1} –output-h2 {output.hap2} –output-untagged {output.un} –read-lengths-histogram {output.hist} {input.fastq} {input.tags}
Creating haplotype-specific assemblies
After haplotagging and haplosplitting, the long HiFi reads separated by haplotype were then used to create fully haplotype-resolved assemblies. Our haplotagging and haplosplitting strategy enabled us to examine two types of haploid assemblies per input long-read dataset: the two haplotype-only assemblies (short: h1 and h2), plus the haploid assemblies created by using also all untagged reads—that is, all reads that could not be assigned to a haplotype (short: h1-un and h2-un). Hence, for each input read dataset, this amounts to four ‘genome-scale’ assemblies. We focused our analyses on the read sets h1-un (H1) and h2-un (H2). Final phased assemblies were created using parameters stated in the ‘Squashed genome assembly’ section.
SD analysis
SDs were defined as resolved or unresolved based on their alignments to SDs defined in GRCh38 (http://genome.ucsc.edu/cgi-bin/hgTables?db=hg38&hgta_group=rep&hgta_track=genomicSuperDups&hgta_table=genomicSuperDups&hgta_doSchema=describe+table+schema) using minimap2 with the following parameters: —secondary=no -a –eqx -Y -x asm20 -m 10000 -z 10000,50 -r 50000 –end-bonus=100 -O 5,56 -E 4,1 -B 5 (ref. 33). Alignments that extended a minimum number of base pairs beyond the annotated SDs were considered to be resolved. The percent of resolved SDs was determined for minimum extension varying from 10,000 to 50,000 bp, and the average was reported. This analysis is adapted from Vollger et al.34 (https://github.com/mrvollger/segdupplots).
SD collapse analysis
Collapses were identified using the methods described in Vollger et al.34. In brief, the method identifies regions in the assemblies that are at least 15 kbp in length and have read coverage exceeding the mean coverage plus three standard deviations. Additionally, collapses with more than 75% common repeat elements (identified with RepeatMasker) or TRs (identified with Tandem Repeats Finder43) are excluded.
BAC clone insert sequencing
BAC clones from the VMRC62 clone library were selected from random regions of the genome not intersecting with an SD (n = 77). DNA from positive clones were isolated, screened for genome location and prepared for long-insert PacBio sequencing as previously described (Segmental Duplication Assembler (SDA))34. Libraries were sequenced on the PacBio RS II with P6-C4 chemistry (17 clones) or the PacBio Sequel II with Sequel II 2.0 chemistry (S/P4.1-C2/5.0-8 M; 60 clones). We performed de novo assembly of pooled BAC inserts using Canu v1.5 (Koren et al.26) for the 17 PacBio RS II BACs and using the PacBio SMRT Link v8.0 Microbial assembly pipeline (Falcon + Raptor, https://www.pacb.com/support/software-downloads/) for the 60 Sequel II BACs. After assembly, we removed vector sequence pCCBAC1, re-stitched the insert and then polished with Quiver or Arrow. Canu is specifically designed for assembly with long error-prone reads, whereas Quiver/Arrow is a multi-read consensus algorithm that uses the raw pulse and base-call information generated during SMRT (single-molecule, real-time) sequencing for error correction. We reviewed PacBio assemblies for misassembly by visualizing the read depth of PacBio reads in Parasight (http://eichlerlab.gs.washington.edu/jeff/parasight/index.html), using coverage summaries generated during the resequencing protocol.
Assembly polishing and error correction
Assembly misjoints are visible using Strand-seq as recurrent changes in strand state inheritance along a single contig. Strand state changes can result from a double-strand break (DSB) repaired by homologous recombination during DNA replication, causing an SCE1. DSBs are random independent events that occur naturally during a cell’s lifespan and, therefore, are unlikely to occur at the same position in multiple single cells2. Instead, a strand state change at the same genomic position in a population of cells is indicative of a different process other than DSB (such as a genomic SV or genome misassembly)13,44,45. Observing a complete switch from WW (Watson–Watson) to CC (Crick–Crick) strand state or vice versa at about 50% frequency is observed when a part of the contig is being misoriented (Supplementary Fig. 6). All detected misassemblies in the final phased assemblies (Supplementary Table 1) were corrected using SaaRclust using the following parameters:
scaffoldDenovoAssembly(bamfolder = <>, outputfolder = <>, store.data.obj = TRUE, reuse.data.obj = TRUE, pairedEndReads = TRUE, bin.size = 200000, step.size = 200000, prob.th = 0.9, bin.method = ’dynamic’, ord.method = ’greedy’, min.contig.size = 100000, min.region.to.order = 500000, assembly.fasta = assembly.fasta, concat.fasta = FALSE, num.clusters = 100|150, remove.always.WC = TRUE, mask.regions = FALSE)
Common assembly breaks
To detect recurrent breaks in our assemblies, we searched for assembly gaps present in at least one phased assembly completed by Flye (for CLR PacBio reads) or Peregrine (for HiFi PacBio reads). For this, we mapped all haplotype-specific contigs to GRCh38 using minimap2 using the same parameters as in the SD analysis method. We defined an assembly break as a gap between two subsequent contigs. We searched for reoccurring assembly breaks in 500-kbp non-overlapping bins and filtered out contigs smaller than 100 kbp. Each assembly break was defined as a range between the first and the last breakpoint found in any given genomic bin and was annotated based on the overlap with known SDs, gaps, centromeres and SV callsets19, allowing overlaps within 10-kbp distance from the breakpoint boundaries.
Base accuracy
Phred-like QV calculations were made by aligning the final assemblies to 77 sequenced and assembled BACs from VMRC62 falling within unique regions of the genome (>10 kbp away from the closest SD) where at least 95% of the BAC sequence was aligned. The following formula was used to calculate the QV, and insertions and deletions of size N were counted as N errors: QV = –10log10(1 – (percent identity/100)).
Each assembly was polished twice with Racon28 using the haplotype-partitioned HiFi FASTQs. The alignment and polishing steps were run with the following commands:
minimap2 -ax map-pb –eqx -m 5000 -t {threads} –secondary=no {ref} {fastq} | samtools view -F 1796 – > {sam}
racon {fastq} {sam} {ref} -u -t {threads} > {output.fasta}
The HG00733 ONT assemblies were polished with MarginPolish/HELEN32 (git commit 4a18ade) following developer recommendations. The alignments were created with minimap2 v2.17 and used for polishing as follows:
minimap2 -ax map-ont -t {threads} {assembly} {reads} |
samtools sort -@ {threads} |
samtools view -hb -F 0×104>{output}
marginpolish {alignments} {assembly} MP_r941_guppy344_human.json
–threads {threads} –produceFeatures –outputBase {output}
helen polish –image_dir {mp_out} –model_path HELEN_r941_guppy344_human.pkl
–threads {threads} –output_dir {output} –output_prefix HELEN
QV estimates based on variant callsets lifted back to the human reference hg38 were derived as follows: Genome in a Bottle46 high-confidence region sets (release v3.3.2) for individuals HG001, HG002 and HG005 were downloaded, and the intersection of all regions (BEDTools v2.29.0 ‘multiinter’47) was used as proxy for high-confidence regions in other samples (covering ~2.17 Gbp). For all samples, variant callsets based on Illumina short-read alignments against the respective haploid assembly were generated using BWA 0.7.17 and FreeBayes v1.3.1 as follows:
bwa mem -t {threads} -R {read_group} {index_prefix} {reads_mate1} {reads_mate2} | samtools view -u -F 3840 – |
samtools sort -l 6 {output_bam}
The BAM files were sorted with SAMtools v1.9 and duplicates marked with Sambamba v0.6.6 ‘markdup’. The variant calls with FreeBayes were generated as follows:
freebayes –use-best-n-alleles 4 –skip-coverage {cov_limit} –region {assembly_contig} -f {assembly_fasta}
–bam {bam_child} –bam {bam_parent1} –bam {bam_parent2}
Options ‘–use-best-n-alleles’ and ‘–skip-coverage’ were set following developer recommendations to increase processing speed. Variants were further filtered with BCFtools v1.9 for quality and read depth: ‘QUAL >=10 && INFO/DP<MEAN+3*STDDEV’. Variants were converted into BED format using vcf2bed v2.4.37 (ref. 48) with parameters ‘–snvs’, ‘–insertions’ and ‘–deletions’. The alignment information for lifting variants from the haploid assemblies to the human hg38 reference was generated with minimap v2.17-r941, and the liftover was realized with paftools (part of the minimap package):
minimap2 -t {threads} -cx asm20 –cs –secondary=no -Y -m 10000 -z 10000,50 -r 50000 –end-bonus=100 -O 5,56 -E 4,1 -B 5 ’ hg38.fasta {input_hap_assembly} > {hap-assm}_to_hg38.paf
paftools.js liftover -1 10000 {input_paf} {input_bed} > {output.hg38.bed}
The lifted variants were intersected with our custom set of high-confidence regions using BEDTools ‘intersect’. The total number of base pairs in homozygous variants was then computed as the sum over the length (as reported by FreeBayes as LEN) of all variants located in the high-confidence regions. Because not all variants could be lifted from the haploid to the hg38 reference assembly, we cannot know whether these variants would fall into the ‘high-confidence’ category. We thus computed a second, more conservative, QV estimate counting also all homozygous calls as error that were not lifted to the hg38 reference.
Hi-C based scaffolding and validation
To independently evaluate the accuracy of our scaffolds, we used proximity ligation data for NA12878 and HG00733 (‘Data availability’). By aligning Hi-C data to our scaffolds produced by SaaRclust, we can visually confirm that long-range Hi-C interactions are limited to each cluster reported by SaaRclust.
In addition, we attempted to reproduce Hi-C-based scaffolds presented by Garg et al.12 for NA12878 using 3D-DNA49. Input to this pipeline was created with Juicer50 and an Arima Genomics Hi-C script, which are both publicly available.
Arima script
generate_site_positions_Arima.py -i {squashed_asm} -e {cut-Sequence} -o {cut-sites.txt}
Juicer
juicer.sh -g {genome_id} -s {enzyme} -z {squashed_asm} -r -p {chrom.sizes} -y {cut-sites.txt}
3D-DNA
run-asm-pipeline.sh {squashed_asm} {juicer_merged_no_dups}
SV, indel and SNV detection
Methods for SV, indel and SNV calling are similar to previous HiFi assembly work33 but were adapted for phased assemblies. Variants were called against the GRCh38 primary assembly (that is, no alternate, patch or decoy sequences), which includes chromosomes and unplaced/unlocalized contigs. Mapping was performed with minimap2 2.17 (ref. 51) using parameters –secondary=no -a -t 20 –eqx -Y -x asm20 -m 10000 -z 10000,50 -r 50000 –end-bonus=100 -O 5,56 -E 4,1 -B 5, as described previously33. Alignments were then sorted with SAMtools v1.9 (ref. 52).
To obtain variant calls, alignments were processed with PrintGaps.py, which was derived in the SMRT-SV v2 pipeline (https://github.com/EichlerLab/smrtsv2)53,54, to parse CIGAR string operations to make variant calls30.
Alignment records from assemblies often overlap, which would produce duplicate variant calls with possible different representations (fragmented or shifted). For each haplotype, we constructed a tiling path covering GRCh38 once and traversing loci most centrally located within alignment records. Variants within the path were chosen, and variants outside the tiling path (that is, potential duplicates) were dropped from further analysis.
After obtaining a callset for H1 and H2 independently, we then merged the two haplotypes into a single callset. For homozygous SV and indel calls, an H2 variant must intersect an H1 variant by 1) 50% reciprocal overlap (RO) or 2) within 200 bp and a 50% reciprocal size overlap (RO if variants were shifted to maximally intersect). For homozygous SNV calls, the position and alternate base must match exactly. The result is a unified phased callset containing homozygous and heterozygous variants. Finally, we filtered out variants in pericentromeric loci where callsets are difficult to reproduce54 and loci where we found a collapse in the assembly of either haplotype.
We intersected RefSeq annotations from the UCSC RefSeq track and evaluated the effect on genes noting frameshift SVs and indels in coding regions by quantifying the number of bases affected per variant on genic regions. If an insertion or deletion changes coding sequence for any isoform of a gene by a non-modulo-3 number of bases, we flag the gene as likely disrupted.
Variants falling within TRs and SDs were also annotated using UCSC hg38 tracks. For TR and SD BED files, we merged records allowing regions within 200 bp to overlap with BEDTools47. SVs and indels that were at least 50% contained within an SD or TR region were annotated as SD or TR. For RefSeq analysis, we classified genes as contained within TR or SD by intersecting exons with the collapsed TR and SD regions allowing any overlap.
Phasing accuracy estimates
To evaluate phasing accuracy, we determined SNVs in our phased assemblies based on their alignments to GRCh38. This procedure is described in the ‘SV, indel and SNV detection’ section in the Methods. We evaluate phasing accuracy of our assemblies in comparison to trio-based phasing for HG00733 (ref. 19) and NA12878 (ref. 46). In all calculations, we compare only SNV positions that are shared between our SNV calls and those from trio-based phasing. To count the number of switch errors between our phased assemblies and trio-based phasing, we compare all neighboring pairs of SNVs along each haplotype and recode them into a string of 0s and 1s depending on whether the neighboring alleles are the same (0) or not (1). The absolute number of differences in such binary strings is counted between our haplotypes and the trio-based haplotypes (per chromosome). The switch error rate is reported as a fraction of counted differences of the total number of compared SNVs (per haplotype). Similarly, we calculate the Hamming distance as the absolute number of differences between our SNVs and trio-based phasing (per chromosome) and report it as a fraction of the total number of differences to the total number of compared SNVs (per haplotype).
MHC analysis
We extracted the MHC, defined as chr6:28000000–34000000, by mapping each haplotype sequence against GRCh38 and extracting any primary or supplementary alignments to this region. We created a dotplot for each haplotype’s MHC region using Dot from DNAnexus (https://github.com/dnanexus/dot) (Supplementary Fig. 11). We created phased VCFs for both the CCS and Shasta assemblies using the two haplotype files as input to Dipcall (https://github.com/lh3/dipcall). Then, we compared the phasing between the haplotype files using the compare module within WhatsHap. This results in a switch error rate of 0.48% (six sites) and a Hamming error rate of 0.28% (four sites) from 1,433 common heterozygous sites between the VCFs.
Detection of loss of heterozygosity regions
To localize regions of decreased heterozygosity, we calculated the SNV diversity as a fraction of heterozygous variants between H1 and H2 within 200-kbp-long genomic bins (sliding by 10 kbp). In the next step, we rescaled SNV diversity values to a vector of 0s and 1s by setting values <25th quantile to 0 and those >25th quantile to 1. Then, we used R package fastseg55 to find change points in previously created vector of 0s and 1s while reporting segments of minimal length of 200 (diversity values per bins). In turn, we genotyped each segment based on a median segment value. Segments with median value ≤0.05 were genotyped as ‘LOH’ (loss of heterozygosity), whereas the rest were genotyped as ‘NORM’ (normal level of heterozygosity).
Detection of misassembled contigs
To detect assembly errors in squashed or phased assemblies, we used our SaaRclust package. First, we aligned Strand-seq reads to an assembly in question and then ran SaaRclust with the following parameters:
scaffoldDenovoAssembly(bamfolder = <>, outputfolder = <>, store.data.obj = TRUE, reuse.data.obj = TRUE, pairedEndReads = TRUE, bin.size = 200000, step.size = 200000, prob.th=0.25, bin.method = ’fixed’, ord.method = ’greedy’, min.contig.size = 100000, assembly.fasta = assembly.fasta, concat.fasta = FALSE, num.clusters = 100, remove.always.WC = TRUE, mask.regions = FALSE, desired.num.clusters = 24)
The list of misassembled contigs predicted assembly errors is reported by SaaRclust in RData object with prefix ‘putativeAsmErrors_*’.
Likely disrupted genes
Using RefSeq intersect counts, we found all genes with at least one non-modulo-3 insertion or deletion within the coding region of any isoform (that is, frameshift). We filtered out any genes not fully contained within a consensus region of the two haplotypes, which we defined as regions where both H1 and H2 had exactly one aligned contig. If a gene had multiple non-modulo-3 events, whether in the same isoform or not, the gene was counted once.
Variant comparisons
We compared variants to previously published callsets by intersecting them with the same RO/Size-RO strategy used to merge haplotypes. For HGSVC comparisons, we also excluded variant calls on unplaced contigs, unlocalized contigs and chrY of the reference (that is, chr1-22,X), which were not reported by the HGSVC study. To quantify the number of missed variants proximal to another, we took variants that failed to intersect an HGSVC variant and found the distance to the nearest variant of the same type (INS versus INS and DEL versus DEL).
Robust and reproducible implementation
The basic workflow of our study is implemented in a reproducible and scalable Snakemake56 pipeline that has been successfully tested in compute environments ranging from single servers to high-performance cluster setups (‘Code availability’). Major tasks in the pipeline, such as read alignment or assembly, have been designed as self-contained ‘start-to-finish’ jobs, automating even trivial steps, such as downloading the publicly available datasets used in this study. Owing to the considerable size of the input data, we strongly recommend deploying this pipeline only on compute infrastructure tailored to resource-intensive and highly parallel workloads.
Reporting Summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.