Resources

Computational Methods

A selection of computational methods that we have either developed or used in our research. For a more comprehensive review of methods, see Tanudisastro et al. (2024).

Short-read genotyping

Short-read novel locus detection

Long-read genotyping

Disease Repeat Catalogs

Genome Repeat Catalogs

Databases

Why do repeat locus definitions differ between resources?

There are often multiple ways to define a given repeat locus. For example, defining a locus as a stretch of perfectly repeating motifs tends to result in a narrower locus than strategies allowing for interruptions. In coding regions, the locus boundary might be chosen to align with the reading frame.

More than one of these definitions can be “correct”, but the choice should take into account how the data will be used downstream. In particular, the locus definition can affect genotyping accuracy. It also changes the expected allele size, which in turn may affect how allelic thresholds are defined and determined.
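As a rough illustration of how the locus definition changes the expected allele size, the sketch below computes the reference allele size in repeat units for a narrow and a broad definition of the same locus. The function name and all coordinates are hypothetical and do not correspond to any real locus; intervals are assumed to be 0-based and half-open, as in standard BED.

```python
def reference_repeat_units(start: int, end: int, motif: str) -> float:
    """Reference allele size in repeat units for a 0-based, half-open interval."""
    return (end - start) / len(motif)


# Hypothetical coordinates for two definitions of the same CAG locus.
narrow_units = reference_repeat_units(1000, 1030, "CAG")  # perfect repeats only
broad_units = reference_repeat_units(994, 1042, "CAG")    # includes interruptions

print(narrow_units, broad_units)
```

Under these assumed coordinates the broad definition reports a larger reference allele than the narrow one, so a threshold expressed in repeat units under one definition may not transfer directly to the other.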

The loci in STRchive were defined to be broader, allowing for some interruptions, and with consideration to the biological context and clinical utility of the locus. This strategy increases the chance that the locus will overlap with a relevant variant call when STRchive is used to annotate a VCF file. It is also the preferred approach to defining loci for improved genotyping accuracy with TRGT. In contrast, ExpansionHunter tends to perform better with narrower loci that exclude repeat interruptions. For this reason, the repeat definitions used in gnomAD, which was genotyped with ExpansionHunter, tend to be narrower than those used in STRchive.

Defining loci in STRchive

We followed a consistent protocol to arrive at the coordinates within our repeat locus definitions. We began with the region defined in the literature and used TRF annotations or similar to identify or refine the coordinates if needed. We examined the reference genome at these coordinates to find the longest exact stretch of the motif (allowing for “N” bases if relevant, e.g. polyalanine loci) and included partial motifs, i.e., the total length does not need to be a multiple of the motif length. We included interrupted motifs if described in the literature or where they occurred between stretches of the same motif. We defined the coordinates in the hg38 reference genome first and then lifted over to the other genomes. We conducted a manual review to ensure the coordinates in all genome builds met the above criteria, adjusting as necessary. To generate TRGT repeat definitions, we further extended the above coordinates to include flanking motifs.
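The step of finding the longest exact stretch of the motif can be sketched as follows. This is a minimal illustration, not the procedure used to build STRchive: the function names are hypothetical, the run is assumed to start in frame with the motif as written, a partial motif is allowed only at the end, and an “N” in the motif is treated as matching any base.

```python
from typing import Optional, Tuple


def base_matches(base: str, motif_base: str) -> bool:
    """Exact match, except that 'N' in the motif matches any base (assumption)."""
    return motif_base == "N" or base == motif_base


def longest_motif_stretch(seq: str, motif: str) -> Optional[Tuple[int, int]]:
    """Return the 0-based, half-open (start, end) of the longest in-frame run
    of `motif` in `seq`, or None if no full copy of the motif is present.
    The run may end with a partial motif, so its length need not be a
    multiple of the motif length."""
    seq, motif = seq.upper(), motif.upper()
    k = len(motif)
    best = None
    for start in range(len(seq)):
        length = 0
        # Extend base by base, cycling through the motif.
        while (start + length < len(seq)
               and base_matches(seq[start + length], motif[length % k])):
            length += 1
        # Require at least one full motif copy; keep the first longest run.
        if length >= k and (best is None or length > best[1] - best[0]):
            best = (start, start + length)
    return best


print(longest_motif_stretch("TTACAGCAGCAGCATT", "CAG"))  # run ends in partial "CA"
print(longest_motif_stretch("AAGCTGCAGCGGCCTT", "GCN"))  # polyalanine-style motif
```

The scan is quadratic in the worst case, which is acceptable for the short reference windows around a single locus. Interrupted motifs and flanking repeats are not handled here and would still need the literature-guided review described above.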

Additional considerations for specific catalogs:

TRGT: Coordinates were extended to include any flanking repeats. All relevant motifs are given where possible.
Atarva: Flanking repeats were included on separate lines. Only one motif is given per locus as the tool infers other motifs automatically.
LongTR: Uses 1-based coordinates (i.e. non-standard BED format). All relevant motifs are given where possible.
Straglr: Uses the version of this format required for wf-human-variation. Some loci are omitted if they cannot match the format, e.g. where only a motif change or a contraction is pathogenic.
Stranger: Coordinates and names are set to match the Straglr catalog so that they can be used together. Uses the version of this format required for wf-human-variation.
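Because the LongTR catalog uses 1-based coordinates while standard BED is 0-based and half-open, mixing catalogs can introduce off-by-one errors. The sketch below shows the usual conversion. It assumes the 1-based interval is fully closed (both start and end inclusive), which should be checked against the tool's documentation; the function names and example coordinates are hypothetical.

```python
from typing import Tuple


def bed_to_one_based(start: int, end: int) -> Tuple[int, int]:
    """Convert a 0-based, half-open BED interval to a 1-based, closed interval."""
    return start + 1, end


def one_based_to_bed(start: int, end: int) -> Tuple[int, int]:
    """Convert a 1-based, closed interval back to 0-based, half-open BED."""
    return start - 1, end


# Hypothetical locus: 30 bp in both representations.
bed_interval = (100, 130)
one_based = bed_to_one_based(*bed_interval)
print(one_based)                     # start shifts by one, end is unchanged
print(one_based_to_bed(*one_based))  # round-trips to the original interval
```

Only the start coordinate changes; the end is numerically the same in both conventions, so the locus length is preserved.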

Blueprint for STR evaluation/interpretation

With current resources relevant to each point.

Blueprint