Variant mapping

MaveDB uses the variant descriptions provided in the score and count data tables to map variants to genomic coordinates on the human reference genome. This mapping allows MaveDB to integrate with external resources such as ClinGen, ClinVar, and gnomAD, powers the MaveMD variant search, and enables score calibrations for clinical interpretation.

Note

Variant mapping is only performed for datasets with a human target sequence.

This process was developed by Arbesfeld et al. (2025), whose publication describes it in detail. The sections below summarize the method and its key considerations.

Why mapping is needed

Most MAVE experiments describe variants relative to an assay-specific target sequence uploaded by the data submitter. This target sequence is often not identical to a human reference sequence — it may be codon-optimized for expression in a model organism, contain synthetic elements such as minigene constructs, or represent only a portion of the full gene. Additionally, protein-level variants from cDNA-based assays may span exon boundaries when represented at the genomic level.

These differences mean that MAVE variant descriptions cannot be directly compared to variants reported by clinical sequencing pipelines or described in databases like ClinVar. Mapping resolves this by translating each MAVE variant from the experimental target sequence coordinate system to standard human reference coordinates (GRCh38), producing both pre-mapped (target-relative) and post-mapped (reference-relative) representations.

Mapping process

Variants are mapped to genomic coordinates when a score set is uploaded. The mapping process has four steps:

flowchart LR
    A["Target sequence<br>alignment<br>(BLAT → GRCh38)"] --> B["Transcript<br>selection<br>(MANE Select)"]
    B --> C["Variant<br>translation<br>(MAVE-HGVS → HGVS)"]
    C --> D["VRS<br>translation<br>(HGVS → VRS)"]

Target sequence alignment: The target sequence provided with the score set is aligned to the GRCh38 human genome assembly using BLAT. This determines the genomic location of the target sequence and identifies candidate transcripts. For targets specified at the amino acid level, BLAT aligns to the protein reference space; for nucleotide targets, alignment is performed directly against the genome.
Transcript selection: From the candidate alignments, MaveDB selects a representative RefSeq transcript. The selection prioritizes MANE Select transcripts (the community standard for reporting clinical variants), followed by RefSeq Select transcripts, and then the longest matching transcript. An offset is computed to determine the precise location of the MAVE sequence within the selected reference sequence.
Variant translation: Each variant in the score and count data tables is converted from MAVE-HGVS format to standard HGVS format and translated with respect to the selected transcript. This step accounts for any offset between the target sequence and the transcript.
VRS translation: The HGVS variant descriptions are converted to GA4GH VRS format using the VRS-Python library. Each variant receives a unique, computable VRS digest identifier for both its pre-mapped and post-mapped forms, enabling precise identification and data provenance.
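The transcript-selection priority and the offset adjustment described above can be sketched in a few lines of Python. This is a minimal illustration, not MaveDB's implementation: the `Transcript` class, its field names, the placeholder accessions, and the convention that the offset counts reference positions preceding the start of the target are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    """Hypothetical candidate transcript record (field names are assumptions)."""
    accession: str
    length: int
    mane_select: bool = False
    refseq_select: bool = False


def select_transcript(candidates):
    """Prefer MANE Select, then RefSeq Select, then the longest transcript."""
    if not candidates:
        raise ValueError("no candidate transcripts to choose from")
    # Tuples compare element by element and True sorts above False, so the
    # key encodes the priority order directly.
    return max(candidates, key=lambda t: (t.mane_select, t.refseq_select, t.length))


def to_transcript_position(target_position: int, offset: int) -> int:
    """Shift a 1-based target position into transcript coordinates.

    Assumes `offset` is the number of reference positions that precede
    the first position of the target sequence.
    """
    return target_position + offset


# Placeholder accessions, not real RefSeq identifiers.
candidates = [
    Transcript("tx_long", 9000),
    Transcript("tx_refseq", 3000, refseq_select=True),
    Transcript("tx_mane", 2500, mane_select=True),
]
chosen = select_transcript(candidates)
print(chosen.accession, to_transcript_position(5, offset=100))
```

Encoding the priority as a sort key keeps the fallback order in one place, so a missing MANE Select transcript falls through to RefSeq Select and then to length without extra branching.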

Review mapping results

Some variants cannot be mapped, for reasons such as ambiguous target sequences, complex variant types, or discrepancies between the target and the reference genome. MaveDB logs these instances and provides feedback to data contributors to help resolve mapping issues.

Although some mapping failures represent true limitations of the data, others can be addressed by correcting errors in the submitted variants or target sequences.

Data contributors should review the mapping results after uploading a score set to confirm that variants were mapped accurately. Contributors can view mapping results on the score set page and download a report of mapped and unmapped variants.

Mapping failures do not prevent a dataset from being published in MaveDB, but features such as variant search, links to some external resources, and inclusion in MaveMD require mapped variants.

Concordance and discordance

Each mapped variant is represented as a pair: the pre-mapped form (relative to the MAVE target sequence) and the post-mapped form (relative to the human reference). When the reference alleles at both positions are identical, the mapping is concordant. When they differ — for example, because the target sequence was codon-optimized or contained synthetic elements — the mapping is discordant.

Discordant mappings are not necessarily errors. They arise naturally from legitimate differences between experimental and reference sequences, such as:

Codon optimization — Synonymous nucleotide changes introduced to optimize expression in the assay system.
Non-homologous sequence content — Synthetic elements like minigene constructs that do not align to the human genome.
Exon boundary spanning — Protein-level changes that, when mapped to the genome, correspond to nucleotide changes across exon-intron boundaries.

Both concordant and discordant mappings are preserved in MaveDB, and the pre-mapped representation is always retained so that the original experimental context is available.
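The concordance check described in this section amounts to comparing the reference allele at the pre-mapped position with the reference allele at the post-mapped position. The sketch below assumes both alleles are already available as plain sequence strings; the function name and the example codons are illustrative and are not taken from MaveDB.

```python
def classify_mapping(pre_mapped_ref: str, post_mapped_ref: str) -> str:
    """Label a pre-/post-mapped pair by comparing its reference alleles."""
    if pre_mapped_ref.upper() == post_mapped_ref.upper():
        return "concordant"
    return "discordant"


# A codon-optimized target: CTG and TTA both encode leucine, so the protein
# is unchanged even though the nucleotide reference alleles differ.
print(classify_mapping("CTG", "TTA"))  # nucleotide-level comparison
print(classify_mapping("L", "L"))      # protein-level comparison
```

In this illustration the same site is discordant at the nucleotide level and concordant at the protein level, which is why a discordant label on its own does not indicate an error.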

Data provenance

An important design goal of the mapping process is preserving data provenance. Each mapped variant retains both its pre-mapped and post-mapped VRS representations, each with a unique digest identifier. This ensures that:

The original experimental sequence context is never lost.
Downstream users can assess the degree of concordance between the target and reference sequences.
Clinical users can verify that the experimental evidence is appropriate for their specific interpretation context.

This is particularly important for clinical applications, where understanding the relationship between the assay system and the human reference is essential for appropriately applying functional evidence.
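The digest identifiers mentioned above are GA4GH computed identifiers, which are built on the `sha512t24u` truncated digest: SHA-512, truncated to the first 24 bytes, then base64url-encoded. The sketch below shows only that digest step. The canonical serialization of a VRS object that is hashed in practice is defined by the VRS specification and handled by VRS-Python, so the byte strings used here are placeholders, and `allele_identifier` is a hypothetical helper name.

```python
import base64
import hashlib


def sha512t24u(blob: bytes) -> str:
    """GA4GH truncated digest: SHA-512, first 24 bytes, base64url-encoded."""
    digest = hashlib.sha512(blob).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


def allele_identifier(serialized_allele: bytes) -> str:
    """Prefix the digest with the GA4GH namespace and Allele type code.

    `serialized_allele` stands in for the canonical serialization that
    VRS-Python produces; producing it is out of scope for this sketch.
    """
    return "ga4gh:VA." + sha512t24u(serialized_allele)


# Placeholder inputs: different content always yields different identifiers,
# which is what lets pre-mapped and post-mapped forms be told apart.
pre_id = allele_identifier(b"placeholder pre-mapped allele")
post_id = allele_identifier(b"placeholder post-mapped allele")
print(pre_id)
print(post_id)
```

Because the identifier is derived from the content itself, anyone holding the same VRS object can recompute the identifier and confirm it matches.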

Downstream integrations

Mapped variants underpin MaveDB's integrations with external data sources. They enable:

ClinGen Allele Registry — Registration of variants and assignment of ClinGen Allele IDs (CAIDs), which serve as universal identifiers across clinical genomics resources.
ClinGen Linked Data Hub — Submission of MAVE functional evidence linked to CAIDs, making it available alongside other variant curation data.
ClinVar — Cross-referencing of mapped variants with clinical significance classifications.
gnomAD — Retrieval of population allele frequency data for mapped variants.
Ensembl VEP — Annotation of mapped variants with predicted functional consequences displayed on score set pages.

These integrations also enable the MaveMD clinical interface, including features like variant search and score calibrations.

Programmatic access

Mapped variants are available through the MaveDB API via the /mapped-variants endpoint, which returns pre-mapped and post-mapped VRS objects as JSON. Mapped variant files are also downloadable from individual score set pages in VRS JSON format.
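A client for this endpoint might look like the sketch below, which uses only the Python standard library. Several details are assumptions rather than documented facts: the base URL, the idea that a URN is appended after `/mapped-variants`, and the `pre_mapped` and `post_mapped` field names. Check the MaveDB API documentation for the actual path layout and response schema before relying on any of them.

```python
import json
import urllib.request
from urllib.parse import quote

# Assumed base URL; confirm against the MaveDB API documentation.
API_BASE = "https://api.mavedb.org/api/v1"


def mapped_variants_url(urn: str, base: str = API_BASE) -> str:
    """Build a request URL, assuming the URN follows /mapped-variants."""
    return f"{base}/mapped-variants/{quote(urn, safe=':')}"


def fetch_json(url: str, timeout: float = 30.0):
    """Fetch a URL and decode the JSON response body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def split_pre_post(record: dict) -> tuple:
    """Return (pre-mapped, post-mapped) VRS objects from one record.

    The key names are hypothetical; a missing key yields None, which is
    how this sketch represents a variant that could not be mapped.
    """
    return record.get("pre_mapped"), record.get("post_mapped")


# Hand-written stand-in for one response record, not real API output.
SAMPLE = '{"pre_mapped": {"type": "Allele"}}'
pre, post = split_pre_post(json.loads(SAMPLE))
print(mapped_variants_url("urn:mavedb:00000001-a-1"))
print(pre, post)
```

Keeping URL construction and record parsing separate from the network call means the parsing logic can be checked against saved JSON without contacting the API.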