# Practical Graphical Pangenomics

### tools and workflows based on genome variation graphs

## Pangenomic methods

Standard approaches to genome inference and analysis relate sequences to a single linear reference genome.
This is efficient but has a fundamental problem:
Differences from this reference are hard to observe and describe in a coherent way.
Variation and sequence are separated.

Pangenomic methods allow us to relate all genomes or sequences in our analysis directly to each other.
Sequence and variation are combined into a coherent data structure.
This practice is still new, and research into ways to design, implement, and apply this model is ongoing.
However, there is a growing consensus around best practices.
Many methods work on an augmented sequence graph model and use a handful of common data formats for input and
output.

The *variation graph* data model describes the all-to-all alignment of many sequences (genomes or genes for
instance) as walks through a graph whose nodes are labeled with DNA sequences.
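
To make the model concrete, here is a minimal sketch in Python. It is not the data structure of any tool listed here; the class `ToyVariationGraph` and its methods are hypothetical names chosen for illustration. Nodes carry DNA labels, and each genome is a named walk of oriented steps, so spelling out a walk recovers the original sequence.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

# A step visits a node in forward (False) or reverse (True) orientation.
Step = Tuple[int, bool]

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass
class ToyVariationGraph:
    """Illustrative only: nodes with DNA labels, edges, and embedded paths."""
    nodes: Dict[int, str] = field(default_factory=dict)
    edges: Set[Tuple[Step, Step]] = field(default_factory=set)
    paths: Dict[str, List[Step]] = field(default_factory=dict)

    def add_path(self, name: str, steps: List[Step]) -> None:
        # Every pair of consecutive steps implies an edge in the graph.
        for a, b in zip(steps, steps[1:]):
            self.edges.add((a, b))
        self.paths[name] = list(steps)

    def path_sequence(self, name: str) -> str:
        # Spell the walk: concatenate node labels, honoring orientation.
        return "".join(
            reverse_complement(self.nodes[n]) if rev else self.nodes[n]
            for n, rev in self.paths[name]
        )


# Two genomes that differ by a single base share nodes 1 and 4.
graph = ToyVariationGraph(nodes={1: "GATT", 2: "A", 3: "C", 4: "ACA"})
graph.add_path("ref", [(1, False), (2, False), (4, False)])
graph.add_path("alt", [(1, False), (3, False), (4, False)])
print(graph.path_sequence("ref"))  # GATTAACA
print(graph.path_sequence("alt"))  # GATTCACA
```

The variant (`A` versus `C`) appears as a small bubble between the shared nodes, so sequence and variation live in the same structure.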

Here, we document tools and workflows that operate on this graphical pangenomic data model.
Our goal is to provide greater clarity for students and scientists working with this new paradigm for genomic
research.

## vg

The variation graph toolkit

**vg** provides computational methods for creating and manipulating genome variation graphs. Its pangenome representation of a set of genomes overcomes reference bias and improves read mapping.
This is highlighted in the Nature Biotechnology publication.
Users can receive support on **vg**'s Biostars page.

## PanGenome Graph Builder (pggb)

This pangenome graph construction pipeline renders a collection of sequences into a pangenome graph (in the variation graph model). Its goal is to build a graph that is locally directed and acyclic while preserving large-scale variation. Maintaining local linearity is important for the interpretation, visualization, and reuse of pangenome variation graphs.
A Nextflow version of the pipeline is also available as nf-core/pangenome.

## PanGenome Graph Evaluator (pgge)

This pangenome graph evaluation pipeline measures the reconstruction accuracy of a pangenome graph (in the variation graph model). Its goal is to give guidance in finding the best pangenome graph construction tool for given input data and a given task.

## Pangenome Graph Variation Format (PGVF)

PGVF is a hard fork of the GFAv1 format that allows the description of graph-to-graph alignments. It represents a collection of aligned graphs as a network of walks through an underlying merged sequence graph. While pangenome graphs let us represent differences between genomes, we have no mechanism to represent differences between pangenome graphs, or to combine multiple pangenome graphs into one structure without losing information. This motivates the development of a new biological data format.

## xg

The succinct graph index

**xg** presents a static index of the nodes, edges, and paths of a variation graph.
**xg** can be used to annotate graph nodes with their positions relative to reference paths.
It was a key component of early development in **vg**, and was used to scale short read mapping to large genomes.
It implements the libhandlegraph API.
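
The idea of annotating nodes with path-relative positions can be sketched in a few lines of Python. This is an illustration only: **xg** relies on succinct data structures, whereas this sketch uses plain dictionaries, and the function name `path_positions` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def path_positions(node_lengths: Dict[int, int],
                   path_steps: List[int]) -> Dict[int, List[int]]:
    """Map each node to the 0-based offsets at which it starts on a path."""
    positions = defaultdict(list)
    offset = 0
    for node_id in path_steps:
        # A node may occur several times on a path, so collect every offset.
        positions[node_id].append(offset)
        offset += node_lengths[node_id]
    return dict(positions)


# Nodes of length 4, 1, and 3 visited in order by a reference path.
lengths = {1: 4, 2: 1, 4: 3}
ref_positions = path_positions(lengths, [1, 2, 4])
print(ref_positions)  # {1: [0], 2: [4], 4: [5]}
```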

## odgi

**odgi**, the Optimized Dynamic (genome) Graph Interface, links a thrifty dynamic in-memory variation graph data model to a set of algorithms designed for scalable sorting, pruning, transformation, and visualization of very large genome graphs.
**odgi** includes Python bindings that can be used to interface directly with its data model.
The odgi manual provides detailed information about its features and subcommands, including examples.

## GBWT

**GBWT - Graph BWT** is a substring index for paths in a variation graph.
It is based on the positional Burrows-Wheeler transform (PBWT) and independently implements its graph extension (gPBWT).
The **GBWT** supports extreme compression of genome sequences, requiring only 1 bit per 1 kilobasepair of sequence to store 1000 Genomes Project data.
For documentation see the **GBWT** wiki.
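
As background, the plain PBWT on a biallelic haplotype panel can be sketched as follows. This is not the graph extension and not the **GBWT**'s compressed representation; it only shows the core step of re-sorting haplotypes by their reversed prefixes at each site. The function name `pbwt_columns` is hypothetical.

```python
from typing import List, Tuple


def pbwt_columns(haplotypes: List[List[int]]) -> Tuple[List[List[int]], List[int]]:
    """Return the PBWT columns and the final positional prefix order.

    haplotypes[i][k] is the allele (0 or 1) of haplotype i at site k.
    """
    n_sites = len(haplotypes[0]) if haplotypes else 0
    order = list(range(len(haplotypes)))  # prefix order before site 0
    columns = []
    for k in range(n_sites):
        # Column k lists the alleles at site k in the current prefix order.
        columns.append([haplotypes[i][k] for i in order])
        # Stable partition by allele: zeros first, then ones.
        zeros = [i for i in order if haplotypes[i][k] == 0]
        ones = [i for i in order if haplotypes[i][k] == 1]
        order = zeros + ones
    return columns, order


panel = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
]
columns, final_order = pbwt_columns(panel)
print(columns)      # [[0, 1, 0, 1], [1, 1, 1, 0], [0, 0, 1, 0]]
print(final_order)  # [3, 0, 1, 2]
```

Haplotypes that share a long history up to a site become adjacent in the order, so the columns tend to contain long runs of identical alleles. Those runs are what make this family of indexes compressible.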

## spodgi

**SpOdgi** transforms any odgi genome variation graph file into a SPARQL-capable database.
The RDF semantics are described in the vg ontology directory.
This transformation allows us to connect variation graphs to other RDF resources, supporting their query using logic programming.
Many operations or queries that are implemented in custom code in other pangenome tools can be expressed as compact SPARQL queries executed against **SpOdgi**.

## libbdsg

**libbdsg** brings together a collection of dynamic **HandleGraph** implementations.
**PackedGraph** is designed to have a very low memory footprint.
**HashGraph** is implemented using a collection of high-performance hash tables with the goal of providing the highest possible runtime performance at the cost of increased memory usage.
For more details see the handle graph API comparison paper.
The bdsg Read the Docs page provides detailed information about starting a project with **libbdsg**, its Python interface, tutorials, and an overview of available methods.

## seqwish

The alignment to variation graph inducer

**seqwish** renders a set of sequences and alignments into the equivalent variation graph.
It accomplishes this using a number of tricks to reduce its memory footprint while maintaining a high degree of parallelism.
The result is entirely dependent on the input alignments, which it represents losslessly.

**seqwish** is generic: it can induce variation graphs from a collection of human genomes, or a set of noisy nanopore reads.
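
The core idea of graph induction can be sketched with a union-find structure over base positions. This is a toy illustration, not **seqwish**'s implementation: it assumes exact, forward-strand match blocks given as hypothetical tuples `(name_a, start_a, name_b, start_b, length)`, and it emits one node per base without compacting unbranched runs.

```python
from typing import Dict, List, Tuple

Match = Tuple[str, int, str, int, int]


def induce_graph(sequences: Dict[str, str],
                 matches: List[Match]) -> Tuple[Dict[int, str], Dict[str, List[int]]]:
    """Merge aligned bases into shared nodes; return node labels and paths."""
    # Give every base of every sequence a global index.
    offsets, total = {}, 0
    for name, seq in sequences.items():
        offsets[name] = total
        total += len(seq)
    parent = list(range(total))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Union the two bases of every matched pair (transitive closure).
    for name_a, start_a, name_b, start_b, length in matches:
        for i in range(length):
            ra = find(offsets[name_a] + start_a + i)
            rb = find(offsets[name_b] + start_b + i)
            if ra != rb:
                parent[max(ra, rb)] = min(ra, rb)

    # Each equivalence class becomes one node; sequences become paths.
    node_ids: Dict[int, int] = {}
    labels: Dict[int, str] = {}
    paths: Dict[str, List[int]] = {}
    for name, seq in sequences.items():
        steps = []
        for i, base in enumerate(seq):
            root = find(offsets[name] + i)
            node_id = node_ids.setdefault(root, len(node_ids) + 1)
            labels[node_id] = base
            steps.append(node_id)
        paths[name] = steps
    return labels, paths


seqs = {"a": "GATTACA", "b": "GATCACA"}
blocks = [("a", 0, "b", 0, 3), ("a", 4, "b", 4, 3)]  # GAT and ACA match
labels, paths = induce_graph(seqs, blocks)
print(paths["b"])  # [1, 2, 3, 8, 5, 6, 7]
```

Walking each path and concatenating the node labels returns the input sequences unchanged, which mirrors the lossless property described above.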

## smoothxg

**smoothxg** finds blocks of paths that are collinear within a variation graph. It applies partial order alignment to each block, yielding an acyclic variation graph. Then, to yield a "smoothed" graph, it walks the original paths to lace these subgraphs together. The resulting graph only contains cyclic or inverting structures larger than the chosen block size, and is otherwise manifold linear.

## maffer

maffer projects between pangenomic variation graphs (stored in GFAv1 or xg format), which can be used to encode whole genome alignments, and the multiple alignment format MAF, which represents only the linearizable components of such an alignment graph.

## GraphAligner

The sequence to graph aligner

**GraphAligner** implements a novel, high-performance alignment algorithm capable of aligning to graphs of arbitrary topological complexity with minimal overhead relative to a linear aligner.
Its seeding strategy, which is based on exact matches (minimizers) in whole nodes, limits it to longer reads.
It produces alignments in the GAM and GAF formats, which are compatible with other pangenome graph-based tools.
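
For readers unfamiliar with minimizers, the following sketch selects (w, k)-minimizers from a string. It is a textbook-style illustration, not **GraphAligner**'s code: it orders k-mers lexicographically, whereas practical tools typically use a hash-based ordering, and the function name `minimizers` is hypothetical.

```python
from typing import List, Tuple


def minimizers(seq: str, k: int, w: int) -> List[Tuple[int, str]]:
    """Return sorted (position, k-mer) pairs chosen as window minimizers.

    Each window of w consecutive k-mers contributes its smallest k-mer;
    ties are broken in favor of the leftmost occurrence.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        best = min(range(w), key=lambda j: window[j])
        selected.add((start + best, window[best]))
    return sorted(selected)


seeds = minimizers("GATTACA", k=3, w=3)
print(seeds)  # [(1, 'ATT'), (4, 'ACA')]
```

Because a read and a node sequence that share a long enough exact match also share its minimizers, matching minimizers can serve as alignment seeds while indexing only a fraction of all k-mers.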

## Pangenomic data formats

Graphical pangenomes are usually exchanged using a subset of the **GFAv1 - Graphical Fragment Assembly** format.
Graph nodes are stored in segment records (S), edges are represented in link records (L), and embedded sequences in path records (P).
Mappings to **GFA** can be encoded in **GAM** (Graph Alignment/Map format, vg's BAM equivalent) or the text-based **GAF** (Graph Alignment Format).
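
As an illustration of the S, L, and P records, here is a minimal reader for that subset of GFAv1. It is a sketch under simplifying assumptions: optional tags and overlap fields are ignored, other record types are skipped, and the function name `parse_gfa` is hypothetical.

```python
from typing import Dict, List, Tuple


def parse_gfa(text: str):
    """Parse S, L, and P records of tab-separated GFAv1 text."""
    segments: Dict[str, str] = {}
    links: List[Tuple[str, str, str, str]] = []
    paths: Dict[str, List[Tuple[str, str]]] = {}
    for line in text.splitlines():
        if not line:
            continue
        fields = line.split("\t")
        record = fields[0]
        if record == "S":
            # S <name> <sequence>
            segments[fields[1]] = fields[2]
        elif record == "L":
            # L <from> <from_orient> <to> <to_orient> <overlap>
            links.append((fields[1], fields[2], fields[3], fields[4]))
        elif record == "P":
            # P <name> <comma-separated oriented segment names> <overlaps>
            paths[fields[1]] = [(s[:-1], s[-1]) for s in fields[2].split(",")]
    return segments, links, paths


example = "\n".join("\t".join(row) for row in [
    ["H", "VN:Z:1.0"],
    ["S", "1", "GATT"],
    ["S", "2", "A"],
    ["S", "3", "C"],
    ["S", "4", "ACA"],
    ["L", "1", "+", "2", "+", "0M"],
    ["L", "1", "+", "3", "+", "0M"],
    ["L", "2", "+", "4", "+", "0M"],
    ["L", "3", "+", "4", "+", "0M"],
    ["P", "ref", "1+,2+,4+", "*"],
    ["P", "alt", "1+,3+,4+", "*"],
])
segments, links, paths = parse_gfa(example)
print(paths["alt"])  # [('1', '+'), ('3', '+'), ('4', '+')]
```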

## Sequence Tube Map

**Sequence Tube Map** is a JavaScript module for visualizing variation graphs in a *tube-map-like* layout.
It renders variation graphs using a "tube map" model in which paths representing genomes flow through the sequence nodes of the graph.
Currently, it can only handle graphs created with vg.

## Pantograph

The **Pantograph** project aims to build an interactive pangenome visualization tool for COVID-19 data that includes annotation and metadata.
In the long run, it should be capable of visualizing pangenomes of thousands of individuals and gigabase-scale genomes,
scaling from nucleotide to whole chromosome level.

## Bandage

Originally developed for assembly graph visualization, **Bandage** is an indispensable tool for visual inspection of variation graphs as well as assembly graphs.

## GfaViz

**GfaViz** is an interactive tool for the 2D visualization of sequence graphs, scaffolding graphs, alignment graphs, splicing graphs and variation graphs.
One of its unique features is the interactive 2D visualization of the paths of a graph.

## MoMI-G

**MoMI-G - MOdular Multi-scale Integrated Genome graph browser** is a *multi-view* graph browser combining the base-level differences of Sequence Tube Map with a CIRCOS plot of chromosomal-scale connections and an interval card deck to efficiently browse structural variants.
It displays evidence such as short and long read alignments, read depth, and annotations.

## vgan

**vgan** is a suite of tools for pangenomics built on top of **vg**. `Haplocart` predicts the mitochondrial haplogroup for reads originating from uncontaminated modern human samples. `Euka` scans ancient environmental DNA samples for arthropodic and tetrapodic mitochondrial DNA using a variation graph as the reference.