Practical Graphical Pangenomics
tools and workflows based on genome variation graphs
Pangenomic methods
Standard approaches to genome inference and analysis relate sequences to a single linear reference genome.
This is efficient but has a fundamental problem:
Differences from this reference are hard to observe and describe in a coherent way.
Variation and sequence are separated.
Pangenomic methods allow us to relate all genomes or sequences in our analysis directly to each other.
Sequence and variation are combined into a coherent data structure.
This practice is still new, and research into ways to design, implement, and apply this model is ongoing.
However, there is a growing consensus around best practices.
Many methods work on an augmented sequence graph model and use a handful of common data formats for input and
output.
The variation graph data model describes the all-to-all alignment of many sequences (genomes or genes, for
instance) as walks through a graph whose nodes are labeled with DNA sequences.
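To make the data model concrete, here is a minimal sketch of a variation graph held in plain Python containers. The node IDs, sequences, and path names are invented for illustration and do not come from any real dataset or tool. Two genomes share the flanking nodes and differ at a single-base bubble.

```python
# Toy variation graph. All IDs, labels, and path names are hypothetical.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

# Nodes carry DNA labels; edges connect node ends (forward strand only here).
nodes = {1: "ACG", 2: "T", 3: "G", 4: "CCA"}
edges = {(1, 2), (1, 3), (2, 4), (3, 4)}

# Each genome is a walk: an ordered list of (node ID, strand) visits.
paths = {
    "genome_a": [(1, "+"), (2, "+"), (4, "+")],
    "genome_b": [(1, "+"), (3, "+"), (4, "+")],
}

def spell(walk):
    """Concatenate node labels along a walk, honoring strand."""
    parts = []
    for node_id, strand in walk:
        label = nodes[node_id]
        parts.append(label if strand == "+" else reverse_complement(label))
    return "".join(parts)

def follows_edges(walk):
    """Check that consecutive steps of a forward-strand walk are linked."""
    ids = [node_id for node_id, _ in walk]
    return all((a, b) in edges for a, b in zip(ids, ids[1:]))

for name, walk in paths.items():
    print(name, spell(walk), follows_edges(walk))
```

Both genomes are recovered exactly by spelling their walks, while the shared nodes record where they align and the bubble (nodes 2 and 3) records where they differ.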
Here, we document tools and workflows that operate on this graphical pangenomic data model.
Our goal is to provide greater clarity for students and scientists working with this new paradigm for genomic
research.
vg
The variation graph toolkit
vg provides computational methods for creating and manipulating genome variation graphs. Its pangenome representation of a set of genomes overcomes reference bias and improves read mapping.
This is highlighted in the
Nature Biotechnology publication.
Users can receive support on
vg's Biostars page.
PanGenome Graph Builder (pggb)
This pangenome graph construction pipeline renders a collection of sequences into a pangenome graph (in the variation graph model). Its goal is to build a graph that is locally directed and acyclic while preserving large-scale variation. Maintaining local linearity is important for the interpretation, visualization, and reuse of pangenome variation graphs.
A Nextflow version of the pipeline is also available as
nf-core/pangenome.
PanGenome Graph Evaluator (pgge)
This pangenome graph evaluation pipeline measures the reconstruction accuracy of a pangenome graph (in the variation graph model). Its goal is to give guidance in finding the best pangenome graph construction tool for given input data and a given task.
Pangenome Graph Variation Format (PGVF)
PGVF is a hard fork of the GFAv1 format that allows the description of graph-to-graph alignments. It represents a collection of aligned graphs as a network of walks through an underlying merged sequence graph. While pangenome graphs let us represent differences between genomes, we have no mechanism to represent differences between pangenome graphs, or to combine multiple pangenome graphs into one structure without losing information. This motivates the development of a new biological data format.
xg
The succinct graph index
xg presents a static index of nodes, edges and paths of a variation graph.
xg can be used to annotate graph nodes with their reference path relative positions.
It was a key component of early development in
vg, and was used to scale short read mapping to large genomes.
It implements the
libhandlegraph API.
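The idea of annotating nodes with reference-path-relative positions can be illustrated with a cumulative sum over node lengths. This is only a sketch of the concept in plain Python; it does not use xg's succinct data structures or its API, and the node lengths and path are hypothetical. It also assumes that the path visits each node at most once.

```python
from itertools import accumulate

# Hypothetical node lengths (bp) and a hypothetical reference walk.
node_lengths = {1: 3, 2: 1, 3: 1, 4: 3}
reference_path = [1, 2, 4]

def path_offsets(path, lengths):
    """Map each node on the path to the 0-based offset where it starts."""
    ends = list(accumulate(lengths[node] for node in path))
    starts = [0] + ends[:-1]
    return dict(zip(path, starts))

offsets = path_offsets(reference_path, node_lengths)
print(offsets)
```

Nodes that are not on the reference path (node 3 here) receive no position, which is why tools typically report such nodes relative to the nearest reference-anchored node.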
odgi
odgi, the Optimized Dynamic (genome) Graph Interface, links a thrifty dynamic in-memory variation graph data model to a set of algorithms designed for scalable sorting, pruning, transformation, and visualization of very large genome graphs.
odgi includes
python bindings that can be used to
directly interface with its data model.
The
odgi manual provides detailed information about its features and subcommands, including examples.
GBWT
GBWT - Graph BWT
is a substring index for paths in a
variation graph.
It is based on the positional Burrows-Wheeler transform (PBWT) and independently implements its graph extension (gPBWT).
The
GBWT supports extreme compression of genome sequences, requiring only 1 bit per 1 kilobasepair of sequence to store the 1000 Genomes Project data set.
For documentation see the
GBWT wiki.
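As background on the positional Burrows-Wheeler transform mentioned above, the following sketch builds the positional prefix arrays for a small panel of binary haplotypes. It is a textbook-style illustration with invented data, not the GBWT implementation, which generalizes the idea from binary alleles to node identifiers on graph paths.

```python
def pbwt_prefix_arrays(haplotypes):
    """Return the positional prefix array for every site 0..N.

    arrays[k] lists haplotype indices sorted by their reversed prefix
    (site k-1 first, then k-2, ..., down to site 0).
    """
    n_sites = len(haplotypes[0]) if haplotypes else 0
    order = list(range(len(haplotypes)))
    arrays = [order]
    for k in range(n_sites):
        zeros = [i for i in order if haplotypes[i][k] == 0]
        ones = [i for i in order if haplotypes[i][k] == 1]
        order = zeros + ones  # stable partition on the allele at site k
        arrays.append(order)
    return arrays

# Hypothetical panel: four haplotypes over four biallelic sites.
haplotypes = [
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [0, 0, 0, 1],
]
arrays = pbwt_prefix_arrays(haplotypes)

# Alleles at site k read in prefix-array order; similar haplotypes end up
# adjacent, so these columns tend to form long runs on real data.
columns = [[haplotypes[i][k] for i in arrays[k]] for k in range(4)]
print(arrays[-1], columns)
```

Because haplotypes that share recent history sit next to each other in each ordering, the permuted columns compress well with run-length encoding, which is the intuition behind the compression figures quoted for this family of indexes.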
spodgi
SpOdgi transforms any
odgi genome variation graph file into a SPARQL capable database.
The RDF semantics are described in the
vg ontology directory.
This transformation allows us to connect variation graphs to other RDF resources, supporting their query using logic programming.
Many operations or queries that are implemented in custom code in other pangenome tools can be expressed in compact SPARQL queries executed against
SpOdgi.
libbdsg
libbdsg brings together a collection of dynamic
HandleGraph implementations.
PackedGraph is designed to have a very low memory footprint.
HashGraph is implemented using a collection of high-performance hash tables with the goal of providing the highest-possible runtime performance at the cost of increased memory usage.
For more details see the
handle graph API comparison paper. The
bdsg Read the Docs page provides detailed information about starting a project with
libbdsg, its python interface, tutorials, and an overview of available methods.
seqwish
The alignment to variation graph inducer
seqwish renders a set of sequences and alignments into the equivalent variation graph.
It accomplishes this using a number of tricks to reduce its memory footprint while maintaining a high degree of parallelism.
The result is entirely dependent on the input alignments, which it represents losslessly.
seqwish is generic: it can induce variation graphs from a collection of human genomes, or a set of noisy nanopore reads.
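The notion of inducing a graph from alignments can be sketched with a union-find structure: every base of every input sequence starts as its own element, each aligned pair of bases is merged, and each resulting class becomes a node. This is a naive in-memory illustration, not seqwish's algorithm. The match tuple format is a hypothetical simplification (exact, same-strand matches only), and nodes are left as single bases rather than compacted.

```python
def induce_graph(sequences, matches):
    """Merge aligned bases with union-find; each class becomes one node.

    sequences: dict mapping name -> DNA string
    matches:   iterable of (name_a, start_a, name_b, start_b, length),
               a hypothetical format for exact same-strand matches
    Returns (node_labels, paths); paths map name -> list of node IDs.
    """
    # Assign every base of every sequence a global integer ID.
    base, total = {}, 0
    for name, seq in sequences.items():
        base[name] = total
        total += len(seq)
    parent = list(range(total))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for name_a, start_a, name_b, start_b, length in matches:
        for offset in range(length):
            a = find(base[name_a] + start_a + offset)
            b = find(base[name_b] + start_b + offset)
            if a != b:
                parent[max(a, b)] = min(a, b)

    # Number the classes in order of first appearance and record the walks.
    node_ids, node_labels, paths = {}, [], {}
    for name, seq in sequences.items():
        walk = []
        for i, char in enumerate(seq):
            root = find(base[name] + i)
            if root not in node_ids:
                node_ids[root] = len(node_labels)
                node_labels.append(char)
            walk.append(node_ids[root])
        paths[name] = walk
    return node_labels, paths

sequences = {"x": "ACGTA", "y": "ACCTA"}
matches = [("x", 0, "y", 0, 2), ("x", 3, "y", 3, 2)]
node_labels, paths = induce_graph(sequences, matches)
print(node_labels, paths)
```

Spelling each path from the node labels returns the input sequence unchanged, which mirrors the lossless property described above: the graph contains exactly what the sequences and alignments say, no more and no less.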
smoothxg
smoothxg finds blocks of paths that are collinear within a variation graph. It applies partial order alignment to each block, yielding an acyclic variation graph. Then, to yield a "smoothed" graph, it walks the original paths to lace these subgraphs together. The resulting graph only contains cyclic or inverting structures larger than the chosen block size, and is otherwise manifold linear.
maffer
maffer projects between pangenomic variation graphs (stored in
GFAv1 or
xg format), which can be used to encode whole genome alignments, and the multiple alignment format
MAF, which represents only the linearizable components of such an alignment graph.
GraphAligner
The sequence to graph aligner
GraphAligner implements a novel,
high performance alignment algorithm capable of aligning to graphs of arbitrary topological complexity with minimal overhead relative to a linear aligner.
Its seeding strategy, which is based on exact matches (minimizers) in whole nodes, limits it to longer reads.
It produces GAM and GAF alignment formats compatible with other pangenome graph based tools.
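For readers unfamiliar with minimizer seeding, the sketch below selects (w, k)-minimizers from a sequence using plain lexicographic order. It illustrates the general technique only and is not GraphAligner's implementation; the parameter values and sequences are arbitrary, and production tools usually order k-mers by a hash rather than alphabetically.

```python
def minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs chosen as (w, k)-minimizers.

    In every window of w consecutive k-mers, the lexicographically
    smallest k-mer is selected (leftmost on ties).
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        best = min(window, key=lambda i: (kmers[i], i))
        selected.add((best, kmers[best]))
    return sorted(selected)

node_sequence = "ACGTTGCA"        # hypothetical node label
read = "TTACGTTGCAGG"             # hypothetical read containing that label

node_seeds = minimizers(node_sequence, k=3, w=3)
read_seeds = minimizers(read, k=3, w=3)
shared = {kmer for _, kmer in node_seeds} & {kmer for _, kmer in read_seeds}
print(node_seeds, shared)
```

Any exact match of at least w + k - 1 bases is guaranteed to share a minimizer, so a seed must fit inside a matching stretch of that length within a node. That requirement is one way to see why this style of seeding favors longer reads.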
Pangenomic data formats
Graphical pangenomes are usually exchanged using a subset of
GFAv1 - Graphical Fragment Assembly format.
Graph nodes are stored in sequence records (S), edges represented in link (L) records, and embedded sequences in path records (P).
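The three record types can be read with a few lines of Python. The sketch below parses a small, invented GFAv1 document held in a string; it handles only the S, L, and P records described above and ignores optional tags and other record types, so it should not be taken as a complete GFA parser.

```python
# A tiny, invented GFAv1 document (tab-separated fields).
GFA_TEXT = "\n".join([
    "H\tVN:Z:1.0",
    "S\t1\tACG",
    "S\t2\tT",
    "S\t3\tG",
    "S\t4\tCCA",
    "L\t1\t+\t2\t+\t0M",
    "L\t1\t+\t3\t+\t0M",
    "L\t2\t+\t4\t+\t0M",
    "L\t3\t+\t4\t+\t0M",
    "P\tgenome_a\t1+,2+,4+\t*",
    "P\tgenome_b\t1+,3+,4+\t*",
])

def parse_gfa(text):
    """Collect segments, links, and paths from GFAv1 text."""
    segments, links, paths = {}, [], {}
    for line in text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S":
            # S <name> <sequence>
            segments[fields[1]] = fields[2]
        elif fields[0] == "L":
            # L <from> <from_orient> <to> <to_orient> <overlap>
            links.append((fields[1], fields[2], fields[3], fields[4]))
        elif fields[0] == "P":
            # P <name> <comma-separated oriented segment names> <overlaps>
            steps = fields[2].split(",")
            paths[fields[1]] = [(step[:-1], step[-1]) for step in steps]
    return segments, links, paths

segments, links, paths = parse_gfa(GFA_TEXT)
print(len(segments), len(links), paths["genome_b"])
```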
Mappings to
GFA can be encoded in
GAM (Graph Alignment/Map format,
vg's BAM equivalent) or the text-based
GAF (Graph Alignment Format).
Sequence Tube Map
Sequence Tube Map is a JavaScript module for visualizing variation graphs in a
tube-map-like layout.
It renders variation graphs using a
"tube map" model in which paths representing genomes flow through the sequence nodes of the graph.
Currently, it can only handle graphs created with
vg.
Pantograph
The
Pantograph project
aims to build an interactive pangenome visualization tool for COVID-19 data that includes annotation and metadata.
In the long run, it should be capable of visualizing a pangenome of thousands of individuals and gigabase genomes,
scaling from nucleotide to whole chromosome level.
Bandage
Originally developed for assembly graph visualization,
Bandage is an indispensable tool for visual inspection of variation graphs as well as assembly graphs.
GfaViz
GfaViz is an interactive tool for the 2D visualization of sequence graphs, scaffolding graphs, alignment graphs, splicing graphs and variation graphs.
One of its unique features is the interactive 2D visualization of the paths of a graph.
MoMI-G
MoMI-G - MOdular Multi-scale Integrated Genome graph browser is a
multi-view
graph browser combining the base-level differences of
Sequence Tube Map
with a
CIRCOS plot of chromosomal-scale connections and an interval card deck to efficiently browse structural variants.
It displays evidence such as short and long read alignments, read depth, and annotations.
vgan
vgan is a suite of tools for pangenomics built on top of
vg. `Haplocart` predicts the mitochondrial haplogroup for reads originating from uncontaminated modern human samples. `Euka` scans ancient environmental DNA samples for arthropod and tetrapod mitochondrial DNA using a variation graph as the reference.