Extract selected loci¶
Synopsis¶
Pangenome graphs model the full set of genomic elements in a gives species or clade. Nevertheless, downstream analyses
may require focusing on specific pangenomic regions. odgi allows to extract loci of interest from the pangenome graph,
resulting in sub-graphs of the full pangenome graph. It is demonstrated that one can work with such sub-graphs as easily as
handling the full ones.
Steps¶
Build the Lipoprotein A graph¶
Assuming that your current working directory is the root of the odgi project, to construct an odgi file from the
LPA dataset in GFA format, execute:
odgi build -g test/LPA.gfa -o LPA.og
The command creates a file called LPA.og, which contains the input graph in odgi format. This graph contains
13 contigs from 7 haploid human genome assemblies from 6 individuals plus the chm13 cell line. The contigs cover the
Lipoprotein A (LPA) locus, which encodes the
Apo(a) protein.
Display path names¶
The path’s names are:
odgi paths -i LPA.og -L
chm13__LPA__tig00000001
HG002__LPA__tig00000001
HG002__LPA__tig00000005
HG00733__LPA__tig00000001
HG00733__LPA__tig00000008
HG01358__LPA__tig00000002
HG01358__LPA__tig00000010
HG02572__LPA__tig00000005
HG02572__LPA__tig00000001
NA19239__LPA__tig00000002
NA19239__LPA__tig00000006
NA19240__LPA__tig00000001
NA19240__LPA__tig00000012
Obtain the Lipoprotein A variants¶
To see the variants for the two contigs of the HG02572 sample, execute:
zgrep -v '^##' test/LPA.chm13__LPA__tig00000001.vcf.gz | head -n 9 | cut -f 1-9,16,17 | column -t
The pangenome graphs embed the full mutual relationships between the embedded genomes and their variation. In the following example,
the variants were called with respect to the chm13__LPA__tig00000001 contig, which was used as the reference path.
#CHROM                   POS   ID      REF  ALT  QUAL  FILTER  INFO                        FORMAT  HG02572__LPA__tig00000001  HG02572__LPA__tig00000005
chm13__LPA__tig00000001  6     >3>6    T    A    60    .       AC=1;AF=0.5;AN=2;LV=0;NS=2  GT      1                          0
chm13__LPA__tig00000001  134   >6>8    G    GT   60    .       AC=1;AF=0.5;AN=2;LV=0;NS=2  GT      1                          0
chm13__LPA__tig00000001  156   >8>11   A    G    60    .       AC=1;AF=0.5;AN=2;LV=0;NS=2  GT      0                          1
chm13__LPA__tig00000001  252   >11>13  GT   G    60    .       AC=2;AF=1;AN=2;LV=0;NS=2    GT      1                          1
chm13__LPA__tig00000001  265   >13>16  T    A    60    .       AC=2;AF=1;AN=2;LV=0;NS=2    GT      1                          1
chm13__LPA__tig00000001  371   >16>18  AT   A    60    .       AC=2;AF=1;AN=2;LV=0;NS=2    GT      1                          1
chm13__LPA__tig00000001  996   >18>21  AG   ATT  60    .       AC=2;AF=1;AN=2;LV=0;NS=2    GT      1                          1
chm13__LPA__tig00000001  1050  >21>23  G    GT   60    .       AC=1;AF=0.5;AN=2;LV=0;NS=2  GT      1                          0
The ID field in the VCF lists the nodes involved in the variant. A > means that the node is visited in forward
orientation, a < means that the node is visited in reverse orientation.
Extract a sub-graph with a variant inside¶
The insertion at position 1050 (G > GT, the last line in the VCF snippet above) is present only in one of the  HG02572’s contig (HG02572__LPA__tig00000001).
To extract the sub-graph where this insertion falls, execute:
odgi extract -i LPA.og -n 23 -c 1 -o LPA.21_23_G_GT.og
The instruction extracts:
- The node with ID 23 ( - -n 23),
- the nodes reachable from this node following a single edge ( - -c 1) in the graph topology,
- the edges connecting all the extracted nodes, and 
- the paths traversing all the extracted nodes. 
To have basic information on the sub-graph, execute:
odgi stats -i LPA.21_23_G_GT.og -S
#length nodes   edges   paths
644     5       6       3
The extracted path’s names are:
odgi paths -i LPA.21_23_G_GT.og -L
chm13__LPA__tig00000001:997-1640
HG02572__LPA__tig00000005:999-1641
HG02572__LPA__tig00000001:1035-1678
The sub-graph contains the contig used as a reference in the VCF file, and the two HG02572’s contigs.
Visualize the sub-graph¶
To visualize the sub-graph, we can also use external tools as Bandage, which
supports grpahs in GFA format. To covert the graph in odgi format in a graph in GFA format, execute:
odgi view -i LPA.21_23_G_GT.og -g > LPA.21_23_G_GT.gfa
Then, open the LPA.21_23_G_GT.gfa file with Bandage.
 
The image shows the graph topology, where each colored rectangle represents a node. In particular, three paths support
nodes with ID 21 and 23, and only one path supports the node with ID 22. The node with ID 22 represents in the graph the
additional nucleotide T presents in the HG02572__LPA__tig00000001 contig as an insertion.
Get the Human chr6 dataset¶
Download the pangenome graph of the Human chromosome 6
in GFA format, decompress it, and convert it to a graph in odgi format:
wget -c https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/scratch/2021_05_06_pggb/gfas/chr6.pan.gfa.gz
gunzip chr6.pan.gfa.gz
odgi build -g chr6.pan.gfa -o chr6.pan.og --threads 2 -P
The last command creates a file called chr6.pan.og, which contains the input graph in odgi format. This graph contains contigs of
88 haploid, phased human genome assemblies from 44 individuals, plus the chm13 and GRCh38 reference genomes.
Extract the MHC locus¶
The major histocompatibility complex (MHC) is a large locus in vertebrate DNA containing a set of closely linked polymorphic genes that code for cell surface proteins essential for the adaptive immune system. In humans, the MHC region occurs on chromosome 6. The human MHC is also called the HLA (human leukocyte antigen) complex (often just the HLA).
Assuming that your current working directory is the root of the odgi project, to see the coordinates of some HLA genes,
execute:
head test/chr6.HLA_genes.bed -n 5
grch38#chr6     29722775        29738528        HLA-F
grch38#chr6     29826967        29831125        HLA-G
grch38#chr6     29941260        29945884        HLA-A
grch38#chr6     30489509        30494194        HLA-E
grch38#chr6     31268749        31272130        HLA-C
The coordinates are expressed with respect to the GRCh38 reference genome.
Extract a sub-graph with the HLA genes¶
To extract the sub-graph containing all the HLA genes annotated in the chr6.HLA_genes.bed file, execute:
odgi extract -i chr6.pan.og -o chr6.pan.MHC.og -b chr6.HLA_genes.bed  -E --threads 2 -P
The instruction extracts:
- The nodes belonging to the - grch38#chr6path ranges specified in the the- chr6.HLA_genes.bedfile via- -b,
- all nodes between the min and max positions touched by the given path ranges, also if they belong to other paths ( - -E),
- the edges connecting all the extracted nodes, and 
- the paths traversing all the extracted nodes. 
To have basic information on the sub-graph, execute:
odgi stats -i chr6.pan.MHC.og -S
#length     nodes   edges   paths
3896981     216352  297890  97
There are 97 paths in the sub-graph. This means that for few individuals, more than one contig covers the MHC locus.
Visualize the sub-graph¶
To visualize the sub-graph with odgi, execute:
odgi sort -i chr6.pan.MHC.og -o - -O | \
    odgi viz -i - -o chr6.pan.MHC.png -s '#' -P 20
To obtain the following PNG image:
 
In this 1-Dimensional visualization all contigs of the same haplotype are represented with the same color (-s '#').
The majority of the haplotypes has one contig covering the whole locus, meanwhile in few of them, the locus is split
in several contigs. We had to apply odgi sort here in order to optimize (-O) our sub-graph. This ensures that
the node identifier space is compacted from one to the number of nodes in the sub-graph.
To see the haplotypes sorted by the number of contigs covering the MHC locus, execute:
odgi paths -i chr6.pan.MHC.og -L | cut -f 1,2 -d '#' | uniq -c | sort -k 1nr | head
3 HG00733#2
2 HG00673#2
2 HG01071#1
2 HG02630#2
2 HG02818#2
2 HG03516#2
1 chm13#chr6:29595119-32911317
1 grch38#chr6:29722774-33089734
1 HG00438#1
1 HG00438#2