odgi sort

sort a variation graph

SYNOPSIS

odgi sort [-i, –idx=FILE] [-o, –out=FILE] [OPTION]…

DESCRIPTION

The odgi sort(1) command sorts a succinct variation graph. Odgi sort offers a diverse palette of sorting algorithms to determine the node order:

  • A topological sort: A graph can be sorted via breadth-first search (BFS) or depth-first search (DFS). Optionally, a chunk size specifies how much of the graph to grab at once in each topological sorting phase. The sorting algorithm will continue the sort from the next node in the prior graph order that has not been sorted, yet. The cycle breaking algorithm applies a DFS sort until a cycle is found. We break and start a new DFS sort phase from where we stopped.

  • A random sort: The graph is randomly sorted. The node order is randomly shuffled from Mersenne Twister pseudo-random generated numbers.

  • A 1D linear SGD sort: Odgi implements a 1D linear, variation graph adjusted, multi-threaded version of the Graph Drawing by Stochastic Gradient Descent algorithm. The force-directed graph drawing algorithm minimizes the graph’s energy function or stress level. It applies stochastic gradient descent (SGD) to move a single pair of nodes at a time.

  • A path guided, 1D linear SGD sort: Odgi implements a 1D linear, variation graph adjusted, multi-threaded version of the Graph Drawing by Stochastic Gradient Descent algorithm. The force-directed graph drawing algorithm minimizes the graph’s energy function or stress level. It applies stochastic gradient descent (SGD) to move a single pair of nodes at a time. The path index is used to pick the terms to move stochastically. If ran with 1 thread only, the resulting order of the graph is deterministic. The seed is adjustable.

Sorting the paths in a graph my refine the sorting process. For the users’ convenience, it is possible to specify a whole pipeline of sorts within one parameter.

OPTIONS

Graph Files IO

-i, –idx=FILE
File containing the succinct variation graph to sort. The file name usually ends with .og.
-o, –out=FILE
Write the sorted dynamic succinct variation graph to this file. A file ending with .og is recommended.
-s, –sort-order=FILE
File containing the sort order. Each line contains one node identifier.

Topological Sorts

-b, –breadth-first
Use a (chunked) breadth first topological sort.
-B, –breadth-first-chunk=N
Chunk size for breadth first topological sort. Specify how many nucleotides to grap at once in each BFS phase.
-z, –depth-first
Use a (chunked) depth first topological sort.
-Z, –depth-first-chunk=N
Chunk size for the depth first topological sort. Specify how many nucleotides to grap at once in each DFS phace.
-w, –two-way
Use a two-way topological algorithm for sorting. It is a maximum of head-first and tail-first topological sort.
-n, –no-seeds
Don’t use heads or tails to seed topological sort.
-c, –cycle-breaking
Use a cycle breaking sort.

Random Sort

-r, –random
Randomly sort the graph.

Path Guided 1D Linear SGD Sort

-Y, –path-sgd
Apply path guided 1D linear SGD algorithm to organize the graph.
-X, –path-index=FILE
Load the path index from this FILE.
-f, –path-sgd-use-paths=FILE
Specify a line separated list of paths to sample from for the on the fly term generation process in the path guided linear 1D SGD. The default value are all paths.
-G, –path-sgd-min-term-updates-paths=N
The minimum number of terms to be updated before a new path guided linear 1D SGD iteration with adjusted learning rate eta starts, expressed as a multiple of total path steps. The default value is 1.0. Can be overwritten by -U, -path-sgd-min-term-updates-nodes=N.
-U, –path-sgd-min-term-updates-nodes=N
The minimum number of terms to be updated before a new path guided linear 1D SGD iteration with adjusted learning rate eta starts, expressed as a multiple of the number of nodes. Per default, the argument is not set. The default of -G, path-sgd-min-term-updates-paths=N is used).
-j, –path-sgd-delta=N
The threshold of maximum displacement approximately in bp at which to stop path guided linear 1D SGD. Default values is 0.0.
-g, –path-sgd-eps=N
The final learning rate for path guided linear 1D SGD model. The default value is 0.01.
-v, –path-sgd-eta-max=N
The first and maximum learning rate for path guided linear 1D SGD model. The default value is squared steps of longest path in graph.
-a, –path-sgd-zipf-theta=N
The theta value for the Zipfian distribution which is used as the sampling method for the second node of one term in the path guided linear 1D SGD model. The default value is 0.99.
-x, –path-sgd-iter-max=N
The maximum number of iterations for path guided linear 1D SGD model. The default value is 30.
-F, –iteration-max-learning-rate=N
The iteration where the learning rate is max for path guided linear 1D SGD model. The default value is 0.
-k, –path-sgd-zipf-space=N
The maximum space size of the Zipfian distribution which is used as the sampling method for the second node of one term in the path guided linear 1D SGD model. The default value is the longest path length.
-I, –path-sgd-zipf-space-max=N
The maximum space size of the Zipfian distribution beyond which quantization occurs. Default value is 100.
-l, –path-sgd-zipf-space-quantization-step=N
Quantization step size when the maximum space size of the Zipfian distribution is exceeded. Default value is 100.
-y, –path-sgd-zipf-max-num-distributions=N
Approximate maximum number of Zipfian distributions to calculate. The default value is 100.
-q, –path-sgd-seed=N
Set the seed for the deterministic 1-threaded path guided linear 1D SGD model. The default value is pangenomic!.
-u, –path-sgd-snapshot=STRING
Set the prefix to which each snapshot graph of a path guided 1D SGD iteration should be written to. This is turned off per default. This argument only works when -Y, –path-sgd was specified. Not applicable in a pipeline of sorts.

Path Sorting Options

-L, –paths-min
Sort paths by their lowest contained node identifier.
-M, –paths-max
Sort paths by their highest contained node identifier.
-A, –paths-avg
Sort paths by their average contained node identifier.
-R, –paths-avg-rev
Sort paths in reverse by their average contained node identifier.
-D, –path-delim=path-delim
Sort paths in bins by their prefix up to this delimiter.

Pipeline Sorting

-p, –pipeline=STRING
Apply a series of sorts, based on single character command line arguments given to this command. The default sort is s. The reverse sort would be specified via f.

Additional Parameters

-d, –dagify-sort
Sort on the basis of a DAGified graph.
-O, –Optimize
Use the MutableHandleGraph::optimize method to compact the node identifier space.

Threading

-t, –threads=N
Number of threads to use for the parallel operations.

Processing Information

-P, –progress
Print sort progress to stdout.

Program Information

-h, –help
Print a help message for odgi sort.