Scalable genomic alignment with Progressive Cactus
Dec 9, 2020

How progressive alignment makes it possible to efficiently align hundreds to thousands of large genomes.

Multiple genome alignment is an important method in comparative genomics and evolutionary studies. It attempts to map every region in each input genome to the corresponding segments in every other genome. Such alignments help researchers understand the relationships between those segments and unlock key insights into genome evolution.

With the growing number of published genomic sequences, many studies seek to analyse increasingly large sets of complex genomes. Multiple genome alignment tools therefore need to scale to ever-growing sets of input genomes.

An important class of multiple genome alignment tools is reference-free aligners, also known as non-reference-based aligners, which do not require a reference sequence to construct the alignment. One such tool, Cactus, provides highly accurate alignment results and has been shown to outperform its peers.

The original implementation of Cactus dates back to 2012 and has since been used in many genomic projects and studies. The runtime of Cactus, however, grows quadratically with the total number of input bases, which means that it cannot, for example, be used to align more than 10 large vertebrate genomes.

Progressive Cactus is an extension of the Cactus aligner designed to perform well on large sets of input genomes (hundreds to thousands of large genomes). Unlike its predecessor, Progressive Cactus implements a linear-time "progressive" algorithm: it recursively breaks the multiple alignment problem down into smaller subproblems, then aligns the resulting sub-alignments back together to form the final alignment.
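The recursive decomposition can be illustrated with a toy sketch. This is not the Progressive Cactus implementation. It assumes the subproblems are organised as a tree whose leaves are the input genomes, and the names Node, toy_align, and progressive_align are hypothetical. The "alignment" step here only records which sub-results were combined, to show the order in which subproblems are solved.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A tree node: leaves hold input genomes, internal nodes hold sub-alignments."""
    name: str
    children: List["Node"] = field(default_factory=list)
    sequence: Optional[str] = None  # set for leaves; filled in for internal nodes


def toy_align(child_results: List[str]) -> str:
    """Hypothetical stand-in for a real alignment step: records what was combined."""
    return "(" + "+".join(child_results) + ")"


def progressive_align(node: Node, log: Optional[List[str]] = None) -> str:
    """Post-order traversal: solve each child's subproblem, then combine at the parent."""
    if not node.children:
        return node.sequence  # a leaf is already "solved"
    child_results = [progressive_align(child, log) for child in node.children]
    node.sequence = toy_align(child_results)
    if log is not None:
        log.append(node.name)  # one subproblem solved per internal node
    return node.sequence


# Four illustrative leaf genomes grouped into two subproblems, then merged at the root.
tree = Node("root", [
    Node("anc1", [Node("human", sequence="human"), Node("chimp", sequence="chimp")]),
    Node("anc2", [Node("mouse", sequence="mouse"), Node("rat", sequence="rat")]),
])
order: List[str] = []
result = progressive_align(tree, log=order)
print(order)   # ['anc1', 'anc2', 'root']
print(result)  # ((human+chimp)+(mouse+rat))
```

In this sketch, a binary tree over n leaf genomes has n - 1 internal nodes, so the number of subproblems grows linearly with the number of genomes, and each subproblem only involves the results of its immediate children.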

Bottom line

Genome alignment is the sine qua non of comparative genomics and evolutionary studies. As the scale of such studies increases, genome alignment tools must continually improve to cope with the ever-growing complexity of multiple genome alignment problems.

By implementing the progressive alignment strategy, Progressive Cactus becomes suitable for aligning hundreds to thousands of large input genomes, providing the opportunity to uncover new insights into genome evolution and natural history.
