The Story of kSNP4

Phylogenetic analysis, the process of determining the relationships of organisms to each other and to their common ancestors, is not only an essential part of understanding evolution; it is also an essential part of tracing and understanding outbreaks of pathogenic microorganisms. Molecular phylogenetics uses the sequences of genes and proteins to determine those relationships. For microorganisms, analyzing gene sequences provided the most reliable information.

As genes diverge, they occasionally experience small insertions and deletions, so the first step in phylogenetic analysis of genes is to align the gene sequences: gaps are introduced so that homologous sites line up and can be compared by writing them above each other. For a long time the development of sophisticated alignment programs was a major focus of effort in the fields of Evolution, Systematics and Bioinformatics. Those programs could align sequences of a few thousand nucleotides. In the 1990s new technologies permitted sequencing not just individual genes but entire microbial genomes consisting of five to six million nucleotides.
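To make "writing homologous sites above each other" concrete, here is a minimal Python sketch. It assumes the sequences have already been aligned (the gaps, shown as "-", are given, not computed), and the function name and toy sequences are invented for illustration. It simply reports the alignment columns in which the sequences differ.

```python
def variable_columns(aligned):
    """Return (column_index, column_string) for columns where sequences differ.

    `aligned` maps a sequence name to an already-aligned sequence.
    Gap characters ("-") are ignored when deciding whether a column varies.
    """
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")

    result = []
    # zip(*...) walks the alignment column by column, i.e. site by site.
    for index, column in enumerate(zip(*aligned.values())):
        residues = set(column) - {"-"}
        if len(residues) > 1:
            result.append((index, "".join(column)))
    return result


# Toy alignment (hypothetical data): seqA carries a one-base deletion.
toy_alignment = {
    "seqA": "ATGCC-GTA",
    "seqB": "ATGACTGTA",
    "seqC": "ATGCCTGTA",
}
print(variable_columns(toy_alignment))
```

In this toy case only column 3 is reported; the gapped column is skipped because the remaining bases agree.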

It was initially assumed that aligning genome sequences would just require improved alignment algorithms and bigger, faster computers. That turned out not to be the case. Aligning more than 30 genomes was not possible, so it appeared that genome-based phylogenetic analysis would be impossible for any but the smallest data sets.

One of the reasons for the problem was the unexpected finding that the genomes of different isolates of the same species differed from each other far more than anticipated. Not only did they differ by base substitutions and insertions/deletions, they differed enormously in gene content. There were some genes, the core genes, that were present in all isolates of a species, but there were also thousands of genes that were present in some isolates but missing in others. Even the core genes were not present in the same order in different genomes. As complete genome sequences continued to be added to the NCBI databases at an accelerating rate, researchers were in the position of starving while staring at a banquet over an insurmountable wall.

Dr. Shea Gardner of the Lawrence Livermore Laboratory’s Computations/Global Security division urgently needed to analyze hundreds of genomes of organisms that might be used in biowarfare. Completely frustrated by the situation, she had an insight that would change the fields of Bioinformatics and Phylogenetics forever: genome alignment was just one method of identifying homologous sites, and any method that would serve that purpose would meet her needs. She devised an algorithm that would identify the homologous sites that differ among genomes without genome alignment. She reasoned that if a short sequence of 13 to 25 nucleotides, called a kmer, was identical at all but one base in each genome, it could not be identical by chance, and the kmer must be homologous wherever it occurred in the different genomes. The non-identical base, called a Single Nucleotide Polymorphism, or SNP, was a homologous site to be used as information from which to estimate a phylogeny. SNPs, written above each other, were exactly equivalent to the homologous sites in an alignment.
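The idea can be illustrated with a short Python sketch. This is not the kSNP source code; it is a deliberately simplified illustration under several assumptions that are not stated in the story above: the variable base is taken to be the central base of an odd-length kmer, reverse-complement strands are ignored, a site is kept only if its flanking sequence occurs in every genome, and any flank that appears with conflicting central bases within one genome is discarded. The function name and the toy genomes are hypothetical.

```python
from collections import defaultdict


def find_kmer_snps(genomes, k=13):
    """Find SNP sites without alignment (simplified illustration).

    `genomes` maps a genome name to its sequence. Returns a dict mapping
    (left_flank, right_flank) to {genome_name: central_base} for every
    site whose flanks are shared by all genomes and whose central base
    differs in at least one of them.
    """
    if k % 2 == 0:
        raise ValueError("k must be odd so that each kmer has a central base")
    half = k // 2

    # flanks -> genome name -> set of central bases seen with those flanks
    sites = defaultdict(lambda: defaultdict(set))
    for name, seq in genomes.items():
        for start in range(len(seq) - k + 1):
            kmer = seq[start:start + k]
            flanks = (kmer[:half], kmer[half + 1:])
            sites[flanks][name].add(kmer[half])

    snps = {}
    for flanks, by_genome in sites.items():
        if len(by_genome) != len(genomes):
            continue  # flanks missing from at least one genome
        if any(len(bases) != 1 for bases in by_genome.values()):
            continue  # ambiguous: conflicting copies within one genome
        alleles = {name: next(iter(bases)) for name, bases in by_genome.items()}
        if len(set(alleles.values())) > 1:
            snps[flanks] = alleles
    return snps


# Toy genomes (hypothetical data); k=5 keeps the example readable.
toy_genomes = {
    "isolate1": "GATTACAGGCT",
    "isolate2": "GATTGCAGGCT",
    "isolate3": "GATTACAGGCT",
}
for (left, right), alleles in find_kmer_snps(toy_genomes, k=5).items():
    print(left + "." + right, alleles)
```

Listing the central bases for each site, one genome per row, gives the same kind of column that an alignment would provide, which is why such SNPs can be fed to tree-building methods.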

Shea was an expert programmer, so she developed a program called kSNP that could identify those homologous SNPs by searching the genome sequences. The program not only eliminated the need for alignment, it also calculated phylogenetic trees; that is, it completely met her needs. She shared the kSNP program with a few friends, who found it so useful that they insisted that she publish a paper about it (Gardner SN, Slezak T (2010) Scalable SNP analyses of 100+ bacterial or viral genomes. J Forensic Science 1: 107–111).


The paper immediately fell into a black hole of obscurity and, because the journal was virtually unknown and not read by anyone in bioinformatics or related fields, was cited only about 8 times in the subsequent two years. In early 2013 Dr. Barry Hall, Director of the Bellingham Research Institute, came upon one of the papers that cited it, which mentioned genome analysis without alignment, and, intrigued, he obtained a copy of the original paper. Realizing the importance of this paper to the field, he contacted Dr. Gardner, obtained a download link from her, and downloaded the program. It immediately became obvious why kSNP was not widely used: (1) the documentation consisted of just two terse pages, (2) only a really expert programmer could apply kSNP because using it required modifying the source code, and (3) the program consisted of over 45 individual programs that were controlled by a BASH script. Very few people were qualified to use kSNP.

Dr. Hall contacted Dr. Gardner again and offered (1) to write detailed documentation and (2) to compile the individual programs so that users need have no programming skills to use kSNP. Dr. Gardner enthusiastically accepted his offer and suggested that, since she was already working on a new version, they work together on kSNP2. On April 2, 2013 Dr. Hall and Dr. Gardner agreed to collaborate on producing a well-documented, user-friendly, updated kSNP2. The goal was to provide a program that could be used by beginning graduate students with no programming or bioinformatics experience. Within two months they released kSNP2 on SourceForge, and by July 16, 2013 they had submitted a paper describing kSNP2 to PLoS ONE (Gardner, S.N. and Hall, B.G. 2013. PLoS ONE 8(12): e81760. doi:10.1371/journal.pone.0081760). That paper was accepted in mid-October, was quickly embraced by the Bioinformatics community, and within two years had been cited over 200 times.

They soon began work on kSNP3, an improved version that was significantly faster than kSNP2 and featured annotation of the SNPs. Annotation provided, for each SNP, the gene in which it occurred and the position in that gene, the protein that was encoded, the amino acid in which the SNP occurred, and whether the SNP changed that amino acid. kSNP3 was released in February of 2015 and the paper describing kSNP3 (Gardner, S.N., T. Slezak, and B.G. Hall. 2015. Bioinformatics 31: 2877-2878. doi: 10.1093/bioinformatics/btv271) was published in April 2015. kSNP3 has been downloaded over 7800 times and the paper has been cited over 450 times. Exactly as Dr. Hall anticipated in 2013, kSNP has become a mainstay of bioinformatics and is taught in many Bioinformatics courses throughout the world.
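The last part of that annotation, deciding whether a SNP changes the encoded amino acid, can be sketched in a few lines of Python. This is only an illustration of the concept, not how kSNP3 performs annotation. It assumes the standard genetic code, a coding sequence given on its coding strand and starting at the first base of the start codon, and a 0-based SNP position within that sequence; the function name and the toy sequence are hypothetical.

```python
# Standard genetic code, written compactly with the bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + m]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for m, c in enumerate(BASES)
}


def annotate_snp(cds, position, alt_base):
    """Describe the effect of substituting `alt_base` at `position` in `cds`.

    `cds` is a coding sequence in frame from its first base;
    `position` is 0-based within `cds`.
    """
    codon_start = position - position % 3
    ref_codon = cds[codon_start:codon_start + 3]
    if len(ref_codon) < 3:
        raise ValueError("position falls in an incomplete final codon")
    offset = position % 3
    alt_codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    return {
        "codon_number": codon_start // 3 + 1,  # 1-based, as proteins are numbered
        "ref_amino_acid": ref_aa,
        "alt_amino_acid": alt_aa,
        "changes_amino_acid": ref_aa != alt_aa,
    }


toy_cds = "ATGGCTGAA"  # hypothetical gene fragment encoding Met-Ala-Glu
print(annotate_snp(toy_cds, 5, "C"))  # GCT -> GCC, third codon position
print(annotate_snp(toy_cds, 7, "T"))  # GAA -> GTA, second codon position
```

The first substitution leaves alanine unchanged, while the second replaces glutamate with valine, which is the distinction the annotation reports.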

Shortly after that paper appeared, tragedy struck. Shea contacted Dr. Hall to inform him that she had been diagnosed with terminal stage 4 lung cancer. Shea was a vibrant, vital lady who had never smoked and whose idea of a good time was a 10 mile hike with a 600 foot elevation gain. Shea asked Dr. Hall to take over full responsibility for curating kSNP3, which he, of course, agreed to do. Dr. Shea Gardner, born March 3, 1969, died on February 21, 2016.

In 2017 NCBI discontinued the use of gi numbers in its sequence files, including genome sequence files. kSNP3 depended upon those gi numbers to retrieve the information required to annotate SNPs. As a result the annotation feature of kSNP3 was lost. Dr. Gardner could certainly have quickly fixed the problem, but Dr. Hall lacked the programming expertise to do so. Nevertheless kSNP3 continued to enjoy robust use for its primary purpose of phylogenetic analysis of whole genome sequences.

In June of 2022 Jeremiah Nisbet was visiting Dr. Hall, who told him the story above. Jeremiah is an expert, experienced programmer and soon offered to try to fix the annotation problem. Not only did he identify and fix that problem, he rewrote major portions of kSNP3, and on October 26, 2022 they released kSNP4. kSNP4 restores the full annotation feature and is about 2.5 times faster than kSNP3. kSNP4, with updated documentation, is freely available at https://sourceforge.net/projects/ksnp/files/.


