The Pfaffian Structure of CFN Phylogenetic Networks

This page contains auxiliary files to the article:
Joseph Cummings, Elizabeth Gross, Benjamin Hollering, Samuel Martin, Ikenna Nometa: The Pfaffian structure of CFN phylogenetic networks

Abstract: Algebraic techniques in phylogenetics have historically been successful at proving identifiability results and have also led to novel reconstruction algorithms. In this paper, we study the ideal of phylogenetic invariants of the Cavender-Farris-Neyman (CFN) model on a phylogenetic network with the goal of providing a description of the invariants which is useful for network inference. It was previously shown that to characterize the invariants of any level-1 network, it suffices to understand all sunlet networks, which are those consisting of a single cycle with a leaf at each vertex. We show that the parameterization of the affine open patch of the CFN sunlet model, which intersects the probability simplex factors through the space of skew-symmetric matrices via pfaffians. We then show that this affine patch is isomorphic to a determinantal variety and give an explicit Groebner basis for the associated ideal which involves only polynomially many coordinates. Lastly, we show that sunlet networks with at least 6 leaves are identifiable using only these polynomials and run extensive simulations, which show that these polynomials can be used to accurately infer the correct network from sequence data. |

Supplementary Material

In Section 3 of our paper we propose a generating set \(G_n\) for our ideal \(I_n\) of phylogenetic invariants of the CFN model on a \(n\)-leaf sunlet network in our new coordinate system. We show that if \(G_n\) is a Groebner basis for the ideal which it generates for all \(n\) such that \(5 \leq n \leq 11\) then \(G_n\) is a Groebner basis for all \(n\). The file PfaffianPhyloNets.m2 contains \(\verb|Macaulay2|\) code which shows that this does indeed hold for \(5 \leq n \leq 11\) which proves Lemma 3.7 of our paper.

In Section 4 we demonstrate how to use our results to perform sunlet network inference. For this section we created two Python scripts, evaluate_pfaffians.py and simulateSunletSequenceK3P.py. These both require python version 3.8 or later, and the python packages scikit-bio and mpmath. Both packages can be installed via pip

pip install scikit-bio
pip install mpmath

simulateSunletSequenceK3P.py

This script simulates a multiple sequence alignment generated from a leaf-labeled \(n\)-sunlet under the Kimura 3-parameter substitution model. Leaf-labelings are denoted \((x_1, x_2, \ldots, x_n)\), where the label \(x_1\) is assigned to the leaf below the reticulation vertex, and the remainder assigned to the leaves in a clockwise fashion around the sunlet. The script simulates sequences from the sunlet with leaf-labeling \((1,2,\ldots, n)\). Substitution parameters for each edge are randomly generated to be close to the matrix with 0.95 along the diagonal and off-diagonals distributed uniformly.

python simulateSunletSequenceK3P.py -n <number of leaves> -l <MSA length> -s <seed> -g <gamma parameter> -o <output filename>

        -l <MSA length>         Length of MSA to output
        -s <seed>               Integer value for seeding random numbers
        -g <gamma parameter>    Floating point value giving the probability of a site evolving along reticulation edge e in the network.
        -n <number of leaves>   Positive integer giving the number of leaves in the sunlet.
        -o <output filename>    Filename for output MSA in phylip format

Here, the gamma parameter is the proportion of sites simulated along one of the two trees displayed by the sunlet, and defaults to 0.5.

evaluate_pfaffians.py

This script takes an input multiple sequence alignment file from sequences generated on a sunlet network (such as those produced by simulateSunletSequenceK3P.py), and, using the methods developed in the paper, produces a score for each possible leaf-labeling of that sunlet. The labeling with the lowest score is presumed to be the sunlet that generated the data.

python evaluate_pfaffians.py -a <MSA file> -t <num threads>

        -a <MSA file>           Multiple sequence alignment file.
        -t <num threads>        Number of threads to use for parallel processing.

Example Usage

The following commands simulate a multiple sequence alignment from a 6-sunlet of length 1kbp, and then evaluate this sunlet using our method.

python /path/to/SimulateSunletSequenceK3P.py -n 6 -l 1000, -s 1 -o ./msa.phylip
python /path/to/evaluate_pfaffians.py -a ./msa.phylip -t 4

All of the above code can be found in the following zip file.


Credits
Project page created: 7/12/2023
Project contributors: Joseph Cummings, Elizabeth Gross, Benjamin Hollering, Samuel Martin, Ikenna Nometa
Software used: Macaulay2 (Version 1.22)
System setup used: MacBook Pro with macOS Ventura 13.5.2, Apple M2 Processor, Memory 16 GB
Corresponding authors of this page: Benjamin Hollering (benhollering@gmail.com) and Samuel Martin (Samuel.Martin@earlham.ac.uk)