Statistical Learning with Phylogenetic Network Invariants
Abstract
Phylogenetic networks describe the evolutionary history of sets of species believed to have undergone hybridization or gene flow during their evolution. The mutation process for a set of such species can be modeled as a Markov process on a phylogenetic network. Previous work has shown that a site-pattern probability distribution from a Jukes-Cantor phylogenetic network model must satisfy certain algebraic invariants. As a corollary, aspects of the phylogenetic network are theoretically identifiable from site-pattern frequencies. In practice, because sequence evolution is probabilistic, observed site-pattern frequencies rarely satisfy the phylogenetic network invariants exactly, even for data generated under the model. Thus, using network invariants to infer phylogenetic networks requires some means of interpreting the residuals, or deviations from zero, obtained when observed site-pattern frequencies are substituted into the invariants. In this work, we propose a method that uses invariant residuals and support vector machines to infer 4-leaf level-one phylogenetic networks, from which larger networks can be reconstructed. The support vector machine is first trained on model data to learn the patterns of residuals corresponding to different network structures, and is then used to classify the network that produced the data. We demonstrate the performance of our method on simulated data from the specified model, a network model that includes the multispecies coalescent process, and primate data.
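To make the notion of an invariant residual concrete, the following minimal Python sketch tabulates site-pattern frequencies from a 4-taxon alignment and substitutes them into a relation that vanishes in expectation. It is an illustration under stated assumptions, not the method of the paper: the function names and the toy alignment are hypothetical, and the relation used is only the label symmetry of the Jukes-Cantor model (two patterns that differ by a relabeling of nucleotides, such as AACC and CCAA, have equal probability), not one of the phylogenetic network invariants. Vectors of such residuals are the kind of features a support vector machine could be trained on; the classifier itself is not shown.

```python
from collections import Counter


def site_pattern_frequencies(alignment):
    """Return the relative frequency of each site pattern (alignment column).

    `alignment` is a list of equal-length strings, one per taxon.
    """
    n_sites = len(alignment[0])
    if any(len(seq) != n_sites for seq in alignment):
        raise ValueError("all sequences must have the same length")
    # zip(*alignment) iterates over columns, i.e. over site patterns.
    counts = Counter(zip(*alignment))
    return {"".join(pattern): c / n_sites for pattern, c in counts.items()}


def residual(freqs, pattern_a, pattern_b):
    """Residual of the toy relation p(pattern_a) - p(pattern_b) = 0.

    Assumption: the two patterns differ only by a relabeling of nucleotides,
    so their probabilities agree under a Jukes-Cantor model.
    """
    return freqs.get(pattern_a, 0.0) - freqs.get(pattern_b, 0.0)


def residual_vector(freqs, pattern_pairs):
    """Stack residuals for several relations into one feature vector."""
    return [residual(freqs, a, b) for a, b in pattern_pairs]


# Hypothetical toy alignment: 4 taxa, 5 sites.
toy_alignment = ["AACGT", "AACGT", "CCAGT", "CCAGA"]
toy_freqs = site_pattern_frequencies(toy_alignment)
toy_features = residual_vector(toy_freqs, [("AACC", "CCAA"), ("GGGG", "TTTT")])
```

Even when the underlying probabilities satisfy the relation exactly, a finite sample generally yields nonzero residuals, which is why some means of interpreting them is needed.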
Keywords: coalescent, statistical phylogenetics, phylogenetic networks, algebraic statistics
How to Cite:
Barton, T., Gross, E., Long, C., Rusinko, J. & Gross, E., (2026) “Statistical Learning with Phylogenetic Network Invariants”, Bulletin of the Society of Systematic Biologists 4(1). doi: https://doi.org/10.18061/bssb.5822