A computer-implemented method and a computer system for identifying a phylogenetic tree from a plurality of biological sequences are provided. Each biological sequence is associated with a sampling date. First, the plurality of biological sequences is aligned and a distance matrix is obtained. Then, a subset of these sequences without any duplicated sequences is selected, and a directed graph representation of the subset of biological sequences is generated based on the associated sampling dates. Then, a minimum spanning tree is computed from the weighted directed graph representation. Then, in an iterative procedure, the sequences of unsampled evolutionary intermediates are inferred from mutation patterns that reflect the differences in sequence between the nodes in the minimum spanning tree. The new sequences are added, with associated time stamps, to the sequence set. Then, sets of identical sequences are removed, and an optimum branching is recomputed. This step is repeated until no new intermediates are found. In the final step, the sequences that were set aside in the initializing step are added to the plurality of sequences derived in the update step. From this plurality of sequences, an optimum branching is computed and identified as the phylogenetic tree.
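The initial steps described above (pairwise distances, removal of duplicated sequences, a time-directed graph, and a minimum-weight branching) may be illustrated by the following minimal sketch. It is not the claimed implementation. The function names `hamming` and `time_directed_branching` are hypothetical, Hamming distance on pre-aligned sequences stands in for the distance matrix, and edges are assumed to run only from a strictly earlier to a strictly later sampling date. Under that assumption the graph contains no cycles, so choosing the cheapest incoming edge for each node already yields a minimum-weight branching; a graph with cycles would instead require a general optimum branching algorithm such as Chu-Liu/Edmonds. The inference of unsampled intermediates is not shown.

```python
from datetime import date


def hamming(a, b):
    """Count differing positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b))


def time_directed_branching(samples):
    """Sketch of a minimum-weight branching on a time-directed graph.

    samples: dict mapping name -> (aligned_sequence, sampling_date).
    Returns a dict mapping child -> (parent, distance); a node with no
    earlier candidate parent is a root and maps to (None, 0).
    """
    # Keep the earliest-sampled representative of each distinct sequence;
    # later duplicates are set aside (assumption: ties broken by name).
    earliest = {}
    for name, (seq, _day) in sorted(samples.items(),
                                    key=lambda kv: (kv[1][1], kv[0])):
        earliest.setdefault(seq, name)
    unique = sorted(earliest.values(), key=lambda n: (samples[n][1], n))

    branching = {}
    for child in unique:
        c_seq, c_day = samples[child]
        # Candidate edges point from strictly earlier nodes to this node.
        candidates = [(hamming(samples[p][0], c_seq), p)
                      for p in unique if samples[p][1] < c_day]
        if candidates:
            dist, parent = min(candidates)  # cheapest incoming edge
            branching[child] = (parent, dist)
        else:
            branching[child] = (None, 0)
    return branching


example = {
    "A": ("ACGT", date(2001, 1, 1)),
    "B": ("ACGA", date(2002, 1, 1)),
    "C": ("TCGA", date(2003, 1, 1)),
    "D": ("ACGA", date(2004, 1, 1)),  # duplicate of B, set aside
}
tree = time_directed_branching(example)
```

In this toy input, sequence D duplicates B and is therefore excluded from the subset, while C attaches to B rather than to A because B is the closer earlier sequence.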
Amino acid changes repeatedly occurring on the internal branches of the obtained tree can be used to identify sequences and associated viral isolates suitable as vaccine strains for the following influenza season.
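One possible way to tabulate amino acid changes that recur on internal branches is sketched below. This is an illustration under stated assumptions, not the claimed procedure: the function name `recurrent_internal_changes`, the `min_count` threshold, and the input layout (a child-to-parent mapping plus aligned protein sequences for every node, including inferred intermediates) are all hypothetical. A branch is treated as internal when its child node has at least one descendant of its own, and a change is treated as recurring when the same substitution at the same position appears on at least `min_count` internal branches.

```python
from collections import Counter


def recurrent_internal_changes(parent, proteins, min_count=2):
    """Count substitutions that appear on several internal branches.

    parent:   dict mapping node -> parent node (None for a root).
    proteins: dict mapping node -> aligned amino acid sequence.
    Returns a dict mapping a change label such as "K1N" to its count.
    """
    # A node is internal if some other node names it as its parent.
    has_child = {p for p in parent.values() if p is not None}
    counts = Counter()
    for child, par in parent.items():
        if par is None or child not in has_child:
            continue  # skip roots and terminal (leaf) branches
        pairs = zip(proteins[par], proteins[child])
        for pos, (old, new) in enumerate(pairs, start=1):
            if old != new:
                counts[f"{old}{pos}{new}"] += 1
    return {change: n for change, n in counts.items() if n >= min_count}


toy_parent = {"root": None, "n1": "root", "n2": "root",
              "l1": "n1", "l2": "n2", "l3": "n2"}
toy_proteins = {"root": "KTS", "n1": "NTS", "n2": "NTA",
                "l1": "NIS", "l2": "NTG", "l3": "KTA"}
recurring = recurrent_internal_changes(toy_parent, toy_proteins)
```

In the toy data, the change from K to N at position 1 occurs on both internal branches and is therefore reported, whereas changes confined to terminal branches are ignored.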