PhyloMatch 1.0

The quest for genes representing the genetic relationships of strains or individuals within populations, and their evolutionary history, is acquiring a new dimension of complexity with the advancement of next-generation sequencing (NGS) technologies. Sequencing an entire genome uncovers genetic variation in coding and non-coding regions and makes it possible to study Saccharomyces cerevisiae populations at the strain level. Nevertheless, the unfavourable cost-benefit ratio (the amount of detail disclosed by NGS against the time-consuming and expertise-demanding data assembly process) still precludes the application of these techniques to the routine assignment of yeast strains, making the selection of the most reliable molecular markers highly desirable. In this work we propose an original computational approach to discover genes that can be used as descriptors of the population structure. We found 13 genes whose variability can be used to recapitulate the phylogeny obtained from genome-wide sequences. The same approach, which proved successful in yeast, can be generalized to any other population of individuals, given the availability of high-quality genomic sequences and of a clear population structure to be targeted.

PhyloMatch is the package developed for the S. cerevisiae phylogenomics study published in Nucleic Acids Research (2012) by Ramazzotti et al. It is composed of several Perl scripts plus a configuration file. The main script, phylo_match.pl, is the one you launch; all the others are called by it.

In the script folder, create a folder named “genomes” containing FASTA-formatted files (one per organism under analysis) with the coding sequences of the genes.

The gene names in the header of each FASTA sequence must be shared across organisms, i.e. if a gene is called YOR133W in one organism, it must have the same name in all the files.

We set up a couple of scripts that help create such files, but you can use your preferred method. Just ensure that the gene names are shared.
Alternatively, the pre-formatted genomes are available upon request.
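
If you build the genome files yourself, a quick check of the header names before launching the analysis can save a failed run. The following is a minimal sketch, not part of PhyloMatch: the function names are hypothetical, and it assumes that the gene name is the first whitespace-delimited token after the ">" of each FASTA header.

```python
from pathlib import Path


def header_gene_names(lines):
    """Collect gene names from the header lines of one FASTA file.

    Assumption: the gene name is the first whitespace-delimited token
    after the '>' character; adapt this if your headers differ.
    """
    names = set()
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                names.add(tokens[0])
    return names


def unshared_genes(names_by_file):
    """Map each file name to the gene names that are not found in every file."""
    if not names_by_file:
        return {}
    shared = set.intersection(*names_by_file.values())
    return {name: sorted(genes - shared) for name, genes in names_by_file.items()}


if __name__ == "__main__":
    genomes = Path("genomes")  # the folder described above
    if genomes.is_dir():
        names_by_file = {}
        for path in sorted(genomes.iterdir()):
            if path.is_file():
                with open(path) as handle:
                    names_by_file[path.name] = header_gene_names(handle)
        for file_name, extra in unshared_genes(names_by_file).items():
            print(f"{file_name}: {len(extra)} gene name(s) not shared by all files")
```

A name reported here is not necessarily an error (a gene may simply be absent from one genome), but misspelled or differently formatted names, such as YOR133W in one file and yor133w in another, will show up in the list.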

1) PhyloMatch Download:

a) Download and save phylomatch.zip
b) Extract the phylomatch.zip file into your preferred destination folder
c) Create the genomes folder in which to place the formatted genomes to be analysed

2) Software required

a) Perl installed and in path
b) Phylip installed and in path
c) ClustalW installed and in path
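
To verify that the required programs are reachable from the command line, something like the sketch below may help. It is not part of PhyloMatch, and the executable names are assumptions: ClustalW is commonly installed as clustalw or clustalw2, and Phylip is a collection of separate programs (dnadist and neighbor are listed only as examples). Adjust the lists to your installation.

```python
import shutil

# Assumption: these executable names are examples only. Replace them with the
# names used on your system and with the Phylip programs you actually need.
CANDIDATES = {
    "Perl": ["perl"],
    "Phylip": ["phylip", "dnadist", "neighbor"],
    "ClustalW": ["clustalw", "clustalw2"],
}


def find_tools(candidates, which=shutil.which):
    """Return {tool: path of the first candidate found in PATH, or None}."""
    found = {}
    for tool, names in candidates.items():
        found[tool] = next((path for path in map(which, names) if path), None)
    return found


if __name__ == "__main__":
    for tool, path in find_tools(CANDIDATES).items():
        print(f"{tool}: {path or 'NOT FOUND in PATH'}")
```

The lookup function is passed in as a parameter so that the check can be tried out without the tools being installed.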

3) Perl modules to be installed

a) Bio::TreeIO
b) IO::String
c) Parallel::ForkManager
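
Whether these modules are already installed can be checked by asking perl to load each one (perl -MModule -e 1 exits with a non-zero status when the module cannot be loaded). The sketch below wraps that check; the function names are hypothetical and it is not part of PhyloMatch.

```python
import subprocess

# The three modules listed above.
MODULES = ["Bio::TreeIO", "IO::String", "Parallel::ForkManager"]


def check_command(module):
    """Build a perl command that exits with status 0 only if `module` loads."""
    return ["perl", f"-M{module}", "-e", "1"]


def missing_modules(modules, run=subprocess.run):
    """Return the modules that perl cannot load."""
    missing = []
    for module in modules:
        try:
            result = run(check_command(module), capture_output=True)
            ok = result.returncode == 0
        except FileNotFoundError:
            # perl itself was not found in PATH
            ok = False
        if not ok:
            missing.append(module)
    return missing


if __name__ == "__main__":
    missing = missing_modules(MODULES)
    print("Missing Perl modules:", ", ".join(missing) if missing else "none")
```

Note that Bio::TreeIO is distributed as part of BioPerl, so installing BioPerl should provide it.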

4) Basic usage (see the PhyloMatch documentation here)

a) Edit the phylo_match.conf file and adapt it to your needs. Then save it.
b) Run the analysis by typing perl phylo_match.pl
c) Prepare the target file containing the names to be matched
d) When the analysis is finished, launch the screening by typing perl phylo_screen.pl cluster_folder target_file

An example target file used in the NAR paper is available here.