Genome-scale sequence typing data investigating the Rhizopogon-Pseudotsuga ectomycorrhizal symbiosis
Dataset posted on 19.07.2019 by Alija Mujic, Bo Huang, Mingjun Chen, Pi-Han Wang, David Gernandt, Kentaro Hosaka, Joseph Spatafora
In this study we analyze the global phylogeography of the Rhizopogon-Pseudotsuga ectomycorrhizal symbiosis to investigate the potential for comigration of ectomycorrhizal fungi and their hosts. We have developed a novel data-mining technique, genome-scale sequence typing (GSST), which we have used to develop a phylogenetic dataset from unannotated low-coverage genome assemblies. GSST uses free and open-source software packages in conjunction with custom BioPerl scripts. Using this method we identified 989 single-copy protein-coding loci for 36 Rhizopogon taxa collected in association with Pseudotsuga host trees throughout the natural range of the hosts.
Nucleotide alignments were concatenated into a single super alignment, and phylogenetic analysis was carried out using the maximum likelihood algorithm implemented in RAxML (100 bootstrap replicates). Individual alignments were also analyzed in RAxML (100 bootstrap replicates), and the resulting phylogenetic trees were analyzed in MP-EST to infer the best species tree summarizing these gene trees. The concatenated analysis and the MP-EST analysis inferred very similar species trees, both of which support a single evolutionary origin of the Rhizopogon-Pseudotsuga symbiosis and comigration between ECM symbionts.
This dataset includes the nucleotide sequence alignments produced through GSST: the concatenated sequence alignment and the alignment files for the 963 loci included in the MP-EST analysis. The tree files inferred by RAxML and MP-EST are included for reference as well. A locus coordinate file in .gff3 format is also included, which can be used to directly inspect the nucleotide regions mined from the genome assemblies (assemblies are deposited at NCBI). Custom BioPerl scripts were written to process the output of software packages and to mine nucleotide data from low-coverage genome assemblies. These scripts are included here in their original format and are provided free to use under the Apache 2.0 software license.
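The concatenation of per-locus alignments into a super alignment can be illustrated with a minimal Python sketch. This is not the authors' BioPerl code. The function names `parse_fasta` and `concatenate`, the `locusN` labels, and the choice to pad taxa missing from a locus with gap characters are all assumptions made for illustration. The sketch also assumes that each per-locus alignment is FASTA text in which every sequence has the same length.

```python
def parse_fasta(text):
    """Parse FASTA text into a {name: sequence} dict (hypothetical helper)."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the taxon name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def concatenate(alignments):
    """Join per-locus alignments into one super alignment.

    alignments: list of {taxon: aligned_sequence} dicts, one per locus.
    Returns (supermatrix, partitions), where partitions is a list of
    (label, start, end) tuples with 1-based inclusive coordinates.
    Taxa absent from a locus are padded with '-' (an assumption of this sketch).
    """
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for index, aln in enumerate(alignments, 1):
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"locus {index} is empty or not aligned")
        length = lengths.pop()
        for taxon in taxa:
            pieces[taxon].append(aln.get(taxon, "-" * length))
        partitions.append((f"locus{index}", start, start + length - 1))
        start += length
    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions
```

The returned coordinate list records where each locus falls in the super alignment, which is the information a partitioned analysis would need.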
Notes on using these scripts are supplied in the usage notes within each script and also in the associated publication "Out of Western North America: evolution of the Rhizopogon-Pseudotsuga symbiosis inferred by genome-scale sequence typing". Please note that these scripts were written to take genome assembly files in FASTA format as generated by VELVET, output from FastOrtho in .end plain text format, and BLAST reports in .xml format as generated by legacy BLAST software from NCBI. The algorithms implemented in these scripts will work with other related input formats, and the author is happy to work with anyone who wishes to apply these scripts to their research.
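As an illustration of how a locus coordinate file in .gff3 format might be used to inspect regions of a FASTA assembly, the following Python sketch reads GFF3 feature lines and slices the matching region out of a contig. It assumes standard nine-column, tab-delimited GFF3 with 1-based inclusive coordinates. The function names and the `contigs` dictionary (contig name mapped to sequence) are hypothetical and are not taken from the dataset or from the authors' BioPerl scripts.

```python
# Translation table for complementing DNA (ambiguity codes other than N are left unchanged).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def parse_gff3(text):
    """Return a list of feature dicts from GFF3 text (hypothetical helper)."""
    features = []
    for line in text.splitlines():
        if line.startswith("##FASTA"):
            break  # an embedded sequence section follows; no more features
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        seqid, _source, ftype, start, end, _score, strand, _phase, attrs = cols
        features.append({
            "seqid": seqid,
            "type": ftype,
            "start": int(start),  # 1-based, inclusive
            "end": int(end),      # 1-based, inclusive
            "strand": strand,
            "attributes": attrs,
        })
    return features


def extract_region(feature, contigs):
    """Slice the feature's region out of its contig.

    contigs: {contig_name: sequence}, e.g. loaded from an assembly FASTA file.
    Minus-strand features are returned as the reverse complement.
    """
    contig = contigs[feature["seqid"]]
    # Convert 1-based inclusive GFF3 coordinates to a Python slice.
    region = contig[feature["start"] - 1:feature["end"]]
    if feature["strand"] == "-":
        region = region.translate(COMPLEMENT)[::-1]
    return region
```

Calling `extract_region` on each parsed feature yields the mined nucleotide region for that locus, provided the contig names in the coordinate file match the headers of the assembly being inspected.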