A preliminary look at host association patterns in Trebouxia

Having gone through the steps to build a phylogeny of the most common lichen photobiont, Trebouxia, in my last post, I will now discuss the host association patterns that it reveals. Here is the Trebouxia ITS tree generated previously:

Trebouxia ITS phylogeny. Major clades are differentially coloured and named according to authentic strains

I’ve coloured all of the taxa within clades according to the colours of the named strains and I’ve assigned unique colours to each clade that does not contain named strains. I have not attempted to break up T. jamesii or T. impressa into sub-clades, though doing so would probably be justified. This will be a topic for a future post. I should also point out that T. jamesii is referred to as T. simplex in some papers.

In contrast to the Nostoc photobionts, where the FASTA headers were consistently labeled with the host information, these sequences are …not. I used a BioPerl wrapper to NCBI’s E-utilities interface to download GenBank-format sequences and parsed them to extract host association information from the “host”, “note” and “isolation source” annotations. I also extracted information about the author of each sequence and where it was published while I was at it:

../Scripts/GetGB.pl Trebouxia_ITS_acc.txt heath.obrien-at-gmail-dot-com > Trebouxia_ITS.gbk

../Scripts/ParseHost.pl Trebouxia_ITS.gbk > host_info.txt

../Scripts/GetRef.py Trebouxia_ITS.gbk >ref_info.txt
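For readers who want to see roughly what the host-parsing step involves, here is a minimal Python sketch. It is not the ParseHost.pl script itself (that script is not reproduced here); the function name `parse_genbank_hosts` and the sample record are hypothetical, and the sketch assumes the standard GenBank flat-file layout in which qualifiers appear as `/host="..."`, `/note="..."` and `/isolation_source="..."`.

```python
import re

def parse_genbank_hosts(text):
    """Return {accession: {qualifier: value}} for host-related qualifiers.

    Only the first occurrence of each qualifier in a record is kept.
    """
    wanted = ("host", "note", "isolation_source")
    records = {}
    # Records in a GenBank flat file are terminated by a line containing "//"
    for record in text.split("\n//"):
        acc = re.search(r"^ACCESSION\s+(\S+)", record, re.MULTILINE)
        if not acc:
            continue
        info = {}
        for key in wanted:
            # Qualifier values are quoted and may wrap over several lines
            match = re.search(r'/%s="([^"]*)"' % key, record)
            if match:
                info[key] = re.sub(r"\s+", " ", match.group(1)).strip()
        records[acc.group(1)] = info
    return records

# Hypothetical record, for illustration only
sample = '''LOCUS       XX000001
ACCESSION   XX000001
FEATURES             Location/Qualifiers
     source          1..600
                     /organism="Trebouxia sp."
                     /host="Letharia vulpina"
                     /isolation_source="lichen
                     thallus"
ORIGIN
//
'''
print(parse_genbank_hosts(sample))
```

Records that carry none of the three qualifiers come back with an empty dictionary, which makes it easy to list the sequences that still need to be filled in by hand.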

This information was added to Trebouxia_ITS_metadata.txt in Excel and missing values were filled in manually where possible. I also added culture collection numbers where available and started to fill in locality information, but I haven’t gotten very far with the latter.

Next, I added information about which clade each sequence fell into to the Trebouxia metadata file. The ETE python toolkit was invaluable for this step and was my main proximate motivation for switching from perl to python for my scripting, but I was also really, REALLY tired of having to keep my ‘\@’s and ‘%{$’s straight:

../Scripts/GetClades.py -t Trebouxia_ITS.nwk -m Trebouxia_ITS_metadata.txt >temp

mv temp Trebouxia_ITS_metadata.txt

In this case, the information was added to the metadata file automatically.

Lastly, I wrote a script to count the number of times that Trebouxia from each clade was associated with each lichen genus that has been sampled:

../Scripts/CountAssociations.py -m Trebouxia_ITS_metadata.txt > AssocaitionCounts.txt
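The counting step is simple enough to sketch in a few lines of Python. This is an illustration rather than the CountAssociations.py script: it assumes a tab-delimited metadata file with a header row, and the column names `clade` and `host` are placeholders for whatever the real file uses.

```python
import csv
from collections import Counter
from io import StringIO

def count_associations(handle, clade_col="clade", host_col="host"):
    """Count (photobiont clade, host genus) pairs in a tab-delimited table."""
    counts = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        host = (row.get(host_col) or "").strip()
        clade = (row.get(clade_col) or "").strip()
        if not host or not clade:
            continue  # skip records with missing host or clade information
        genus = host.split()[0]  # the genus is the first word of the host name
        counts[(clade, genus)] += 1
    return counts

# Hypothetical rows, for illustration only
demo = StringIO(
    "accession\thost\tclade\n"
    "X1\tLetharia vulpina\tT. jamesii\n"
    "X2\tLetharia columbiana\tT. jamesii\n"
    "X3\tParmotrema tinctorum\tT. corticola\n"
    "X4\t\tT. impressa\n"
)
for (clade, genus), n in sorted(count_associations(demo).items()):
    print(clade, genus, n, sep="\t")
```

Counting by genus rather than by species keeps the table small enough to read at a glance, at the cost of hiding species-level specificity within a genus.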

After some fiddling with conditional formatting in Excel, these analyses produced this:


Counts of associations between Trebouxia clades and lichen genera. Colour coding matches the phylogeny

Before discussing the patterns, I should point out that these counts are of how many sequence records have been deposited in GenBank, not the number of specimens that have been sampled, as many authors deposit representative haplotype sequences rather than all of their data. Incorporating this information will change the counts dramatically in some cases.

The most extensively sampled genus is Letharia, which is exclusively associated with Trebouxia jamesii. In fact, there appears to be strong reciprocal specificity acting between species of Letharia and subclades within T. jamesii, a topic I would like to explore further in the future.

Eight other genera in the Parmeliaceae are also exclusively associated with T. jamesii photobionts, including Cetraria (76 sequences), Evernia (19 sequences), Flavocetraria (17 sequences), Hypogymnia (12 sequences) and Pseudevernia (10 sequences). However, Parmelia (10 sequences), Flavoparmelia (6 sequences), and 4 other genera associate with photobionts in the T. impressa/T. gelatinosa clade (among others) and Parmotrema is highly specific for T. corticola photobionts, with 135 P. tinctorum photobionts grouping with T. corticola (the Trebouxia sp. 3 photobiont is from a different Parmotrema species). The P. tinctorum / T. corticola association is another case of reciprocal specificity, as 135 of 141 T. corticola sequences are from P. tinctorum photobionts.

All photobionts from Lasallia (28 sequences) and most from Umbilicaria (105 of 131 sequences) also grouped with T. jamesii, with most of the exceptions being specimens collected in the Antarctic (see also this paper). T. jamesii was also the predominant photobiont of Thamnolia (22 out of 28 sequences) and Chaenotheca (7 of 8 sequences). It was a common photobiont of Lecanora and Lecidea as well, but photobiont diversity in both of these genera, and in the Lecanoraceae in general, is extremely high, with Lecanora photobionts falling into 11 different species and Lecidea photobionts falling into 8. Indeed, 6 species of photobiont have been recovered from L. rupicola alone. Lichens in the Lecanoraceae are the hosts for the vast majority of T. asymmetrica, T. incrustata, T. showmanii and T. sp. 1 photobionts that have been sampled.

Similar to the Parmeliaceae, most of the genera in the Physciaceae are specific for the same Trebouxia lineage, while other genera do not associate with it at all. Physconia (40 sequences) and Phaeophyscia (7 sequences) are exclusively associated with T. impressa, as are 29 out of 36 Physcia sequences while only 2 of 5 Rinodina photobionts and none from Anaptychia (8 sequences) group with T. impressa.

All but 9 of 133 Xanthoria photobionts belong to T. decolorans or T. arboricola, which are sister species. These photobionts also predominate in most of the other genera in the Teloschistaceae, including 53 of 65 Caloplaca photobionts, 4 of 4 Huea photobionts, and 34 of 68 Tephromela photobionts, a genus that is also frequently associated with Trebouxia sp. 2 photobionts.

Specificity is also high for Ramalina, with 139 of 150 photobiont sequences restricted to two Trebouxia clades. Boreoplaca is associated with one of the same photobionts (Trebouxia sp. 2), but not with the most common Ramalina photobiont (T. decolorans; 116 of 150 sequences).

Thus, there is a wide range of association patterns, from extreme reciprocal specificity (135 of 141 T. corticola sequences associated with Parmotrema and all P. tinctorum specimens associated with T. corticola) to generalism (Lecanora rupicola associating with 6 different Trebouxia species). There is some evidence of phylogenetic inertia, as lichen genera from the same family are more likely to share similar photobiont association patterns than unrelated lichens, but there is also a lot of plasticity. There are a lot of ideas out there about the ecological and life history factors that could be causing these differences, but given the complexity of the patterns and our lack of knowledge of things like lichen demography and dispersal mechanisms, it will probably be some time before definitive explanations can be provided.
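One simple way to put numbers on "reciprocal specificity" is to look at the pairing from both sides: what fraction of the host's sequences fall in the photobiont clade, and what fraction of the clade's sequences come from that host. The sketch below uses the P. tinctorum / T. corticola counts quoted above (135 shared sequences, 141 T. corticola sequences in total, and all 135 P. tinctorum photobionts in T. corticola). The function name is made up for illustration, and this two-fraction summary is just one possible measure, not one used in the analysis itself.

```python
def reciprocal_specificity(shared, host_total, photobiont_total):
    """Return the pairing as a fraction of host and of photobiont sequences."""
    return shared / host_total, shared / photobiont_total

# Counts taken from the text: 135 P. tinctorum photobionts, all in
# T. corticola, out of 141 T. corticola sequences overall.
host_side, photobiont_side = reciprocal_specificity(135, 135, 141)
print(f"{host_side:.3f} {photobiont_side:.3f}")  # prints "1.000 0.957"
```

A generalist shows up as a low host-side fraction for every clade it associates with, while a clade shared among many hosts shows up as a low photobiont-side fraction.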

Heath O’Brien (2013). A preliminary look at host association patterns in Trebouxia PhotobiontDiversity.wordpress.com : http://dx.doi.org/10.6084/m9.figshare.711786

Posted in Green Algal Photobionts | Tagged | 1 Comment

Green Algal Photobionts: Trebouxia

Having beaten the phylogeny of symbiotic cyanobacteria into submission in my previous post, I am now tackling the green algae. My plan was to start with a big-picture analysis of 18S ribosomal RNA sequences, but my initial BLAST search returned over 10,000 454 reads from metagenomic projects, which was a lot more “environmental isolate XXX” than I felt like dealing with. Besides, I don’t know that I could add much to this recent overview. Therefore, I am going to focus on the most important lineage of lichenized algae: Trebouxia. There have been a large number of studies that have obtained photobiont ITS sequences from a variety of Trebouxia-associated lichens, so these are the data that I looked at.

The methods are the same as the ones that I described in detail previously for Nostoc ITS sequences. Briefly, I used two ITS sequences (T. impressa JN204819 and T. arboricola JQ993781) as queries to identify all homologous (E-value <= 1e-100) sequences in the nt database. Sequences were aligned with MAFFT, duplicate sequences were removed with MetaPIGA, alignment positions corresponding to gaps in the reference sequence (T. arboricola JQ993758) were removed with trimAl, and phylogenetic relationships were inferred with PhyML.

This procedure produced a tree with 794 taxa representing 1840 Trebouxia ITS sequences. The actual number of Trebouxia-associated lichens that have been sequenced is much higher than this because many authors only deposit representative sequences of each haplotype that they obtained. At some point I will dig into the papers where this has been done to extract the real numbers, but I have not done so yet.
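The gap between 794 tips and 1840 sequences comes from collapsing duplicate sequences into a single representative. As a rough illustration of the idea, the Python sketch below groups aligned sequences that are exactly identical. This is a simplification: MetaPIGA's own grouping also has to deal with sequences of different lengths, which this sketch ignores, and the function name is hypothetical.

```python
from collections import defaultdict

def group_identical(seqs):
    """Group sequence names by identical (case-insensitive) aligned sequence.

    seqs maps name -> aligned sequence. Returns a list of name lists,
    largest group first.
    """
    groups = defaultdict(list)
    for name, seq in seqs.items():
        groups[seq.upper()].append(name)
    return sorted(groups.values(), key=lambda g: (-len(g), g[0]))

# Toy alignment, for illustration only
toy = {"seq1": "ACGT-A", "seq2": "acgt-a", "seq3": "ACGTTA"}
print(group_identical(toy))  # [['seq1', 'seq2'], ['seq3']]
```

The first name in each group can serve as the tip label in the tree, with the rest of the group recorded in a separate file.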

For now, I am going to focus on the taxonomy of the algae. I will leave a discussion of the host-association patterns for a future post. Here is the Trebouxia ITS phylogeny color-coded by species (tree file can be found here):

Trebouxia ITS phylogeny

Trebouxia ITS phylogeny color-coded by species (dark green: T. jamesii, yellow: T. corticola, light green: T. incrustata, brown: T. asymmetrica, orange: T. gigantea, purple: T. gelatinosa, dark blue: T. impressa, light red: T. arboricola, light blue: T. decolorans, dark red: other, grey: T. sp.). Sequences recovered from multiple named species are in black. Black circles indicate aLRT support > 0.9

With a few exceptions, sequences from named algae tend to cluster very well. T. gelatinosa (purple) is nested within T. impressa (dark blue), though given the long branch separating these two species from all of the others, I don’t entirely trust the rooting of this clade. T. jamesii (dark green) is a very heterogeneous group as has been recognised previously. A number of photobionts that group with T. decolorans (light blue) have been identified as T. arboricola (light red). Three major lineages have no named members (except for some presumably misidentified T. decolorans sequences).

In addition to the differentially coloured species, there are several additional species names that are represented by a small number of sequences, all of which are colored dark red in the tree. T. australis, T. brindabellae, T. showmanii and T. usneae are each found in distinct clusters and are likely to represent additional good species. T. australis and T. brindabellae are both in clusters near the base of one of the T. jamesii clades (dark green). Two T. showmanii sequences form the sister group to T. incrustata (light green). T. usneae forms a distinct lineage, together with a misidentified T. corticola sequence, that is sister to the T. corticola lineage (yellow). All other rare species are deeply nested within other common species and appear not to be distinct. These include T. potteri, which is nested within T. impressa (dark blue), T. aggregata and T. crenulata, which are nested within T. arboricola (light red), and T. simplex, which includes six sequences that are identical to T. jamesii (black) and two other sequences that are nested within one of the T. jamesii clades (dark green). T. flava is identical to a T. impressa sequence and is coloured black in the tree.

In conclusion, >1840 Trebouxia ITS sequences that have been obtained from lichens cluster into about 24 distinct species, 13 of which appear to have suitable named representatives in the database. Two of the T. jamesii clusters have been given the provisional names T. “vulpinea” and T. “letharii” but it looks like at least three additional names are needed for this group.

That’s it for now. In my next post I will map host information onto this tree.

How to cite

Posted in Green Algal Photobionts | Tagged , , , | 1 Comment

An In-Depth Look at the Diversity of Symbiotic Nostoc

**Post has been updated with some corrections to the host information in the first phylogeny**

Today I am finally going to take a detailed look at the Nostoc phylogeny that I have been working on. But before I can begin, I have to figure out a way to highlight interesting taxa in an automated way. To do this, I wrote a script that adds html color tags after taxon names according to various classifications. While I was at it, I converted the branch support values to a binary system (≥0.9 vs. <0.9), which I can display as black circles on significantly supported branches. Note that this script requires that the tree be in NEXUS format rather than the plain Newick that is produced by PhyML. Opening the tree file in FigTree and saving it converts it to NEXUS, or the conversion could be scripted using BioPerl.
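The support-value conversion can be illustrated with a short Python sketch. It is not the ColourTree.pl script; it works on a plain Newick string and assumes that support values are written as numeric node labels immediately after a closing parenthesis, which is how PhyML output usually looks. The function name is hypothetical, and quoted taxon labels containing parentheses are not handled.

```python
import re

def binarize_support(newick, threshold=0.9):
    """Replace numeric node labels with 1 (>= threshold) or 0 (< threshold)."""
    def repl(match):
        value = float(match.group(1))
        return ")" + ("1" if value >= threshold else "0")
    # A support value is a number that directly follows a closing parenthesis;
    # branch lengths follow a colon and are therefore left untouched.
    return re.sub(r"\)(\d+(?:\.\d+)?)", repl, newick)

tree = "((A:0.1,B:0.2)0.95:0.05,(C:0.1,D:0.3)0.42:0.07);"
print(binarize_support(tree))
# ((A:0.1,B:0.2)1:0.05,(C:0.1,D:0.3)0:0.07);
```

With the labels reduced to 0 and 1, a tree viewer can be told to draw a node shape only where the label is 1.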

First, let’s compare lichen photobionts to other free-living and symbiotic Nostoc strains:

../Scripts/ColourTree.pl ../Nostoc_rbcX_metadata.txt ../Nostoc_rbcX_host.nwk host >host_tree.nwk

Nostoc rbcX phylogeny, coloured by type of association (purple: lichen photobionts, green: plant symbionts, blue: free-living, red: fungal endosymbiont). Names in black indicate genotypes found in more than one group. Circles on internal nodes indicate aLRT ≥0.9.

As mentioned last time, the earliest branching taxa are free-living Nostoc isolates, along with a culture isolated from Peltigera, which I suspect may not be a true photobiont. There are also other free-living strains throughout the rest of the phylogeny that have been identified as N. edaphicum, N. calcicola, N. commune, N. muscorum and N. flagelliforme. Cyanobacterial taxonomy is a mess, but that is a topic for another day. There are also symbionts from a variety of plant groups throughout the main crown group, including cycads (Cycas, Macrozamia and Encephalartos), bryophytes (Blasia and Anthoceros) and the angiosperm Gunnera. There are two cases where lichen photobionts are identical to plant symbionts (coloured black in the tree). Finally, there are two symbionts from Geosiphon pyriforme, a weird unicellular primitive fungus that hosts intracellular symbionts in sac-like, multinucleate cells (coloured red). There is some debate as to whether this symbiosis should be classified as a lichen or not.

Next, we can look at photobionts of different lichen families (the taxonomy of the lichen is based on that of the fungal partner):

../Scripts/ColourTree.pl ../Nostoc_rbcX_metadata.txt ../Nostoc_rbcX_host.nwk family >family_tree.nwk

Nostoc rbcX phylogeny, coloured by host family (purple: Stereocaulaceae, green: Lobariaceae, blue: Peltigeraceae, red: Collemataceae, yellow: Nephromataceae, brown: Pannariaceae). Names in black indicate genotypes found in more than one group or photobionts of lichens with uncertain taxonomic position (Massalongia). Names in grey indicate non-lichenized strains. Circles on internal nodes indicate aLRT ≥0.9.

At the deepest nodes in the tree, there is clearly a lot of host switching between different lichen families, but there is a lot of clustering of photobionts from the same lichen family at the tips of the tree. Photobionts of lichens in the Lobariaceae, Nephromataceae and Pannariaceae are all mixed up, which has been noted previously and has been proposed to reflect the ecological similarities of the hosts. There also appears to be a lot of historic photobiont sharing between lichens in the Peltigeraceae and the Collemataceae, but such sharing is not ongoing, as in all cases there are long branches separating photobionts of these families. Stereocaulon is the only genus represented in the tree that is not part of the Peltigerales, an order of lichens that are universally associated with Nostoc, either as the sole photosynthetic partner or as a secondary photobiont. It would be interesting to see if other non-Peltigeralean lichens also associate with such divergent Nostoc genotypes.

Lastly, let’s take a look at species-level patterns. There is a lot of host switching among members of the same genus, but there do appear to be some species that are highly specialised:

../Scripts/ColourTree.pl ../Nostoc_rbcX_metadata.txt ../Nostoc_rbcX_host.nwk specialists >specialist_tree.nwk

Nostoc rbcX phylogeny, coloured by host species. Names in grey indicate non-specialist hosts. Circles on internal nodes indicate aLRT ≥0.9.

With the current sampling, it is possible to identify four species of Leptogium and one each of Collema, Peltigera and Sticta that exclusively associate with a single cluster of photobionts, which is, in turn, exclusively associated with that lichen species (reciprocal specificity). Note that there is one P. malacea photobiont that falls out of the P. malacea cluster, but there are about four times as many specimens of this species as there are for any of the other specialists. As noted previously, these specialists predominate in the basal symbiotic Nostoc lineage.

There is a lot more that could be said about this tree, but I think I’ll leave it there for now. See this paper for a more detailed analysis of the complex photobiont specialisation patterns in Peltigera, including geographic patterns. On to the green algal photobionts in my next post…

How to cite

Posted in Cyanobacterial photobionts | Tagged | 4 Comments

Adding new sequences (Mistakes were made)

Looking through the tree produced in my last post, I noticed that several interesting sequences were missing from the tree. There are also fewer sequences in the tree than I get if I search for “Nostoc rbcX” in Entrez. It turns out that this is because BLAST+ limits the number of results returned to 500 by default. In retrospect, the fact that I ended up with exactly 500 sequences should have been a red flag. Fortunately, BLAST+ includes an option to exclude sequences from the results by GI number.
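A quick sanity check would have caught this earlier. The Python sketch below counts the distinct subject accessions in tabular BLAST output and flags a result set that has hit the default cap. The function names are hypothetical, the subject accession is assumed to be the third column (as in the custom format used in these posts), and the cap of 500 is the default described above. BLAST+ also has a `-max_target_seqs` option for raising the limit, which is an alternative to the exclusion-list approach used here.

```python
def count_subjects(lines, subject_col=2):
    """Count distinct subject accessions in BLAST tabular output."""
    subjects = set()
    for line in lines:
        fields = line.split()
        if len(fields) > subject_col:
            subjects.add(fields[subject_col])
    return len(subjects)

def hit_limit_reached(lines, cap=500):
    """True if the number of distinct subjects is at (or above) the cap."""
    return count_subjects(lines) >= cap

# Synthetic hits, for illustration only
fake_hits = [f"query 964 ACC{i:04d} 800" for i in range(500)]
print(hit_limit_reached(fake_hits))  # True
```

Counting distinct subjects rather than lines matters because one subject can contribute several lines when it has more than one HSP.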

First I will move all of the old files to a separate folder and make a new folder for the new files:

mkdir 130429 130502

mv Nostoc_rbcX* 130429/

Extract GIs from sequence headers and write to a file:

grep ">" 130429/Nostoc_rbcX.fa | perl -p -e 's/>gi\|(\d+)\|.*/$1/' >130429/Nostoc_rbcX_gis.txt

Repeat blast search using the -negative_gilist option:

blastn -task megablast -query Ncommune_rbcX.fa -db nt -evalue 1e-20 -outfmt '6 qseqid qlen sacc slen pident length mismatch gapopen qstart qend qframe sstart send sframe evalue bitscore' -negative_gilist 130429/Nostoc_rbcX_gis.txt -out 130502/Nostoc_rbcX.bl

This returns 430 additional results, but the sequence similarity starts to drop off rather quickly, with only 19 sequences with an Evalue of less than 1e-180. The top hits below this value are all Anabaena sequences, so it is probably safe to use this as a cutoff for now.
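Picking a cutoff like this amounts to asking which subjects have at least one hit at or below a given E-value. Here is a minimal Python sketch of that check, assuming the 16-column custom format used in the blastn commands in these posts (E-value in the fifteenth column, subject accession in the third). The function name is hypothetical, and the sample lines are the four long-sequence hits printed in the "Obtaining the sequences" post.

```python
def subjects_below(lines, cutoff, subject_col=2, evalue_col=14):
    """Return the set of subject accessions with a hit at or below cutoff."""
    kept = set()
    for line in lines:
        fields = line.split()
        if len(fields) <= evalue_col:
            continue  # skip blank or truncated lines
        if float(fields[evalue_col]) <= cutoff:
            kept.add(fields[subject_col])
    return kept

hits = [
    "gi|2463296|emb|Z94892.1| 964 KC291407 2349 93.84 795 24 14 1 770 1 1179 1973 1 0.0 1173",
    "gi|2463296|emb|Z94892.1| 964 CP003642 7003560 87.16 771 90 4 1 770 1 2391377 2390615 1 0.0 867",
    "gi|2463296|emb|Z94892.1| 964 CP001037 8234322 96.72 488 16 0 283 770 1 5265137 5265624 1 0.0 813",
    "gi|2463296|emb|Z94892.1| 964 CP001037 8234322 97.17 283 8 0 1 283 1 5264778 5265060 1 2e-131 479",
]
print(sorted(subjects_below(hits, 1e-180)))
# ['CP001037', 'CP003642', 'KC291407']
```

Sorting the full output by E-value and scanning the organism names around the candidate cutoff is still worth doing by eye, since the right threshold depends on where the non-target genera start to appear.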

Repeat blast search with a more stringent evalue cutoff:

blastn -task megablast -query Ncommune_rbcX.fa -db nt -evalue 1e-180 -outfmt '6 qseqid qlen sacc slen pident length mismatch gapopen qstart qend qframe sstart send sframe evalue bitscore' -negative_gilist 130429/Nostoc_rbcX_gis.txt -out 130502/Nostoc_rbcX.bl

Obtaining the sequences, trimming, reverse-complementing, and renaming proceeds as described previously.

Before adding the new sequences to the alignment, it will be helpful to remove the redundant sequences so that the groupings do not have to be recalculated in MetaPIGA:

cat 130429/Nostoc_rbcX_metadata.txt | Scripts/RemoveRedundant.pl 130429/Nostoc_rbcX_aln.fa >Nostoc_rbcX_nr_aln.fa

Add new sequences to the alignment:

mafft --add 130502/Nostoc_rbcX_filtered.fa Nostoc_rbcX_nr_aln.fa > Nostoc_rbcX_aln.fa

After applying automated filtering in MetaPIGA and building a tree as described previously, I got a tree that is very different from the one produced last time. Comparing the alignment files from the two analyses, 288 ambiguously aligned positions that were trimmed previously were left in this time. MetaPIGA uses the “gappyout” method from TrimAl for automated alignment trimming. This method appears to be highly sensitive to which sequences are included. In order to keep things consistent when I add new taxa, I am going to manually apply an exclusion set. Because the coordinates of the exclusion set may change as the dimensions of the alignment change, I selected a sequence without any insertions to act as a reference. I then used the gap positions in this sequence as an exclusion set for the “-select” method in the command line version of TrimAl.

Remove redundant sequences:

Open Nostoc_rbcX_aln.fa in MetaPIGA, then select Dataset -> “Check for Ambiguous sequences”. Copy the resulting groupings to 130502/Nostoc_rbcX_groups.txt and save the modified dataset as Nostoc_rbcX.nex

Trim ambiguous regions:

Scripts/GetExcluded.pl Nostoc_rbcX_aln.fa |pbcopy

trimal -in Nostoc_rbcX.nex -phylip -select `pbpaste`  >Nostoc_rbcX.phy

This uses the very cool pbcopy/pbpaste commands to pipe output from the first command to the clipboard so it can be pasted into the command line for the second command. These commands are OS X-specific, but there are similar commands available in other flavours of unix.
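The exclusion-set idea (drop every alignment column where the reference sequence has a gap) can be sketched in Python as follows. This is not the GetExcluded.pl script, whose output format is not shown here. The function names are hypothetical, and the sketch assumes that TrimAl's `-select` option takes zero-based column numbers and ranges inside braces, which is worth confirming against `trimal -h` for the installed version.

```python
def gap_columns(ref_aligned):
    """Zero-based alignment columns where the reference sequence has a gap."""
    return [i for i, char in enumerate(ref_aligned) if char == "-"]

def to_ranges(columns):
    """Collapse sorted column numbers into a string like '2-3,6,8-10'."""
    ranges = []
    for col in columns:
        if ranges and col == ranges[-1][1] + 1:
            ranges[-1][1] = col  # extend the current run
        else:
            ranges.append([col, col])  # start a new run
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)

def select_argument(ref_aligned):
    """Build a brace-delimited column list from the reference's gap positions."""
    return "{ " + to_ranges(gap_columns(ref_aligned)) + " }"

# Toy aligned reference sequence, for illustration only
print(select_argument("AC--GT-A---"))  # { 2-3,6,8-10 }
```

Because the columns are defined relative to the reference sequence, the same reference yields an equivalent exclusion set even after new sequences add columns to the alignment.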

Build tree:

phyml -i Nostoc_rbcX.phy

mv Nostoc_rbcX.phy_phyml_tree.txt Nostoc_rbcX.nwk

Next, I need to integrate the new groups of redundant sequences with the old set. This information also needs to be added to the Nostoc_rbcX_metadata.txt file. I started by parsing the fasta headers as before to extract host information, then modified the AddGroup.pl script to compare the new list of redundant sequence groups to the previous list and to add new sequences to the old groups if necessary:

Scripts/ParseHost.pl < 130502/Nostoc_rbcX.fa > 130502/Nostoc_rbcX_metadata.txt

 cat 130429/Nostoc_rbcX_metadata.txt 130502/Nostoc_rbcX_metadata.txt | Scripts/AddGroup.pl Nostoc_rbcX.nwk 130502/Nostoc_rbcX_groups.txt > Nostoc_rbcX_metadata.txt

Lastly, I added the host info from the metadata file to the tree and visualised it with FigTree:

Scripts/AddHost.pl Nostoc_rbcX_metadata.txt < Nostoc_rbcX.nwk >Nostoc_rbcX_host.nwk

The result is this tree. Because this is a more conservative exclusion set, the tree looks quite different from the one I produced previously, but if I reanalyse that dataset using this exclusion set, I get a very similar topology.

A few quick observations about this tree for now:

The most basal lichen photobiont sequence (DQ185264 from Peltigera didactyla) is from a culture and may represent an epiphyte or accessory symbiont, rather than the true photobiont.

The next branch includes all sequenced photobionts of Peltigera malacea, Leptogium lichenoides and Sticta hypochra, each of which forms a more-or-less distinct cluster:


The much-discussed Nephroma guild is not recovered as monophyletic but forms a paraphyletic grade at the base of the core photobiont lineage (I suspect that this may be an artifact of some kind):


There is one Stereocaulon photobiont, which is on an extremely long branch, that groups with four divergent Peltigera neopolydactyla photobionts, though there are other P. neopolydactyla photobionts throughout the tree:


There are plenty of other interesting observations to be made about this tree. More soon.

Posted in Cyanobacterial photobionts | Tagged | 2 Comments

Adding host information to the Nostoc phylogeny

Having obtained 496 Nostoc rbcX sequences (plus one outgroup) and used them to infer a reasonable phylogeny, all that is left is to assign host association information to the branches. This will require (a) parsing the sequence files to obtain host info for each sequence, (b) associating each non-redundant sequence in the tree with the host info for all identical sequences, and (c) displaying all of this information on the tree.

The first step turned out to be surprisingly straightforward. The sequences are so consistently named that I was able to apply a “simple” regular expression to extract the host info from the fasta headers of the sequences:

./ParseHost.pl < Nostoc_rbcX.fa >Nostoc_rbcX_metadata.txt

This extracted the host genus and species names for all 459 lichen photobionts obtained from direct sequencing (though I had to edit a few of them to remove some extraneous information). The remaining 38 sequences were obtained from cultures (or directly from cyanobacterial colonies) and do not follow the same naming convention, even though a number of the cultures were derived from photobionts. Fortunately, most of these are from a single paper, so it is fairly easy to extract the info from the tables in the paper.

The next problem is adding multiple host names to the branches representing redundant sequence groups. I copied the info about redundant sequences from MetaPIGA into the file Nostoc_rbcX_groups.txt. The names in this file are unwieldy because they contain the entire fasta header for each sequence. The first thing I will do is remove everything but the accession number from these names. This is a bit tricky because there are multiple names per line:

perl -p -e 's/gi\|/|/' Nostoc_rbcX_groups.txt | perl -p -e 's/\|\d+\|\w+\|(\w+)\.\d\|[^\|\n]+/$1, /g' > temp.txt

mv temp.txt Nostoc_rbcX_groups.txt
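The same clean-up can be sketched in Python, which some readers may find easier to follow than the chained substitutions. The sketch assumes NCBI-style headers of the form `gi|2463296|emb|Z94892.1|...` (the form seen in the BLAST output in these posts). The function name and the sample line are made up for illustration, and the exact layout of the MetaPIGA groups file is not reproduced here.

```python
import re

# gi|<number>|<database>|<accession>.<version>|
HEADER = re.compile(r"gi\|\d+\|\w+\|(\w+)\.\d+\|")

def accessions_in_line(line):
    """Return every accession number found in one line of a groups file."""
    return HEADER.findall(line)

# Hypothetical line with two full headers, for illustration only
line = ("gi|111|gb|DQ185264.1| Nostoc sp. strain A "
        "gi|222|emb|Z94892.1| Nostoc commune rbcX")
print(", ".join(accessions_in_line(line)))  # DQ185264, Z94892
```

Using `findall` sidesteps the problem of multiple names per line, because each header is matched independently of whatever text separates them.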

I will next add this group information to the metadata table:

cat Nostoc_rbcX_metadata.txt | ./AddGroup.pl Nostoc_rbcX_groups.txt Nostoc_rbcX.nwk >temp.txt

mv temp.txt Nostoc_rbcX_metadata.txt

This script also outputs information about whether each sequence is present in the tree so that I can tell which sequence represents each group. MetaPIGA assigns short sequences that match multiple different longer sequences to multiple groups. I assigned these ambiguous sequences to the most common matching sequence type.

Next, I add the host info to the tree for the unique sequences. The simplest way I could figure out to do this is to replace the accession numbers in the tree file with the host info. When multiple sequences from the same host species are identical, the total number is given in brackets after the host name. When multiple species from the same genus have identical sequences, the genus name is abbreviated after the first species.

 cat Nostoc_rbcX.nwk | ./AddHost.pl Nostoc_rbcX_metadata.txt >Nostoc_rbcX_host.nwk
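To make the labelling rules concrete, here is a small Python sketch that builds a label from the host names in one group of identical sequences. It is not the AddHost.pl script. The function name is hypothetical, the `(n)` style for counts is one reading of "in brackets", and the species are simply sorted alphabetically.

```python
from collections import Counter

def host_label(hosts):
    """Build a tip label from the host species of one sequence group."""
    counts = Counter(hosts)
    seen_genera = set()
    parts = []
    for species in sorted(counts):
        genus, _, epithet = species.partition(" ")
        if genus in seen_genera and epithet:
            name = f"{genus[0]}. {epithet}"  # abbreviate a repeated genus
        else:
            name = species
            seen_genera.add(genus)
        n = counts[species]
        parts.append(f"{name} ({n})" if n > 1 else name)
    return ", ".join(parts)

group = ["Peltigera malacea", "Peltigera malacea",
         "Peltigera membranacea", "Nephroma arcticum"]
print(host_label(group))
# Nephroma arcticum, Peltigera malacea (2), P. membranacea
```

One caveat: parentheses and commas are structural characters in Newick, so a label like this needs to be quoted or otherwise escaped before it is written into a tree file.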

The resultant tree can be seen here. The table with host information about each sequence can be seen here. I will talk about the patterns revealed by this analysis in my next post.

Posted in Cyanobacterial photobionts | Tagged | Leave a comment

Obtaining the sequences

Perhaps not surprisingly given my background, I will be starting with Nostoc photobionts. In my opinion, the most useful marker for this group is rbcX, so I will be starting there.

I have decided to use blast to obtain all sequences that are homologous to a reference sequence. This will allow me to catch sequences that have been mis-labeled and/or sequences from organisms that have been mis-identified. What better reference sequence to use than one from the original paper that used this marker? (All Perl scripts used below and data files produced are included in the PhotobiontDiversity repository)

I have a local copy of the nt database that was current as of 17 April and I have configured the $BLASTDB environment variable so that I can just specify the database name. I use custom tabular formatting that includes the lengths of the query and hit sequences:

blastn -task megablast -query Ncommune_rbcX.fa -db nt -evalue 1e-20 -outfmt '6 qseqid qlen sacc slen pident length mismatch gapopen qstart qend qframe sstart send sframe evalue bitscore' -out Nostoc_rbcX.bl

That returns 500 sequences, which is quite a few fewer than I get if I search for “Nostoc rbcX” in the NCBI nucleotide database (595).

Get the sequences:

 cut -f3 Nostoc_rbcX.bl |sort |uniq >Nostoc_rbcX_acc.txt

blastdbcmd -db nt -entry_batch Nostoc_rbcX_acc.txt -out Nostoc_rbcX.fa

Check if there are any extremely long sequences in the output that will cause the alignment program to choke:

awk '{if ($4 > 1100) print $0; }' Nostoc_rbcX.bl

gi|2463296|emb|Z94892.1| 964 KC291407 2349 93.84 795 24 14 1 770 1 1179 1973 1 0.0 1173
gi|2463296|emb|Z94892.1| 964 CP003642 7003560 87.16 771 90 4 1 770 1 2391377 2390615 1 0.0 867
gi|2463296|emb|Z94892.1| 964 CP001037 8234322 96.72 488 16 0 283 770 1 5265137 5265624 1 0.0 813
gi|2463296|emb|Z94892.1| 964 CP001037 8234322 97.17 283 8 0 1 283 1 5264778 5265060 1 2e-131 479

There are three sequences over 1100bp. One includes a full-length rbcL sequence, which isn’t going to cause too many problems. The other two are chromosome-sized and will need to be trimmed down:

FilterSeq.pl -i Nostoc_rbcX.fa -max 5000 -o Nostoc_rbcX_filtered.fa

GetSeq.pl Nostoc_rbcX.fa CP003642 | trunc.pl 2391377 2390615 >>Nostoc_rbcX_filtered.fa 

GetSeq.pl Nostoc_rbcX.fa CP001037 |trunc.pl 5264778 5265624 >>Nostoc_rbcX_filtered.fa 

Reverse-complement negative-strand hits:

./RevCom.pl Nostoc_rbcX_filtered.fa < Nostoc_rbcX.bl >Nostoc_rbcX_revcom.fa
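The truncation and reverse-complement steps boil down to the same small operation: cut out the region between the subject start and end coordinates, and flip it if the hit is on the minus strand. The Python sketch below illustrates this. It is not the trunc.pl or RevCom.pl script; it assumes one-based, inclusive BLAST coordinates and assumes that a subject start greater than the subject end marks a minus-strand hit.

```python
# Translation table for complementing unambiguous bases (N is left as N)
COMPLEMENT = str.maketrans("ACGTacgtNn", "TGCAtgcaNn")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def extract_hit(seq, sstart, send):
    """Cut out the hit region, reverse-complementing minus-strand hits."""
    low, high = min(sstart, send), max(sstart, send)
    region = seq[low - 1:high]  # convert one-based inclusive to a slice
    return reverse_complement(region) if sstart > send else region

# Toy sequence, for illustration only
toy_seq = "AAACCCGGGT"
print(extract_hit(toy_seq, 2, 5))  # AACC
print(extract_hit(toy_seq, 5, 2))  # GGTT
```

Other IUPAC ambiguity codes are not complemented in this sketch, so real data with ambiguous bases would need a fuller translation table.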

Align the sequences with MAFFT:

mafft Nostoc_rbcX_revcom.fa >Nostoc_rbcX_aln.fa

Remove redundant sequences and trim ambiguous regions (this was done using MetaPIGA and saved as Nostoc_rbcX_trimmed.nex).

Replace loooong NCBI names with accession numbers:

perl -pi -e 's/^\s*gi\|\d+\|\w+\|(\w+)\.\d\|\S+/$1/' Nostoc_rbcX_trimmed.nex

Convert to Phylip format:

ConvertSeq.pl -i Nostoc_rbcX_trimmed.nex -f phyext -o Nostoc_rbcX.phy

Run phylogenetic analysis using PhyML:

phyml -i Nostoc_rbcX.phy

The resulting tree was visualised using FigTree. The most divergent sequences belong to the genus Trichormus and can safely be removed from the dataset. Another sequence belonging to Cylindrospermum is much less diverged and will be used as an outgroup to root the Nostoc tree.

Rerun phyml:

phyml -i Nostoc_rbcX.phy

This produces a reasonable-looking tree. All of the early-branching Nostoc strains are from cultures and presumably belong to different species from the N. commune/N. punctiforme lineage that participates in all known symbioses with lichens and plants. The latter lineage includes a large number of relatively closely related strains, except for one lineage that appears to be evolving much more rapidly. I checked the alignment for these taxa and there are no obvious problems with it, but this is something I will have to look into in more detail in the future.

This tree isn’t too interesting otherwise because there is no information about the host/ecology of any of the strains. Parsing out this information and displaying it will be the subject of my next post.

Posted in Cyanobacterial photobionts | Tagged | 1 Comment

Welcome to PhotobiontDiversity

As with most organisms, DNA sequencing has revolutionised our understanding of the genetic diversity and phylogenetic relationships of lichen photobionts over the last two decades. However, unlike most organisms, these insights have rarely been translated to formal taxonomic changes and thus, no comprehensive system exists to organize the diversity that has been uncovered. Studies focus on different taxonomic scales, sequence different markers and use different analyses. Even when the methods are consistent, comparisons are rarely made to all related sequences in the database. Studies that do attempt a comprehensive analysis, such as this one, are hopelessly out of date by the time they are published.

The goal of this blog is to provide a real-time snapshot of the current state of knowledge of genetic diversity in lichen photobionts and related organisms. Over the coming weeks, I will be populating a repository with as many photobiont DNA sequences as possible, along with associated metadata. I will post phylogenies derived from the sequences here and provide commentary as appropriate. I will be updating the phylogenies as new data become available and highlighting interesting findings from the associated studies.

Stay tuned…

Posted in Introduction | Tagged | Leave a comment