Adding host information to the Nostoc phylogeny

Now that I have obtained 496 Nostoc rbcX sequences (plus one outgroup) and used them to infer a reasonable phylogeny, all that is left is to assign host association information to the branches. This requires (a) parsing the sequence files to obtain host info for each sequence, (b) associating each non-redundant sequence in the tree with the host info for all identical sequences, and (c) displaying all of this information on the tree.

The first step turned out to be surprisingly straightforward. The sequences are so consistently named that I was able to apply a “simple” regular expression to extract the host info from the fasta headers of the sequences:

./ < Nostoc_rbcX.fa >Nostoc_rbcX_metadata.txt

This extracted the host genus and species names for all 459 lichen photobionts obtained from direct sequencing (though I had to edit a few of them to remove some extraneous information). The remaining 38 sequences were obtained from cultures (or directly from cyanobacterial colonies) and do not follow the same naming convention, even though a number of the cultures were derived from photobionts. Fortunately, most of these are from a single paper, so it is fairly easy to extract the info from the tables in the paper.
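
The extraction script itself is not reproduced here, so the following is only a minimal Python sketch of this kind of header parsing. It assumes GenBank-style headers of the form gi|number|db|accession.version| followed by a description containing a quoted 'Genus species cyanobiont' phrase. The description layout, the function name parse_hosts and the accession numbers in the example are all made up for illustration.

```python
import re

# Assumed header layout: >gi|<number>|<db>|<accession>.<version>| <description>
# Assumed description:   Nostoc sp. 'Genus species cyanobiont ...' (hypothetical)
HEADER_RE = re.compile(
    r"^>gi\|\d+\|\w+\|(?P<acc>\w+)\.\d+\|"  # accession, version number dropped
    r".*?'(?P<genus>[A-Z][a-z]+) (?P<species>[a-z-]+) cyanobiont"
)


def parse_hosts(fasta_lines):
    """Return (accession, genus, species) for every header matching the assumed pattern."""
    rows = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # skip sequence lines
        m = HEADER_RE.search(line)
        if m:
            rows.append((m["acc"], m["genus"], m["species"]))
    return rows


# Made-up records: the second header has no quoted host and is skipped,
# much like the culture-derived sequences that need manual curation.
example = [
    ">gi|123456|gb|XX000001.1| Nostoc sp. 'Peltigera membranacea cyanobiont 1' rbcX gene",
    "ATGCATGC",
    ">gi|123457|gb|XX000002.1| Nostoc punctiforme strain X rbcX gene",
]
rows = parse_hosts(example)
```

Headers that do not match are silently dropped in this sketch, so in practice it would be worth listing them separately to see which sequences still need host info added by hand.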

The next problem is adding multiple host names to the branches that represent groups of redundant sequences. I copied the info about redundant sequences from MetaPIGA into the file Nostoc_rbcX_groups.txt. The names in this file are unwieldy because they contain the entire fasta header for each sequence, so the first thing I will do is remove everything but the accession number from these names. This is a bit tricky because there are multiple names per line:

perl -p -e 's/gi\|/|/' Nostoc_rbcX_groups.txt | perl -p -e 's/\|\d+\|\w+\|(\w+)\.\d\|[^\|\n]+/$1, /g' > temp.txt

mv temp.txt Nostoc_rbcX_groups.txt
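
For readers who find the one-liner above hard to follow, here is a rough Python equivalent. It is a sketch that assumes each line of the groups file holds several gi|number|db|accession.version| headers, each followed by its description. The separator between names and the example accessions are assumptions. Unlike the Perl version, it collects the accessions rather than substituting in place, so it does not leave a trailing comma.

```python
import re

# Capture the accession from each gi|<number>|<db>|<accession>.<version>| header.
ACCESSION_RE = re.compile(r"gi\|\d+\|\w+\|(\w+)\.\d+\|")


def accessions_only(line):
    """Reduce a line of full fasta headers to a comma-separated list of accessions."""
    return ", ".join(ACCESSION_RE.findall(line))


# Made-up line with two headers; the ", " separator is an assumption.
line = "gi|111|gb|XX000001.1| Nostoc sp. A, gi|222|gb|XX000002.1| Nostoc sp. B"
cleaned = accessions_only(line)
```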

I will next add this group information to the metadata table:

cat Nostoc_rbcX_metadata.txt | ./ Nostoc_rbcX_groups.txt Nostoc_rbcX.nwk >temp.txt

mv temp.txt Nostoc_rbcX_metadata.txt

This script also reports whether each sequence is present in the tree, so that I can tell which sequence represents each group. MetaPIGA assigns short sequences that match several different longer sequences to multiple groups. I assigned each of these ambiguous sequences to the most common matching sequence type.
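
The rule for ambiguous sequences can be sketched as follows. This is only an illustration, not the script used here: it interprets "most common matching sequence type" as the group with the most members, breaks ties alphabetically, and uses a hypothetical dictionary layout (representative accession mapped to its member accessions) with made-up accession numbers.

```python
def assign_to_groups(groups):
    """Map each accession to exactly one group representative.

    groups: dict of representative accession -> list of member accessions.
    An accession listed in several groups goes to the largest one
    (ties broken alphabetically by representative).
    """
    membership = {}
    for rep, members in groups.items():
        for acc in members:
            membership.setdefault(acc, []).append(rep)

    assignment = {}
    for acc, reps in membership.items():
        assignment[acc] = min(reps, key=lambda r: (-len(groups[r]), r))
    return assignment


# Made-up example: XX000009 is a short sequence matching both groups.
groups = {
    "XX000001": ["XX000001", "XX000003", "XX000004", "XX000009"],
    "XX000002": ["XX000002", "XX000009"],
}
assignment = assign_to_groups(groups)
```

Note that the group sizes in this sketch include the ambiguous sequences themselves. Counting only unambiguous members would be an equally reasonable choice.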

Next, I add the host info to the tree for the unique sequences. The simplest way I could figure out to do this is to replace the accession numbers in the tree file with the host info. When multiple sequences from the same host species are identical, the total number is given in brackets after the host name. When multiple species from the same genus have identical sequences, the genus name is abbreviated after the first species.

 cat Nostoc_rbcX.nwk | ./ Nostoc_rbcX_metadata.txt >Nostoc_rbcX_host.nwk
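
The labelling rules described above can be sketched in Python as follows. This is a hedged illustration rather than the script used here: the function names, the square brackets, the comma separator, the alphabetical ordering of hosts and the example accessions are all assumptions. Labels are wrapped in single quotes because spaces, commas and brackets have special meaning in Newick.

```python
import re
from collections import Counter


def host_label(hosts):
    """Build one tip label from the hosts of all identical sequences in a group."""
    counts = Counter(hosts)
    seen_genera = set()
    parts = []
    for name in sorted(counts):
        genus, _, species = name.partition(" ")
        if genus in seen_genera and species:
            shown = f"{genus[0]}. {species}"  # abbreviate a repeated genus
        else:
            shown = name
            seen_genera.add(genus)
        if counts[name] > 1:
            shown += f" [{counts[name]}]"  # number of identical sequences
        parts.append(shown)
    return ", ".join(parts)


def relabel(newick, labels):
    """Replace tip names found in `labels` with quoted host labels."""
    def swap(match):
        label = labels.get(match.group(0).strip())
        if label is None:
            return match.group(0)  # leave unknown tips untouched
        return "'" + label.replace("'", "''") + "'"

    # A tip name directly follows "(" or "," and runs up to the next delimiter.
    return re.sub(r"(?<=[(,])[^(),:;]+", swap, newick)


# Made-up tree and host lists.
labels = {
    "XX000001": host_label(["Peltigera membranacea"] * 3 + ["Peltigera canina"]),
    "XX000002": host_label(["Nephroma arcticum"]),
}
tree = "(XX000001:0.1,(XX000002:0.2,XX000003:0.3):0.05);"
relabelled = relabel(tree, labels)
```

Tips without an entry in the label table (such as the culture-derived sequences before their host info is added) keep their accession numbers, which makes them easy to spot in the resulting tree.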

The resultant tree can be seen here. The table with host information about each sequence can be seen here. I will talk about the patterns revealed by this analysis in my next post.


About heathobrien

I am a genome scientist based in Bristol, England. I am currently studying the genetic basis of neuropsychiatric disorders, but I have also done research on iridescent plants, plant pathogens and lichens. I have continued my work on lichen-associated algae and cyanobacteria for my blog