Obtaining the sequences

Perhaps not surprisingly given my background, I will be starting with Nostoc photobionts. In my opinion, the most useful marker for this group is rbcX, so I will be starting there.

I have decided to use BLAST to obtain all sequences that are homologous to a reference sequence. This will allow me to catch sequences that have been mislabeled and/or sequences from organisms that have been misidentified. What better reference sequence to use than one from the original paper that used this marker? (All Perl scripts used below and the data files produced are included in the PhotobiontDiversity repository.)

I have a local copy of the nt database that was current as of 17 April, and I have set the $BLASTDB environment variable so that I can just specify the database name. I use custom tabular formatting that includes the lengths of the query and hit sequences:

blastn -task megablast -query Ncommune_rbcX.fa -db nt -evalue 1e-20 -outfmt '6 qseqid qlen sacc slen pident length mismatch gapopen qstart qend qframe sstart send sframe evalue bitscore' -out Nostoc_rbcX.bl

That returns 500 sequences, which is quite a few fewer than I get if I search for “Nostoc rbcX” in the NCBI nucleotide database (595).
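Each row of the tabular output is an alignment (HSP) rather than a sequence, so it helps to count rows and distinct subject accessions separately. A minimal sketch, assuming the 16-column format requested above; the function name hit_summary is made up for illustration and is not one of the repository scripts:

```shell
# hit_summary FILE: print the number of HSP rows and the number of distinct
# subject accessions (column 3, sacc) in a tabular BLAST file.
# Hypothetical helper, not part of the PhotobiontDiversity repository.
hit_summary() {
    awk '{ rows++; if (!($3 in seen)) { seen[$3]; subjects++ } }
         END { printf "%d rows, %d subjects\n", rows, subjects }' "$1"
}
```

Running hit_summary Nostoc_rbcX.bl gives both counts. A subject count of exactly 500 would match blastn's default -max_target_seqs value, in which case rerunning with a larger -max_target_seqs would show whether the hit list was cut off.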

Get the sequences:

cut -f3 Nostoc_rbcX.bl | sort | uniq > Nostoc_rbcX_acc.txt

blastdbcmd -db nt -entry_batch Nostoc_rbcX_acc.txt -out Nostoc_rbcX.fa

Check if there are any extremely long sequences in the output that will cause the alignment program to choke:

awk '{if ($4 > 1100) print $0; }' Nostoc_rbcX.bl

gi|2463296|emb|Z94892.1| 964 KC291407 2349 93.84 795 24 14 1 770 1 1179 1973 1 0.0 1173
gi|2463296|emb|Z94892.1| 964 CP003642 7003560 87.16 771 90 4 1 770 1 2391377 2390615 1 0.0 867
gi|2463296|emb|Z94892.1| 964 CP001037 8234322 96.72 488 16 0 283 770 1 5265137 5265624 1 0.0 813
gi|2463296|emb|Z94892.1| 964 CP001037 8234322 97.17 283 8 0 1 283 1 5264778 5265060 1 2e-131 479

There are three sequences over 1100 bp. One includes a full-length rbcL sequence, which isn’t going to cause too many problems. The other two are chromosome-sized and will need to be trimmed down:

FilterSeq.pl -i Nostoc_rbcX.fa -max 5000 -o Nostoc_rbcX_filtered.fa

GetSeq.pl Nostoc_rbcX.fa CP003642 | trunc.pl 2391377 2390615 >>Nostoc_rbcX_filtered.fa 

GetSeq.pl Nostoc_rbcX.fa CP001037 |trunc.pl 5264778 5265624 >>Nostoc_rbcX_filtered.fa 
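The coordinates passed to trunc.pl above can be read off the BLAST table by taking the lowest and highest subject position across all HSPs for each long hit. A rough sketch of that bookkeeping, assuming the column order used in the blastn command (slen in column 4, sstart and send in columns 12 and 13); hit_span is a hypothetical name:

```shell
# hit_span FILE [MIN_SLEN]: for every subject longer than MIN_SLEN (default
# 1100), print its accession followed by the lowest and highest subject
# coordinate seen across all of its HSPs. Hypothetical helper.
hit_span() {
    awk -v min="${2:-1100}" '
    $4 + 0 > min + 0 {
        s = $12 + 0; e = $13 + 0            # sstart, send
        lo = (s < e) ? s : e                # minus-strand hits have s > e
        hi = (s < e) ? e : s
        if (!($3 in first) || lo < first[$3]) first[$3] = lo
        if (!($3 in last)  || hi > last[$3])  last[$3]  = hi
    }
    END { for (a in first) print a, first[a], last[a] }' "$1" | sort
}
```

For the four rows shown earlier, this merges the two CP001037 HSPs into the single range 5264778 to 5265624, which is the range used in the trunc.pl call.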

Reverse-complement negative-strand hits:

./RevCom.pl Nostoc_rbcX_filtered.fa < Nostoc_rbcX.bl >Nostoc_rbcX_revcom.fa
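In tabular BLAST output a hit on the minus strand shows up as a subject start that is larger than the subject end. The sketch below lists those accessions so the result of the reverse-complement step can be checked by eye. It assumes sstart and send are columns 12 and 13, as in the blastn command above. It does not reproduce RevCom.pl, and minus_strand_accs is a hypothetical name:

```shell
# minus_strand_accs FILE: list the distinct subject accessions (column 3)
# that have at least one HSP on the minus strand, i.e. sstart > send.
# Hypothetical helper for checking, not a replacement for RevCom.pl.
minus_strand_accs() {
    awk '$12 + 0 > $13 + 0 { print $3 }' "$1" | sort -u
}
```

Calling minus_strand_accs Nostoc_rbcX.bl prints one accession per line. Among the long hits listed earlier, only CP003642 qualifies.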

Align the sequences with MAFFT:

mafft Nostoc_rbcX_revcom.fa >Nostoc_rbcX_aln.fa
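A quick sanity check on the alignment is that every record has the same length once gaps are included. This is a small sketch of such a check for a FASTA alignment; aln_widths is a made-up name and the check is not part of the original workflow:

```shell
# aln_widths FILE: print each distinct aligned length found in a FASTA
# alignment, followed by the number of records with that length.
# A proper alignment yields a single output line. Hypothetical helper.
aln_widths() {
    awk '/^>/ { if (n++) count[len]++; len = 0; next }   # close previous record
         { gsub(/[ \t\r]/, ""); len += length($0) }      # gaps count as columns
         END { if (n) count[len]++; for (w in count) print w, count[w] }' "$1" | sort -n
}
```

More than one line of output from aln_widths Nostoc_rbcX_aln.fa would suggest a truncated or corrupted record.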

Remove redundant sequences and trim ambiguous regions (this was done using MetaPIGA and saved as Nostoc_rbcX_trimmed.nex).

Replace loooong NCBI names with accession numbers:

perl -pi -e 's/^\s*gi\|\d+\|\w+\|(\w+)\.\d\|\S+/$1/' Nostoc_rbcX_trimmed.nex
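Since perl -pi edits the file in place, it can be worth previewing the substitution first. The sketch below mirrors the Perl pattern with sed and writes to standard output instead of modifying anything. It assumes a sed that supports -E (GNU and BSD both do), and preview_names is a hypothetical name:

```shell
# preview_names [FILE...]: apply the same renaming as the Perl one-liner
# (gi|number|db|ACCESSION.version|rest -> ACCESSION) but print the result
# rather than editing in place. Reads standard input when no file is given.
# Hypothetical helper; [|] and [.] are used to match the literal characters.
preview_names() {
    sed -E 's/^[[:space:]]*gi[|][0-9]+[|][[:alnum:]_]+[|]([[:alnum:]_]+)[.][0-9][|][^[:space:]]+/\1/' "$@"
}
```

Like the Perl version, this only matches single-digit version suffixes and needs at least one character after the final pipe, so names that fall outside that shape are left unchanged and are easy to spot in the preview.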

Convert to Phylip format:

ConvertSeq.pl -i Nostoc_rbcX_trimmed.nex -f phyext -o Nostoc_rbcX.phy

Run phylogenetic analysis using PhyML:

phyml -i Nostoc_rbcX.phy

The resulting tree was visualised using FigTree. The most divergent sequences belong to the genus Trichormus and can safely be removed from the dataset. Another sequence belonging to Cylindrospermum is much less diverged and will be used as an outgroup to root the Nostoc tree.
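The post does not say how the unwanted sequences were removed. One possible approach, assuming they are dropped from a FASTA file (which would then need to be realigned or reconverted), is to filter records against a list of accessions. The function name drop_seqs and the idea of a one-accession-per-line list file are both assumptions made for this sketch:

```shell
# drop_seqs LIST FASTA: print FASTA with every record removed whose header
# line contains one of the accessions in LIST (one accession per line).
# Matching is by substring, so an accession that is a prefix of another
# would remove both. Hypothetical helper.
drop_seqs() {
    awk 'FILENAME == ARGV[1] { if ($1 != "") drop[$1]; next }   # load the list
         /^>/ { keep = 1; for (a in drop) if (index($0, a)) keep = 0 }
         keep' "$1" "$2"
}
```

The filtered output goes to standard output, so the original file is left untouched and the two can be compared before anything is overwritten.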

Rerun phyml:

phyml -i Nostoc_rbcX.phy

This produces a reasonable-looking tree. All of the early-branching Nostoc strains are from cultures and presumably belong to different species from the N. commune/N. punctiforme lineage that participates in all known symbioses with lichens and plants. The latter lineage includes a large number of relatively closely related strains, except for one lineage that appears to be evolving much more rapidly. I checked the alignment for these taxa and there are no obvious problems with it, but this is something I will have to look into in more detail in the future.

This tree isn’t too interesting otherwise because there is no information about the host/ecology of any of the strains. Parsing out this information and displaying it will be the subject of my next post.


About heathobrien

I am a genome scientist based in Bristol, England. I am currently studying the genetic basis of neuropsychiatric disorders, but I have also done research on iridescent plants, plant pathogens and lichens. I have continued my work on lichen-associated algae and cyanobacteria for my blog.