CAP 5510/CGS 5166 Homework 2

CAP 5510 / CGS 5166: Homework 2

Due: April 13, 2015.

Part A: BLAST and PSI-BLAST.

As before, find the Sex determining region (SRY) gene from the human genome in the "Gene" database from the NCBI Entrez portal. Again, as before, click on "reference sequence detials" and find the protein sequence for this gene (click on ID that starts as "NP_"). Print out this sequence in FASTA format.

Q1: Print the SRY protein in FASTA format. Write down the RefSeq ID and the GI number of this protein.

Go to the NCBI BLAST page and click on "blastp".

Q2: Now run it through BLASTP, like you did in the first assignment and answer the following questions. Make sure you confine your search to the RefSeq database.

What scoring matrix and gap penalties were used?
What value of K and l were used for calculating the Expect scores for the gapped alignment (please note that there are two sets of these paramaters - one for ungapped and one for gapped alignments)? Where do these values come from?
The score shown in the program output is in units of "normalized bits" = [(l x raw score) - ln K] / ln2. The raw score is shown in parentheses. What are the units of the raw score (those of the BLOSUM62 matrix)? Calculate the raw score in bits from the "normalized bits" for the top three hits.
How many database sequences were searched?
The best hit must be a "self-hit". Is the alignment of the next highest scoring sequence significant and why? What condition should be tested to decide significance?
What was the lowest reported score in this search, and is this score significant? Why?

We are now going to try PSI-BLAST. Read about it by going to following tutorial (Click here). PSI-BLAST is a version of the BLAST algorithm that uses the results from an initial search for similar protein sequences to construct a type of scoring matrix that can then be used for additional rounds of searches, called iterations. The variability found in each column of the scoring matrix allows additional sequences that have different combinations of amino acids in the sequence positions to be found. The algorithm provides a rapid but less precise search than other methods because the scoring matrix produced is only approximate and includes most of the original query sequence. (Caution: The iterations can lead to more sequences being added that do not share a region in common with the original query sequence, but share a totally different region in some of the added sequences; e.g., these new sequences are not true family members but foreigners.) The process will stop when no more sequences are found. The user can control the number of sequences to be included at each iteration or else use the score cutoff recommended by the program. The method is often used to perform a rapid and preliminary search for members of a sequence family. The found sequences can then be multiply aligned by other better-defined methods.

Go to the NCBI BLAST page and click on "PSI-BLAST". Cut and paste the SRY protein into the PSI-BLAST form. Make sure you confine your search to the RefSeq database and not to the nr database. It will find you a large number of hits labeled "Results of PSI-BLAST iteration 1". After studying this page, click on "Run PSI-BLAST iteration 2". Inspect the results and click to run more iterations. In each iteration it spreads its net wider and finds close relatives of the close relatives.

Q3: After iteration 1, how many hits were "Sequences with E-value WORSE than threshold"? How many "New" hits did you get in iterations 2. Continue on for 4 more iterations and in each case, find the number of "New" hits? What was the E-value of the worst hit in each iteration? PSI-BLAST was designed by Altschul, Madden, Scahffer, and others. Go to the Entrez database browser and look for their publication on this topic. Instead of searching in the protein or nucleotide or genome database of Entrez, search in the publications database (PubMed). Type in the authors names and perform the search. The authors appear to have at least 3 publications together, of which two are on PSI-BLAST. Access these publications and read them.

Part B: Clusters of Orthologous Genes (COGs).

Clusters of Orthologous Groups of proteins (COGs) were delineated by comparing protein sequences encoded in complete genomes, representing major phylogenetic lineages. Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain. You can read up on the idea behind COGs by clicking here. Read the two articles (one from Science (1997) and one from BMC Bioinformatics (2003) referenced on the COG home page. COGs help to provide functional annotations for genes in an organism.

As you have done earlier, go to the "Gene" database at NCBI and find the information on gene "recA" in organism Escherichia coli (any strain). Look for "COG CLassification" on that page.

Q4: What is the COG number of the recA gene? Follow the link to the hyperlinked COG number that List the 25 functional categories listed in the COG database (they are specified by a single letter code). What letter code is the classification for recA? Summarize in a couple of sentences the rest of the information on the COG page for recA. How many genes/proteins in E. coli have COG classifications? Provide a table of how many proteins are classified under each of the 25 functional categories?

Part C: Multiple Alignment.

CLUSTALW is a widely used multiple sequence alignment tool. This assignment will expose you to the features and capabilities of this program.

Go to Entrez
You need to download the protein U1A from four different organisms (human, mouse, Xenopus laevis, and Drosophila melanogaster). For some of these organisms, you may find several proteins being found by Entrez. Pick the ones with the following REFSEQ accessions numbers: human - 4759156; mouse - 114052106; Xenopus laevis - 65181; Drosophila melanogaster - 17737284. You can verify that you have the correct sequences by looking in the GenPept Report for the sequences you picked. For each of the four sequences, open the FASTA report and copy them all into one file.
Go to CLUSTALW.
Type in your e-mail address. Either upload your file or paste your sequences into the appropriate box. Now run CLUSTALW.
Study the output that you get within a few moments.
Try the "JalView" option.
Go back to the CLUSTAL page, and try it again after changing some of the settings. Try a different substitution matrix. Also try different gap penalties.
JalView has an option to mail yourself the postscript version of the alignment. Try this option.
Go back to the CLUSTAL page, and change the "TREE TYPE" to "phylip". Look at the tree that is output.
Pick 4 more (new) sequences that are related to the U1A proteins (run it through BLAST and pick 4 hits with good E-scores). Now align all 8 of the sequences using the "TREE TYPE" of "phylip".

How long are the 4 sequences? What are the pairwise alignment scores?

Q5: What do the "*", ":", and the "." in the alignment indicate? Consult the substitution matrix values, if necessary. Were there any differences in the alignment when you tried two different substitution matrices (PAM and BLOSUM)?

Q6: What sequence formats are supported by CLUSTALW?

Q7: How can the tree information be interpreted? Print/Draw the tree that you obtained when you used the "TREE TYPE" of "phylip" with the 8 sequences that you aligned? If the labels show just the GI numbers, make sure you write in the names of the organisms next to it. What do the numbers in the tree information mean? You can download your own version of CLUSTAL (called ClustalX) from ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalX/ Download the appropriate version for your machine. Versions for other operating systems also exist.

Q8: Download ClustalX for your personal machine and write down the differences you see in the alignment (for the same input as before) from the Web-based ClustalW version you used above. You can visualize the tree, assuming you also downloaded the tree visualization components. Print out the resulting trees in one of the four formats.

Q9: What is BLASTZ? Write a short paragraph on what this tool can do for you. You may not find it on the BLAST page. You will need to do a web search for it.

Part C: Hidden Markov Models.

Here is a small alignment of 12 members of a DNA sequence family.

(column:  1234)
seq1      GATC
seq2      CTAG
seq3      GATC
seq4      CC-G
seq5      GATC
seq6      CC-G
seq7      GTAC
seq8      CG-G
seq9      GCGC
seq10     CTAG
seq11     GATC
seq12     CTAG

Suppose you were to build a profile HMM of this alignment. The profile has four match states; match state 1 is assigned to the symbols in column 1, etc.

Q10: Draw a profile HMM in terms of states (circles) and state transitions (arrows). Mark the transition probabilities on the edges. Make sure you remove all edges with zero transition probabilities. You need to use the "Learning Algorithm" we discussed in class for HMMs. Note that unless you remove states that have no probability of being reached from the "Begin" state, you will be unable to work out this problem by hand.

Q11: Calculate the emission probability parameters for A,C,G,T in match state 1 (column 1). Do a maximum likelihood estimate, i.e., ratio of the frequency of that character being emitted to the sum of frequencies of all the characters.

Q12: Using the above answer, calculate the "log odds scores" (equal to the log of the ratio of its emission probability to its background frequency) for A,C,G,T in match state 1. Assume that the expected background frequencies of A,C,G,T are each 0.25. Use log base two so your scores are in units of bits.

Q13: Column 3 has gap symbols which would be assigned to delete state 3. Calculate the scores (log_2 probabilities) for the match_2 -> match_3 state transition and the match_2 -> delete_3 state transition.

Q14: Calculate the HMM log odds score (in bits) for the sequence

 GAAG

and the sequence
GATC
Did you factor in all the relevant transition and emission probabilities for the calculcations above? Notice that columns 1-4 and 2-3 covary as if they are Watson-Crick base pairs. It would therefore seem that the sequence GAAG should not be a true member of the sequence family. (Hint: the score will be the sum of four emission log-odds probabilities and one state transition log probability, since all other state transitions have probability one in this case. Also, make the Viterbi assumption that the obvious alignment of the four symbols to the four match states is correct, so you do not need to sum over all possible paths.) Now recall the discussions we had in class about the disadvantages of HMMs for the next question.

Q15: Is the HMM a good model of the pairwise correlations? Comment on the limitations of the HMM model.