Q1: Explain briefly how you would find it. If you type in RecA Ecoli as the search text, you may not find it. Although you may not use this information for the search, the primary accession number for this sequence is P03017. This information is merely for you to make sure that the protein you found is the correct one. Your friend also tells you that its GenBank accession number may be gi:72985. Next try to locate this protein with the accession number gi:72985 from the Entrez database browser and answer the following questions.
Q2: What is its correct gi number? How long is the protein sequence?
Q3: Now run it through BLASTP, as you did in the first assignment, and
answer the following questions.
We are now going to try PSI-BLAST. Read about it by going to the following
tutorial (Click here).
PSI-BLAST is a version of the BLAST algorithm that uses the results from an
initial search for similar protein sequences to construct a type of scoring
matrix that can then be used for additional rounds of searches, called iterations.
The variability found in each column of the scoring matrix allows additional
sequences that have different combinations of amino acids in the sequence
positions to be found. The algorithm provides a rapid but less precise search
than other methods because the scoring matrix produced is only approximate and
includes most of the original query sequence. (Caution: The iterations can
add sequences that do not share a region with the original query sequence
but instead match a different region of some previously added sequence;
i.e., these new sequences are not true family members but foreigners.)
The process stops when no new sequences are found.
The user can control the number of sequences to be included at each iteration
or else use the score cutoff recommended by the program. The method is often
used to perform a rapid and preliminary search for members of a sequence family.
The found sequences can then be multiply aligned by other better-defined methods.
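The scoring-matrix idea described above can be sketched in a few lines of Python. This is only a simplified illustration, not the actual PSI-BLAST procedure: it assumes the hits are already aligned without gaps, uses a uniform background frequency and a flat add-one pseudocount, and skips the sequence weighting that the real program applies. The names build_pssm, score_segment, and toy_hits are hypothetical and chosen for this example only.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_hits, pseudocount=1.0):
    """Build one log-odds column (in bits) per alignment position."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for pos in range(len(aligned_hits[0])):
        counts = Counter(seq[pos] for seq in aligned_hits)
        total = sum(counts[aa] for aa in AMINO_ACIDS) + pseudocount * len(AMINO_ACIDS)
        column = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        }
        pssm.append(column)
    return pssm


def score_segment(pssm, segment):
    """Sum the position-specific scores of an ungapped segment."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


# Hypothetical hits from a first search round, already aligned.
toy_hits = ["MKVL", "MRVL", "MKIL"]
pssm = build_pssm(toy_hits)

# Segments that resemble the hits score higher than unrelated ones.
for candidate in ("MKVL", "MRIL", "GGGG"):
    print(candidate, round(score_segment(pssm, candidate), 2))
```

In an iterative search, segments scoring above a chosen cutoff would be added to the hit list and the matrix rebuilt, which is also how unrelated "foreigner" sequences can creep in once a borderline hit has been accepted.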
First go to the GenBank page for MITF_MOUSE (gi|13124350|sp|Q08874|). Download it
in FASTA format and run it on PSI-BLAST (go to
BLAST and click on the PSI-BLAST option). Use the following changes to the
default values (or else you will get far too many hits). At the bottom of the page,
under the "Format" options, look for "Number of: Descriptions"
(pick 1000 here) and "Alignments" (pick 0 here). Under
"Limit results by entrez query or select from:", pick "Mus musculus".
Q4: You should have gotten about 40 hits. What default threshold was used for the E-value? How many of these sequences had an E-value BETTER than the default threshold? Run PSI-BLAST for two more iterations. How many more hits did you get?
PSI-BLAST was designed by Altschul, Madden, Schaffer, and others. Go to the Entrez database browser and look for their publication on this topic. Instead of searching in the protein, nucleotide, or genome database of Entrez, search in the publications database (PubMed). Type in the authors' names and perform the search. The authors appear to have at least 3 publications together, of which two are on PSI-BLAST. Access these publications and read them.
Q5: Summarize in one paragraph the main improvements of their 2001 NAR paper over their earlier NAR paper from 1997. I suggest that you repeat one of the experiments that were reported in the second NAR paper from 2001 (say, the one using sequence GI:4982166). You do not need to report this in your homework.
CLUSTALW is a widely used multiple sequence alignment tool. This assignment will expose you to the features and capabilities of this program.
Q1: What do the "*", ":", and the "." in the alignment indicate? Consult the substitution matrix values, if necessary. Were there any differences in the alignment when you tried two different substitution matrices (PAM and BLOSUM)?
Q2: What sequence formats are supported by CLUSTALW?
Q3: How can the tree information be interpreted? Can you draw the tree that you obtained when you used the "TREE TYPE" of "phylip" with the 8 sequences that you aligned? What do the numbers in the tree information mean? You can download your own version of CLUSTAL (called ClustalX) from ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalX/ Download the appropriate version for your machine. Versions for other operating systems also exist.
Q4: Download ClustalX for your personal machine and write down the differences you see in the alignment (for the same input as before) from the Web-based ClustalW version you used above. You can visualize the tree, assuming you also downloaded the tree visualization components. Print out the resulting trees in one of the four formats.
Q5: What is BLASTZ? Write a short paragraph on what this tool can do for you. You may not find it on the BLAST page. You will need to do a web search for it.
(column: 1234)
seq1     GATC
seq2     CTAG
seq3     GATC
seq4     CC-G
seq5     GATC
seq6     CC-G
seq7     GTAC
seq8     CG-G
seq9     GCGC
seq10    CTAG
seq11    GATC
seq12    CTAG
Suppose you were to build a profile HMM of this alignment. The profile has four match states; match state 1 is assigned to the symbols in column 1, etc.
Q1: Draw a profile HMM in terms of states (circles) and state transitions (arrows). You need to use the "Learning Algorithm" we discussed in class for HMMs. Note that unless you remove states that have no probability of being reached from the "Begin" state, you will be unable to work out this problem by hand.
Q2: Calculate the emission probability parameters for A,C,G,T in match state 1 (column 1). Do a maximum likelihood estimate, i.e., ratio of the frequency of that character being emitted to the sum of frequencies of all the characters.
Q3: Using the above answer, calculate the "log odds scores" (equal to the log of the ratio of its emission probability to its background frequency) for A,C,G,T in match state 1. Assume that the expected background frequencies of A,C,G,T are each 0.25. Use log base two so your scores are in units of bits.
Q4: Column 3 has gap symbols which would be assigned to delete state 3. Calculate the scores (log_2 probabilities) for the match_2 -> match_3 state transition and the match_2 -> delete_3 state transition.
Q5: Calculate the HMM log odds score (in bits) for the sequence GAAG
and for the sequence GATC.
Notice that columns 1-4 and 2-3 covary as if they are Watson-Crick
base pairs. It would therefore seem that the sequence GAAG
should not be a true member of the sequence family.
(Hint: the score will be the sum of four emission log-odds
probabilities and one state transition log probability, since all
other state transitions have probability one in this case.
Also, make the Viterbi assumption that the obvious alignment
of the four symbols to the four match states is correct, so
you do not need to sum over all possible paths.) Now recall
the discussions we had in class about the disadvantages of HMMs for
the next question.
Q6: Is the HMM a good model of the pairwise correlations? Comment on the limitations of the HMM model.
Q7: [Extra Credit] How can you modify the HMM model so that it recognizes the correlation between locations? It may help to first ignore the correlation between locations 2-3 and only assume that locations 1-4 have a correlation.
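If you want to check your hand calculations for Q2 through Q5, the arithmetic can be sketched in Python as below. This is a minimal sketch under stated assumptions, not a full profile HMM implementation: it uses maximum likelihood estimates with no pseudocounts, a uniform background frequency of 0.25, and the Viterbi assumption that an ungapped query follows the match states only. The function names are hypothetical, and the four-sequence alignment is a toy example, deliberately not the alignment from this assignment.

```python
import math

ALPHABET = "ACGT"
BACKGROUND = 0.25  # uniform background frequency, as the assignment assumes


def log2_safe(x):
    """log2(x), or -inf when x is zero (ML estimates can be zero)."""
    return math.log2(x) if x > 0 else float("-inf")


def emission_probs(column):
    """ML estimate: count of each base over all non-gap symbols in the column."""
    symbols = [s for s in column if s != "-"]
    return {base: symbols.count(base) / len(symbols) for base in ALPHABET}


def match_to_match_prob(prev_column, column):
    """Fraction of sequences in match state k-1 that also use match state k.

    One minus this value is the match-to-delete transition probability.
    """
    in_prev = [i for i, s in enumerate(prev_column) if s != "-"]
    stay = sum(1 for i in in_prev if column[i] != "-")
    return stay / len(in_prev)


def score_on_match_path(seq, alignment):
    """Log-odds score in bits: emission log-odds plus transition log probabilities."""
    columns = list(zip(*alignment))
    total = 0.0
    for k, sym in enumerate(seq):
        total += log2_safe(emission_probs(columns[k])[sym] / BACKGROUND)
        if k > 0:
            total += log2_safe(match_to_match_prob(columns[k - 1], columns[k]))
    return total


# Hypothetical toy alignment (NOT the homework alignment).
toy_alignment = ["ACGT", "AC-T", "TCGA", "ACGT"]

print(round(score_on_match_path("ACGT", toy_alignment), 3))
print(score_on_match_path("GCGT", toy_alignment))  # G never seen in column 1
```

Note that the score is a sum of independent per-column terms, so nothing in it depends on which symbol appeared in another column. That is worth keeping in mind for Q6 and Q7.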