Q1: Print the SRY protein in FASTA format. Write down the RefSeq ID and the GI number of this protein.
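If you save the FASTA record to a file, the two identifiers asked for in Q1 can be read off the header line. The sketch below is only an illustration: it assumes the legacy pipe-delimited NCBI header layout (`gi|<number>|ref|<accession>|`), and the header used in the example is a made-up placeholder, not the real SRY record. Newer NCBI FASTA downloads may show only the accession, in which case the GI number has to be looked up on the record page instead.

```python
def parse_fasta_header(header):
    """Return (accession, gi) from a FASTA header line.

    Assumes the legacy NCBI layout '>gi|<number>|ref|<accession>| description'.
    If the header has no pipe-delimited fields, the first token is taken as
    the accession and gi is returned as None.
    """
    line = header.lstrip(">").strip()
    ident = line.split(None, 1)[0]      # identifier part, before the description
    fields = ident.split("|")
    accession, gi = ident, None
    # Fields come in (tag, value) pairs: gi|12345|ref|NP_000000.1|
    for i in range(0, len(fields) - 1, 2):
        tag, value = fields[i], fields[i + 1]
        if tag == "gi":
            gi = value
        elif tag == "ref":
            accession = value
    return accession, gi


# Hypothetical header, for illustration only (not a real record).
example = ">gi|12345|ref|NP_000000.1| hypothetical protein [Homo sapiens]"
print(parse_fasta_header(example))
```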
Go to the NCBI BLAST page and click on "blastp".
Q2: Now run it through BLASTP, as you did in the first assignment, and answer the following questions. Make sure you confine your search to the RefSeq database.
We are now going to try PSI-BLAST. Read about it by going to the following tutorial (Click here). PSI-BLAST is a version of the BLAST algorithm that uses the results from an initial search for similar protein sequences to construct a position-specific scoring matrix, which is then used for additional rounds of searching, called iterations. The variability found in each column of the scoring matrix allows additional sequences that have different combinations of amino acids at those sequence positions to be found. The algorithm provides a rapid but less precise search than other methods because the scoring matrix produced is only approximate and includes most of the original query sequence.
(Caution: The iterations can pull in sequences that share no region with the original query sequence and instead match a totally different region in some of the previously added sequences; i.e., these new sequences are not true family members but foreigners.) The process stops when no new sequences are found. The user can control which sequences are included at each iteration or else use the score cutoff recommended by the program. The method is often used to perform a rapid, preliminary search for members of a sequence family. The sequences found can then be multiply aligned by other, better-defined methods.
Go to the NCBI BLAST page and click on "PSI-BLAST". Cut and paste the SRY protein into the PSI-BLAST form. Make sure you confine your search to the RefSeq database and not to the nr database. It will find you a large number of hits labeled "Results of PSI-BLAST iteration 1". After studying this page, click on "Run PSI-BLAST iteration 2". Inspect the results and click to run more iterations. In each iteration it spreads its net wider and finds close relatives of the close relatives.
Q3: After iteration 1, how many hits were "Sequences with E-value WORSE than threshold"? How many "New" hits did you get in iteration 2? Continue for 4 more iterations and, in each case, find the number of "New" hits. What was the E-value of the worst hit in each iteration?
PSI-BLAST was designed by Altschul, Madden, Schäffer, and others. Go to the Entrez database browser and look for their publications on this topic. Instead of searching in the protein, nucleotide, or genome database of Entrez, search in the publications database (PubMed). Type in the authors' names and perform the search. The authors appear to have at least 3 publications together, of which two are on PSI-BLAST. Access these publications and read them.
Q4: Summarize in one paragraph the main improvements of their 2001 NAR paper over their earlier NAR paper from 1997. I suggest that you repeat one of the experiments that were reported in the second NAR paper from 2001 (say, the one using sequence GI:4982166). You do not need to report this in your homework.
CLUSTALW is a widely used multiple sequence alignment tool. This assignment will expose you to the features and capabilities of this program.
Q5: What do the "*", ":", and the "." in the alignment indicate? Consult the substitution matrix values, if necessary. What sequence formats are supported by CLUSTALW?
You can download your own version of CLUSTAL (called ClustalX) from ftp://ftp-igbmc.u-strasbg.fr/pub/ClustalX/ Download the appropriate version for your machine. Versions for other operating systems also exist.
column  1234
seq1    GATC
seq2    CTAG
seq3    GATC
seq4    CC-G
seq5    GATC
seq6    CC-G
seq7    GTAC
seq8    CG-G
seq9    GCGC
seq10   CTAG
seq11   GATC
seq12   CTAG
Build a profile HMM of this alignment with four match states; match state 1 is assigned to the symbols in column 1, etc.
Q6: Draw a profile HMM in terms of states (circles) and state transitions (arrows). Make sure you remove all edges with zero transition probabilities.
Q7: Calculate the emission probability parameters for A, C, G, and T in match state 1 by looking at the characters in column 1.
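The counting behind Q7 can be checked with a short script. This is a minimal sketch, not the method prescribed in class: it assumes emission probabilities are estimated as relative frequencies of the non-gap symbols in a column, and the optional `pseudocount` argument (an add-on not mentioned in the assignment) shows how the estimate changes if you smooth away zero probabilities.

```python
from collections import Counter

# The twelve aligned sequences from the assignment.
ALIGNMENT = [
    "GATC", "CTAG", "GATC", "CC-G", "GATC", "CC-G",
    "GTAC", "CG-G", "GCGC", "CTAG", "GATC", "CTAG",
]
ALPHABET = "ACGT"


def emission_probabilities(alignment, column, pseudocount=0.0):
    """Estimate match-state emission probabilities for a 0-based column.

    Gap symbols are skipped. With pseudocount=0 this is the plain relative
    frequency; a positive pseudocount is added to every symbol's count.
    """
    counts = Counter(seq[column] for seq in alignment if seq[column] != "-")
    total = sum(counts[s] for s in ALPHABET) + pseudocount * len(ALPHABET)
    return {s: (counts[s] + pseudocount) / total for s in ALPHABET}


# Match state 1 corresponds to column index 0.
frequencies = emission_probabilities(ALIGNMENT, 0)
smoothed = emission_probabilities(ALIGNMENT, 0, pseudocount=1.0)
print("relative frequencies:", frequencies)
print("with pseudocount 1:  ", smoothed)
```

Compare the printed values with your hand calculation; whether to use pseudocounts depends on the convention adopted in class.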
Q8: Column 3 has gap symbols which would be assigned to delete state 3. Calculate the scores (log_2 probabilities) for the match_2 -> match_3 state transition and the match_2 -> delete_3 state transition.
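The transition scores in Q8 follow from counting how many sequences that occupy match state 2 go on to a symbol versus a gap in column 3. The sketch below assumes no pseudocounts and treats every column as a match column (so no insert-state transitions are observed); both are simplifying assumptions, and the function name is just an illustrative choice.

```python
import math

# The twelve aligned sequences from the assignment.
ALIGNMENT = [
    "GATC", "CTAG", "GATC", "CC-G", "GATC", "CC-G",
    "GTAC", "CG-G", "GCGC", "CTAG", "GATC", "CTAG",
]


def transition_scores(alignment, col_from, col_to):
    """Return log2 transition probabilities out of the match state at col_from.

    Only sequences with a symbol (not a gap) at col_from are counted. A symbol
    at col_to counts as match -> match, a gap as match -> delete.
    Unobserved transitions get a score of -infinity.
    """
    in_match = [seq for seq in alignment if seq[col_from] != "-"]
    to_match = sum(1 for seq in in_match if seq[col_to] != "-")
    to_delete = len(in_match) - to_match
    scores = {}
    for name, count in (("match", to_match), ("delete", to_delete)):
        scores[name] = math.log2(count / len(in_match)) if count else float("-inf")
    return scores


# match_2 is column index 1; match_3 / delete_3 is column index 2.
scores = transition_scores(ALIGNMENT, 1, 2)
print(scores)
```

Scores are negative because they are logarithms of probabilities below 1; the rarer transition gets the more negative score.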
Q9: Go to weblogo and create a "logo" for this binding site. Paste the resulting image into the document you submit for this exam. Explain in a few sentences how to interpret this image. Next build a profile matrix (PSSM) for this alignment as discussed in class.
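To help interpret the logo, the sketch below computes the information content of each column (the total stack height in a sequence logo, in bits) and a simple log-odds profile matrix. It rests on several assumptions that may differ from what was done in class: gaps are ignored, the background is uniform (0.25 per base), a pseudocount of 1 is added when building the PSSM, and no small-sample correction is applied, so the heights WebLogo draws may be slightly lower.

```python
import math
from collections import Counter

# The twelve aligned sequences from the assignment.
ALIGNMENT = [
    "GATC", "CTAG", "GATC", "CC-G", "GATC", "CC-G",
    "GTAC", "CG-G", "GCGC", "CTAG", "GATC", "CTAG",
]
ALPHABET = "ACGT"


def column_counts(alignment, column):
    """Count A, C, G, T in one 0-based column, skipping gap symbols."""
    counts = Counter(seq[column] for seq in alignment if seq[column] != "-")
    return {s: counts[s] for s in ALPHABET}


def information_content(counts):
    """Bits of information: log2(alphabet size) minus the Shannon entropy."""
    n = sum(counts.values())
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values() if c > 0)
    return math.log2(len(counts)) - entropy


def pssm_column(counts, pseudocount=1.0, background=0.25):
    """Log2-odds score of each base against a uniform background."""
    n = sum(counts.values()) + pseudocount * len(counts)
    return {s: math.log2(((c + pseudocount) / n) / background)
            for s, c in counts.items()}


for col in range(4):
    counts = column_counts(ALIGNMENT, col)
    scores = {s: round(v, 2) for s, v in pssm_column(counts).items()}
    print(f"column {col + 1}: {information_content(counts):.3f} bits, PSSM {scores}")
```

A column with 2 bits is perfectly conserved, while a column near 0 bits is close to uniform; positive PSSM scores mark bases seen more often than the background predicts.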
Q10: Which chromosome is this gene on? Write down its coordinates according to the UCSC browser. Is it on the positive strand or the reverse strand?
Get used to the UCSC genome browser by navigating left and right and by zooming in and out. You can also change the view by clicking on "Configure" and choosing what you do or do not want to see in your browser.