COP 6936/CAP 6990 Homework 1

Due: October 1, 2002.

Using Entrez, SwissPROT, and BLAST.


A few years ago the tumor protein p53 was elected molecule of the year. p53 is associated with the regulation of cell growth, and is found to be mutated or inactivated in 60% of hereditary cancers. In this assignment we'll get exposure to some of the key bioinformatics tools and databases on the web by exploring p53.

Go to the Entrez database browser at the National Center for Biotechnology Information (NCBI). Search in the protein database for p53. You should get about 1613 hits. Modify your search to look for "p53 human" and you will get about 813 hits. Now modify your search to look for "p53[Protein Name] AND Human[Organism]" and you should get 5 hits. Each of the 5 entries has a "gi" number associated with it. Open each of them and look at the "source"/"SOURCE" to decide which is the correct one to study. When you click on a protein, Entrez displays it in "Default" (i.e., GenPept) format. This report tells where the database submission came from, and gives information about the protein, including citations to the scientific literature. The actual sequence of amino acids is at the bottom of the report.

Q1: Of the 5 hits, why would you eliminate the proteins with "gi" numbers 14993574, 14993572, and 237830 on the basis of the "source"?

Actually, the protein with gi 642241 is not the right one either.

Q2: How many amino acids are there in the human p53 protein?

View the other types of reports for this protein. The FASTA report is a very simple, machine-readable format consisting of a description line beginning with ">", in this case giving various names, a gi number, and an accession number (AAC12971), followed by the sequence of amino acids on one or more lines. Most bioinformatics analysis programs accept input sequences in FASTA format. The ASN.1 report is a more complex, structured, machine-readable format. This format is standard at NCBI. Try the "Graphics" and the "XML" formats for this protein.
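The FASTA layout described above is simple enough to read with a few lines of code. The following is a minimal sketch in Python, not the parser used by any NCBI tool. The function name parse_fasta and the example record (description line and sequence) are made up for illustration and are not the real p53 entry.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (description, sequence) pairs."""
    records = []
    header = None      # description line of the record being read
    chunks = []        # sequence lines collected for that record
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical record for illustration only (not a real database entry).
example = """>gi|0000000|gb|XXX00000.1| hypothetical example protein
MKTAYIAKQR
QISFVKSHFS
"""

for description, sequence in parse_fasta(example):
    print(description, len(sequence))
```

Saving the FASTA report from Entrez to a file and passing its contents to a function like this would let you check the sequence length you report in Q2.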

Now go back to the "GenPept" format, look under the entry for "DBSOURCE" and follow the link for accession "U94788.1". Now you have gone from the protein world to the nucleotide world. Click to see the GenBank report for this entry. This record describes the human gene that makes the p53 protein. Also look at it under the ASN.1 and FASTA formats. The DNA sequence for the gene is given at the bottom.

Eukaryotic genes consist of exons and introns. These are better viewed in the graphical display. View it under "Graphics" format. The thick light blue line represents the entire GenBank entry for this gene (the "OVERVIEW line"), and the other lines show the locations of key features in this gene. There is also a navigational icon on the right hand side to adjust the "Zoom level". Play with this to see it under different levels. The magenta rectangles are coding exons (CDS). These are parts of the gene that are translated by the ribosome into amino acids. The thinner dark blue rectangles are the exons in the mRNA sequence made by this gene. Note how they correspond to the coding exons, except that the last one extends further (this extension is the 3' untranslated region (UTR) of the mRNA) and there is an extra exon quite a distance before the first coding exon (this is part of the 5' UTR).

Q3: How many base pairs are there in this GenBank entry for the human p53 gene? How many coding exons? How many (mRNA) exons?

Zoom in to a region between bases (about) 11,000 and 13,000.

Q4: Does the first coding exon coincide with the second mRNA exon, or does it contain some part of the 5' UTR? What is the amino acid sequence produced by the first coding exon, and what bases of the 5' UTR are included in the corresponding mRNA exon, if any?

You can navigate back and forth using the left and right arrows. Go back to the GenBank entry for this gene. Several "conflicts" are mentioned. More importantly, several "variations" are listed in the GenBank entry. These variations correspond to alleles, i.e., some people have different versions of the gene. They are marked in mustard color in the Graphics entry. Identify at least one of these on your sequence. Look at the variation specified for location 12139.

Q5: What are the possible nucleotides at location 12139? Is it within a coding exon? Does this change imply a change in the translated amino acid? If so, what are the possible amino acids for this location? Repeat the process for location 12032.

(Optional reading.) To find out more about p53, you are encouraged to look at some of the papers cited on this protein. Abstracts can be retrieved via the PubMed links in Entrez, but there are a lot of them (thousands)! One good way to get a start researching a protein is to go to SWISSPROT, the best-curated protein sequence database.

Go to the SWISSPROT database. Go to the Advanced search page and search for "Description" P53 under "Organism" Human. Since you already know its length, it should be easy to locate P53_HUMAN. Click on it and go to the entry for the protein. After the list of 56 references on the protein, the comments fields tell us, among other things, that p53 acts as a tumor suppressor, and that its normal function is to stop cells from growing, or to make them die at the right time (apoptosis). When something goes wrong with p53, cells can grow in an uncontrolled manner, a hallmark of cancer. Scan to near the bottom of the record, and you will find a list of many mutations of the p53 gene that cause it to make a different amino acid at some position in the protein, making the person prone to getting cancer. These are usually SNPs (single nucleotide polymorphisms) that cause a substitution of one amino acid for another. Find the tumor-causing substitutions of R (arginine) at position 110.

Q6: What amino acid substitutions of the R at position 110 in the p53 protein are listed as involved in cancers? What SNPs might cause these?

To answer the last question, you will need to go back and find the three nucleotides in the p53 gene that form the codon that makes the R in position 110 in p53. Then you will have to look in a table of the genetic code (see Chapter 13 of Pevzner's text, or Chapter 2 of Gibas and Jambeck) to see what codons code for these other amino acids.
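The table lookup described above can also be automated. The sketch below builds the standard genetic code and lists every amino acid change reachable from a codon by changing a single base. The function name single_base_substitutions is made up for illustration, and the codon "CGA" is just one of the six arginine codons, chosen arbitrarily; it is an assumption, not necessarily the codon at position 110, which you still have to locate in the gene yourself.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
# (TTT, TTC, TTA, TTG, TCT, ...). "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def single_base_substitutions(codon):
    """Map each new amino acid reachable by one base change to the codons producing it."""
    original = CODON_TABLE[codon]
    result = {}
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            amino = CODON_TABLE[mutant]
            # Skip silent changes that leave the amino acid unchanged.
            if amino != original:
                result.setdefault(amino, []).append(mutant)
    return result


# Illustrative arginine codon only; find the real one in the GenBank entry.
print(single_base_substitutions("CGA"))
```

Comparing the output for the real codon with the substitutions listed in SWISSPROT shows which of them can be explained by a single-nucleotide change.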

SWISSPROT gives extensive cross-references to other databases, including GenBank and the mirror site at EMBL (European Molecular Biology Laboratory), PIR (Protein Information Resource), and PDB, the Protein Data Bank, a database of three-dimensional protein structures. We'll look at PDB later in the assignment. The fact that P53 has PDB links implies that the protein structure has been determined by crystallography or NMR methods.

For now, find the PFAM entry in the p53 SWISSPROT record and click on it. PFAM is a database of multiple alignments of related protein sequences. Sets of protein sequences that have evolved from a common ancestor are very useful in understanding and predicting aspects of protein structure and function. Read the description of the P53 family. Click on "Get alignment". (The default "Colored Alignment" view works quite well. "Jalview" is a Java tool to look at the multiple alignment, if you want to explore this further.) You see an alignment of 14 protein sequences. The fourth from last corresponds to P53_human; the rest are similar proteins from different organisms. Dots are inserted so that the corresponding amino acids from all 14 sequences line up in columns. Scan across, and note that some regions of the protein are more highly conserved than others. Multiple alignments will be discussed in class soon.

Find the arginine at position 110 of human p53 in this alignment. It is about 1/3 of the way through. (If you clicked on one of the substitutions for this amino acid on the SWISSPROT page, then you got a context which told you that QGSYGF precedes this arginine and LGFLH follows it. This is useful in checking if you have the right arginine.)
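Because the alignment contains gap characters, residue number 110 is generally not in column 110. One way to locate it is to count only the non-gap characters in the human row, as in this sketch. The function names, the start parameter (for an alignment row that begins at a residue other than 1, which is an assumption about how the family alignment is trimmed), and the toy alignment are all made up for illustration.

```python
GAP_CHARS = ".-"


def alignment_column(aligned_seq, residue_number, start=1):
    """Return the 0-based column holding the given residue.

    residue_number is 1-based in the full protein; start is the protein
    position of the first residue in this alignment row.
    """
    seen = start - 1
    for column, char in enumerate(aligned_seq):
        if char not in GAP_CHARS:
            seen += 1
            if seen == residue_number:
                return column
    raise ValueError("residue number lies outside this alignment row")


def column_residues(alignment, column):
    """Collect the character each sequence has in one alignment column."""
    return {name: row[column] for name, row in alignment.items()}


# Toy alignment, not taken from PFAM.
toy = {
    "seq_a": "MK..TAYR",
    "seq_b": "MKQLTAYK",
    "seq_c": "M...SAYR",
}
col = alignment_column(toy["seq_a"], 6)
print(col, column_residues(toy, col))
```

The flanking context (QGSYGF before, LGFLH after) remains the best check that the column found this way is the right one.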

Q7: What other amino acids occur in this position in the other organisms listed in this multiple alignment? (list them).

These amino acid substitutions presumably do not disrupt p53's function, since they are tolerated in these other organisms. However, the SWISSPROT file for human p53 lists 3 tumor-associated substitutions for position 110. Presumably these are disruptive.

Q8: (Optional) Are there amino acid properties that distinguish the (presumably) disruptive from the (presumably) non-disruptive substitutions? Which properties?

You will need to read some books on proteins (recommended reading: "Introduction to Protein Structure", by Branden and Tooze). Amino acids are broadly classified as hydrophobic, charged, or polar. The question asks whether anything can be said about the properties of the residues that occur in location 110.
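A small lookup table makes it easier to compare substitutions by class. The grouping below is one common textbook-style grouping and is an assumption for illustration; books differ on borderline residues such as glycine, histidine, cysteine, tyrosine, and tryptophan, so check it against your reading before relying on it. The names RESIDUE_CLASS and describe_substitution are made up.

```python
# One possible grouping of the 20 amino acids (one-letter codes).
# Assumption: borderline residues are assigned differently by some authors.
GROUPS = {
    "hydrophobic": "AVLIMFP",
    "charged": "DEKR",
    "polar": "STYHCNQW",
    "glycine (special case)": "G",
}

RESIDUE_CLASS = {
    residue: group for group, residues in GROUPS.items() for residue in residues
}


def describe_substitution(original, replacement):
    """Describe a substitution in terms of the classes of both residues."""
    return "{} ({}) -> {} ({})".format(
        original, RESIDUE_CLASS[original], replacement, RESIDUE_CLASS[replacement]
    )


# Arbitrary example pair, not an answer to Q8.
print(describe_substitution("D", "V"))
```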

Now repeat the previous two questions, but instead use the arginine at position 248 in the human p53 protein. You'll find that substitutions of this residue also make a person prone to cancer. In this case there are even more disease-associated substitutions. This residue occurs about 2/3 of the way through the alignment. This region of the protein is highly conserved among the 14 species, and in particular, all proteins have arginine in this position. One might conclude that perhaps the residue in this position of the protein must be arginine for the protein to function and for the organism to be healthy.

In fact, if you only look at human p53 and its very close orthologs (corresponding proteins in different species, presumably descended from a common ancestor protein), it seems like the whole sequence of residues MCNSSCMGGMNRRP (and more) is completely conserved. For many years people used the conserved "motif" MCNSSCMGGMNRRP that occurs in p53 at this place as a signature sequence of p53, searching for this string in proteins from other organisms to find orthologs of p53. This string produced few false positives (proteins that have this motif in them but are not orthologs of p53) and few false negatives (proteins that are orthologs of p53 but do not have this motif in them).

However, looking at the more distant members of the family in this alignment, in particular the first sequence (Q27937), which is from a squid, we see that many positions in this motif can vary. Very few individual residues in a typical protein are absolutely essential, in that no substitutions exist that preserve function. Arginine 248 in human p53 may be one of them, but some of its adjacent amino acids certainly are not. In general, distantly related orthologs cannot be found by searching for "signature" sequences like this. Either the signature is too short, in which case you get too many false positives, or the signature is too long, in which case you get too many false negatives.
We'll look at the probability theory behind this in a future assignment.
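As a rough preview of the false-positive side of this trade-off, the sketch below estimates how many exact matches to a fixed string one would expect purely by chance. It assumes, unrealistically, that all 20 amino acids are equally likely and independent at every position, and it ignores edge effects at the ends of sequences. The function name and the database size of 10**8 residues are hypothetical choices for illustration.

```python
def expected_chance_matches(motif_length, database_residues, alphabet_size=20):
    """Expected number of chance exact matches to one fixed motif.

    Assumes uniform, independent residues: each start position matches
    with probability (1 / alphabet_size) ** motif_length.
    """
    return database_residues / alphabet_size ** motif_length


DATABASE_RESIDUES = 10 ** 8  # hypothetical database size

for length in (3, 5, 14):
    print(length, expected_chance_matches(length, DATABASE_RESIDUES))
```

Under these assumptions a 3-residue signature matches thousands of times by chance, while a 14-residue one essentially never does. The price of the longer signature is the false negatives discussed above, which this simple estimate does not capture.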

In the early days, Amos Bairoch, the designer of SWISSPROT, and his collaborators put a lot of effort into developing generalized "signature" motifs that allow particular substitutions in particular places in the motif, in hopes of finding motifs that would have no false positives or false negatives for a given protein family. The motif database they produced is called PROSITE.

Go back to the SWISSPROT p53 page and click on the PROSITE link. Study the entry. The PROSITE entry proposes the conserved motif M-C-N-S-S-C-[MV]-G-G-M-N-R-R as a signature motif for the p53 family. Its authors tested this pattern at the time this work was done, concluding that it found no false positives or false negatives. However, the database has grown considerably since then, as has our ability to locate likely orthologs. Later we'll see how hidden Markov models (HMMs) are a better way to define characteristic "patterns" that are present in protein families, which can then be used to find new members of that family. That is what the HMMs in PFAM are used for. Feel free to explore this aspect of PFAM further; we'll return to this in later assignments.
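A PROSITE pattern such as the one above maps naturally onto a regular expression: elements are separated by "-", square brackets list the residues allowed at a position, curly braces list residues that are not allowed, "x" stands for any residue, and a number in parentheses gives a repeat count. The sketch below handles only this subset of the syntax (for example, it rejects the "<" and ">" anchors); the function name prosite_to_regex and the test sequences are made up for illustration.

```python
import re

# One pattern element: a residue, x, [allowed] or {forbidden},
# optionally followed by (n) or (n,m).
ELEMENT = re.compile(r"([A-Zx]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unsupported pattern element: {!r}".format(element))
        core, low, high = match.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except these
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


P53_SIGNATURE = prosite_to_regex("M-C-N-S-S-C-[MV]-G-G-M-N-R-R")
print(P53_SIGNATURE)

# Made-up test strings, not real database sequences.
for seq in ("AAMCNSSCMGGMNRRPAA", "AAMCNSSCVGGMNRRPAA", "AAMCNSSCLGGMNRRPAA"):
    print(seq, bool(re.search(P53_SIGNATURE, seq)))
```

Running a pattern like this over the sequences in the PFAM alignment is a quick way to see which family members the signature would miss.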

Now go back to the SWISSPROT record for human p53 and find the list of PDB entries. Click on the ExPASy link for 1TSR. This gives access to information about the structure of the p53 protein from PDB. PDB is the Protein Data Bank, a repository of protein structures solved by X-ray crystallography or by NMR. Each solved structure has a 4-character identifier. This is the PDB record for 1TSR. This particular structure is p53 bound to DNA. Notice that 1TSR is the structure for the core DNA-binding domain of the protein (here defined as residues 102-292) bound to a piece of DNA. p53 is a DNA-binding protein that can influence another protein by binding in front of (i.e., on the 5' side of) its gene and thereby altering the way the gene for that other protein is transcribed, in this case by causing it to make more copies of the protein. In this structure, p53 is "caught in the act", so to speak.

Follow the ExPASy PDB link for 1TSR. (If you follow the MMDB entry and then the PubMed link, you will reach the 1996 Science paper that describes this structure. You can read the abstract online or go to the library to look at a copy of the paper.) Click on "Still image" to see a picture of the structure of p53.

Let's revisit the 2 protein sequences with gi numbers 642241 and 3041867 that we found when searching for p53 in the Entrez database browser. We are going to try to align them using BLAST. Go to the BLAST homepage. Follow the link for Blasting 2 sequences. Choose the "blastp" program, and enter the GI numbers given above. Click on "Align". Print out the alignment output by BLAST.
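To get a feel for what a local alignment score measures, here is a minimal sketch of the classic Smith-Waterman dynamic program. This is not what BLAST runs: BLAST is a faster heuristic, and blastp scores residue pairs with a substitution matrix and separate gap-opening and gap-extension penalties. The simple match, mismatch, and gap values below are assumptions chosen for illustration, as are the function name and the test strings.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (Smith-Waterman, linear gap cost)."""
    prev = [0] * (len(b) + 1)   # scores for the previous row of the table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diagonal, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Made-up strings sharing the segment MCNSS.
print(local_alignment_score("XXMCNSSYY", "QQMCNSSWW"))
```

Two sequences that share only a short stretch still get a positive local score, which is worth keeping in mind when you interpret the BLAST output for Q9.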

Q9: Now can you explain why we decided to eliminate gi 642241?

Go to the GenPept entry for gi 3041867. Click on "Links" on the right hand side of the page. From the pulldown menu, click on "BLink". This is essentially what you would get if you did BLAST using this protein sequence.

Go to the Standard Protein-Protein Blast page. Cut and paste a FASTA formatted version of p53 (gi 3041867) into the "Search" box. Then "Blast" it. When it returns, format it so that the Number of Descriptions is limited to 15, and the alignment view is "Pairwise". Study the pairwise alignment printed out.

This homework was based on a homework designed by David Haussler for his course on Bioinformatics at University of California, Santa Cruz.