CAP/CGS 5991 Homework 1

Due: February 23, 2006.

Using Entrez, GenBank, SwissPROT, and BLAST.


A few years ago the tumor protein p53 was elected molecule of the year. p53 is associated with the regulation of cell growth, and is found to be mutated or inactivated in 60% of hereditary cancers. In this assignment we'll get exposure to some of the key bioinformatics tools and databases on the web by exploring p53.

Go to the Entrez database browser at the National Center for Biotechnology Information (NCBI). Click on "GenBank", which should take you to the webpage for Entrez Nucleotide. Make sure you search in the protein database by clicking on "Protein", then search for "p53". In January 2005, this gave me 2910 hits. Modifying the search to "p53 human" still gave 1718 hits.

Now modify your search to look for "p53[Protein Name] AND Human[Organism]". This can be achieved as follows: Delete the phrase "p53 human" you typed earlier for the search. Then click on "Preview/Index", type p53, click on "Protein Name" and click on "AND". Next type human, click on "Organism" and click on "AND". This should enter the required phrase "p53[Protein Name] AND Human[Organism]" for the search. Now click on "Go" to launch the search. This still gave 231 hits. Refine it further by insisting that the "Gene Name" be "p53". That brings the number of hits down to 27, and we still need to find the correct one.

To narrow the search even further, we will try a different strategy. Go back to the Entrez database browser and click on "Gene", taking you to Entrez Gene. Now follow similar steps as before and search for "p53[Gene Name] AND Human[Organism]". This gives just one hit for a gene called "TP53" for "tumor protein p53 (Li-Fraumeni syndrome) [Homo sapiens]". Click on it.
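The fielded queries used above can also be assembled programmatically. The following is a minimal sketch, assuming NCBI's E-utilities `esearch` service as the endpoint (the handout itself only uses the web interface) and assuming the field tags shown above are still accepted. The helper names are made up for illustration. The code only builds the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumed base URL of NCBI's E-utilities search service (not named in the handout).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def fielded_query(terms):
    """Join (value, field) pairs with AND, e.g. p53[Protein Name] AND Human[Organism]."""
    return " AND ".join(f"{value}[{field}]" for value, field in terms)


def esearch_url(db, terms):
    """Build (but do not fetch) a search URL for the given Entrez database."""
    return ESEARCH + "?" + urlencode({"db": db, "term": fielded_query(terms)})


protein_terms = [("p53", "Protein Name"), ("Human", "Organism")]
query = fielded_query(protein_terms)
print(query)
print(esearch_url("protein", protein_terms))
```

Switching from the protein search to the gene search is then just a matter of changing the database name and the field tag.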

Q1: What chromosome is this gene on? What are its other aliases?

Genes can lie on the forward strand or the reverse strand. p53 lies on the reverse strand. The graphic display at the top of this page describes this pictorially. There is general information about the p53 gene, following which this page lists over 700 publications in PubMed that are related to this gene. After this, there is another long list of proteins that p53 interacts with. All of this indicates the critical role played by p53 and the amount of research that has gone into it. This is followed by all the functional processes that the p53 gene may be involved in. This annotation is provided by the Gene Ontology Consortium. Later this semester, we will learn more about GO. Under the Phenotypes section, you can find out how p53 is linked to a variety of diseases. Follow the link to one of them (say, Breast Cancer) and find out how p53 may be involved in the disease. Under Pathways information, you can find various processes in which p53 is involved.

Scroll down to the information on the p53 protein. There you will find links to the mRNA Sequence (NM_000546), the Source Sequences (AH002918, AH002919, U94788, X02469), and the protein Product (NP_000537). Let's first explore the link to the mRNA sequence. Click on "NM_000546". Study this GenBank entry. This report tells where the database submission came from, and gives information about the protein, including citations to the scientific literature. Note that the actual sequence of nucleotides is at the bottom of the report. Locate its "GI" number. After some preliminary information about the entry, there are over 700 references cited for this entry. This is unusual and indicates that this gene/protein has been researched extensively. Find the section titled "COMMENT". This provides a summary of information on the p53 gene. This is followed by a section on "FEATURES". Study this carefully, especially the nucleotide sequence at the end of this page. In the CDS section, you will also find the amino acid sequence of the corresponding protein (and a link to this protein product, NP_000537).

Q2: How many nucleotides does the mRNA sequence for human p53 have? What is its GI number?

Follow the link to NP_000537. After some initial information, this page too has the same set of references as in NM_000546. At the end of this page, you will find the amino acid sequence for p53.

Q3: How many residues does the protein sequence for human p53 have? What is its GI number? Does the entire mRNA sequence in NM_000546 code for this protein, or are there likely to be some untranslated regions in the mRNA? Why? If not, write down the portion of the mRNA sequence that codes for the protein. What was the stop codon in the mRNA sequence?
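To see how a coding region sits inside a longer mRNA, here is a small sketch on a made-up toy sequence (not the p53 mRNA). It assumes the standard genetic code and, as a simplification, takes the first ATG as the start codon; in a real GenBank entry the CDS feature gives the actual coordinates. The function name and the toy sequence are illustrative only.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def find_cds(mrna):
    """Translate from the first ATG to the first in-frame stop codon.

    Returns (start, end, protein, stop_codon) with 1-based inclusive
    coordinates, where end includes the stop codon, or None if no complete
    reading frame is found. Assumes the sequence contains only A, C, G, T/U.
    """
    seq = mrna.upper().replace("U", "T")
    start = seq.find("ATG")
    if start < 0:
        return None
    protein = []
    for i in range(start, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        residue = CODON_TABLE[codon]
        if residue == "*":
            return start + 1, i + 3, "".join(protein), codon
        protein.append(residue)
    return None


toy_mrna = "GGCATGGAGGAGCCGTGACCTT"  # hypothetical 22-nt sequence
print(find_cds(toy_mrna))
```

In the toy sequence, everything before the returned start coordinate plays the role of a 5' untranslated region and everything after the end coordinate plays the role of a 3' untranslated region.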

The p53 protein was displayed in "Default" or "GenPept" format. View the other types of reports for this protein. The FASTA report is a very simple, machine-readable format consisting of a description line beginning with ">", in this case giving various names, a gi number, and an accession number (NP_000537.2), followed by the sequence of amino acids on one or more lines. Most bioinformatics analysis programs accept input sequences in FASTA format. The ASN.1 report is a more complex, structured, machine-readable format. This format is standard at NCBI. Try the "Graph" and the "XML" formats for this protein.
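Because FASTA is so simple, it can be read with a few lines of code. The sketch below is one possible minimal reader, not the parser used by any particular tool; the two records are made up for illustration and are not real database entries.

```python
def parse_fasta(text):
    """Yield (description, sequence) for each record in FASTA-formatted text."""
    description, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                yield description, "".join(chunks)
            description, chunks = line[1:], []
        elif description is not None:
            chunks.append(line)  # sequence may span several lines
    if description is not None:
        yield description, "".join(chunks)


example = """>toy_record_1 hypothetical example, not a real entry
MKTAYIAKQR
QISFVK
>toy_record_2 another made-up record
MKT
"""

records = list(parse_fasta(example))
for description, sequence in records:
    print(description, len(sequence))
```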

Each entry in the GenBank database has several identifiers. Read the section on "SEQ-IDs: What's in a Name?" in Baxevanis/Ouellette.

Q4: Explain the following terms in about one sentence each: LOCUS NAME, ACCESSION NUMBER, GI NUMBER, RefSeqID.

Next we want to learn about the intron-exon structure of this gene. This information is missing from both the mRNA entry and the protein entry. Now go back to the original TP53 gene page in Entrez Gene. Under "Source Sequence", four other sequences were identified. If you follow that link, you will find a webpage that has four GenBank entries concatenated. It is possible to inspect each of the four entries separately. Now follow the link for accession "U94788", which is a nucleotide entry. As before, the DNA sequence for the gene is given at the bottom.

Eukaryotic genes consist of exons and introns. These are better viewed in the graphical display. View it under "Graph" format. The thick light blue line shows the region of the chromosome being considered here. The grey line represents the entire GenBank entry for this gene, and the thick red line shows the region whose nucleotide sequence is shown below. The other lines show the locations of key features in this gene. There is also a navigational icon on the right-hand side to adjust the "Zoom level". Play with this to see it under different levels. The dark blue rectangles are the exons in the mRNA sequence made by this gene. The magenta rectangles are coding exons. Only a part of the mRNA of this gene is translated by the ribosome into amino acid residues. Identify the portions of exons 1 and 2 that are not translated. These correspond to the 5' untranslated region (UTR) of the mRNA. Identify the portion of exon 11 that is not translated (this is part of the 3' UTR).

Q5: How many base pairs are there in this GenBank entry for the human p53 gene? How many coding exons? How many (mRNA) exons? Write down the coordinates of the untranslated 5' and 3' regions of this gene. Write down the amino acid sequence produced by the first coding exon (i.e., translated part of exon 2).

In the GenBank entry for this gene, several "conflicts" and "variations" are mentioned. Conflicts refer to inconsistencies. More importantly, variations correspond to alleles, i.e., some people have different versions of the gene. They are marked in mustard color in the Graphics entry. Can you identify at least one of these on the graphics entry? Look at the variation specified for location 12139.

Q6: What are the possible nucleotides at location 12139? Is it within a coding exon? Does this change imply a change in the translated amino acid? If so, what are the possible amino acids for this location? Repeat the process for location 12032.

The above question could have been more easily answered by doing the following. In the GenBank entry, at the top right corner of the page, you will see a hyperlinked word called "Link". Click on it and you will see a pull-down menu. Follow the option "Gene View in dbSNP". All the variations (i.e., single nucleotide polymorphisms or SNPs) for this gene are listed here. Study this table carefully.

(Optional reading.) To find out more about p53, you are encouraged to look at some of the papers cited on this protein. Abstracts can be retrieved via the PubMed links in Entrez, but be warned that there are a lot of them (thousands)!

We will now explore Swiss-Prot, the best-curated protein sequence database. A related database is TrEMBL, which is the uncurated version of Swiss-Prot. Go to the SWISSPROT database. Go to the Advanced search page and search for "Description" P53 under "Organism" Human. Confine your search to only Swiss-Prot and not to TrEMBL. Since you already know its length, it should be easy to locate P53_HUMAN. Click on it and go to the entry for the protein. After the list of 79 references on the protein, the comments fields tell us, among other things, that p53 acts as a tumor suppressor, and its normal function is to stop cells from growing, or to make them die at the right time (apoptosis). When something goes wrong with p53, cells can grow in an uncontrolled manner, a hallmark of cancer. Scan to near the bottom of the record, and you will find a list of many mutations of the p53 gene that cause it to make a different amino acid at some position in the protein, making the person prone to getting cancer. These are usually SNPs (single nucleotide polymorphisms) that cause a substitution of one amino acid for another. Find the tumor-causing substitutions of R (arginine) at position 110.

Q7: What amino acid substitutions of the R at position 110 in the p53 protein are listed as involved in cancers? What SNPs might cause these?

To answer the last question, you will need to go back and find the three nucleotides in the p53 mRNA that form the codon for the R at position 110 in p53. Then you will have to look in a table of the genetic code to see what codons code for these other amino acids. Any standard table of the genetic code will do.
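The lookup described above can be sketched in code: given a codon, list every amino acid reachable by changing exactly one nucleotide. This sketch assumes the standard genetic code. The example codon AGA is just one of the six arginine codons, chosen for illustration; it is not claimed to be the codon at position 110, which you still need to look up yourself.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def single_base_variants(codon):
    """Map each different amino acid (or '*' for stop) reachable by one
    nucleotide change to the list of variant codons that produce it."""
    codon = codon.upper()
    original = CODON_TABLE[codon]
    reachable = {}
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            variant = codon[:pos] + base + codon[pos + 1:]
            residue = CODON_TABLE[variant]
            if residue != original:  # ignore silent changes
                reachable.setdefault(residue, []).append(variant)
    return reachable


# Illustrative arginine codon only; not necessarily the one in p53.
for residue, codons in sorted(single_base_variants("AGA").items()):
    print(residue, codons)
```

A substitution listed in the database that does not appear in this kind of table for the actual codon would require more than one nucleotide change.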

SWISSPROT gives extensive cross-references to other databases, including GenBank and the mirror site at EMBL (European Molecular Biology Laboratory), PIR (Protein Information Resource), and PDB, the Protein Data Bank, a database of three-dimensional protein structures. We'll look at PDB later in the assignment. The fact that P53 has PDB links implies that the protein structure has been determined by crystallography or NMR methods.

For now, find the PFAM entry in the p53 SWISSPROT record and click on it. PFAM is a database of multiple alignments of related protein sequences. Sets of protein sequences that have evolved from a common ancestor are very useful in understanding and predicting aspects of protein structure and function. Read the description of the P53 family. Click on "Get alignment". (The default "Colored Alignment" view works quite well. "Jalview" is a Java tool to look at the multiple alignment, if you want to explore this further.) You see an alignment of 7 protein sequences. The fourth one corresponds to P53_human; the rest are similar proteins. Dots are inserted so that the corresponding amino acids from all the sequences line up in columns. Scan across, and note that some regions of the protein are more highly conserved than others. Multiple alignments will be discussed in class soon.

Find the arginine at position 110 of human p53 in this alignment. It is about 1/3 of the way through. (If you clicked on one of the substitutions for this amino acid on the SWISSPROT page, then you got a context which told you that QGSYGF precedes this arginine and LGFLH follows it. This is useful in checking if you have the right arginine.)

Q8: What other amino acids occur in this position in the other organisms listed in this multiple alignment? (list them).

These amino acid substitutions presumably do not disrupt p53's function, since they are tolerated in these other organisms. However, the SWISSPROT file for human p53 lists 3 tumor-associated substitutions for position 110. Presumably these are disruptive.

Q9: (Optional) Are there amino acid properties that distinguish the (presumably) disruptive from the (presumably) non-disruptive substitutions? Which properties? One other isoform of p53 is given in this page. What is an isoform? How are these produced? Investigate this further.

You will need to read some books on proteins (recommended reading: "Introduction to Protein Structure", by Branden and Tooze). Amino acids are broadly classified as hydrophobic, charged, or polar. The question asks whether anything can be said about the residues in location 110.

Now repeat the previous two questions, but instead use the arginine at position 248 in the human p53 protein. You'll find that substitutions of this residue also make a person prone to cancer. In this case there are even more disease-associated substitutions. This residue occurs about 2/3 of the way through the alignment. This region of the protein is highly conserved among the species in the alignment, and in particular, all proteins have arginine in this position. One might conclude that perhaps the residue in this position of the protein must be arginine for the protein to function and for the organism to be healthy.

In fact, if you only look at human p53 and its very close orthologs (corresponding proteins in different species, presumably descended from a common ancestor protein), it seems like the whole sequence of residues MCNSSCMGGMNRRP (and more) is completely conserved. For many years people used the conserved "motif" MCNSSCMGGMNRRP that occurs in p53 at this place as a signature sequence of p53, searching for this string in proteins from other organisms to find orthologs of p53. This string produced few false positives (proteins that have this motif in them but are not orthologs of p53) and few false negatives (proteins that are orthologs of p53 but do not have this motif in them). However, looking at the more distant members of the family in this alignment, in particular, the first sequence (Q27937), which is from a squid, we see that many positions in this motif can vary. Very few individual residues in a typical protein are absolutely essential, in that no substitutions exist that preserve function. Arginine 248 in human p53 may be one of them, but some of its adjacent amino acids certainly are not. In general, distantly related orthologs cannot be found by searching for "signature" sequences like this. Either the signature is too short, in which case you get too many false positives, or the signature is too long, in which case you get too many false negatives. We'll look at the probability theory behind this in a future assignment.
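As a rough preview of why short signatures give false positives, here is a back-of-the-envelope sketch. It assumes, crudely, that residues are independent and that all 20 amino acids are equally likely, and it approximates the number of starting positions by the database size. The database size of 10**8 residues is a hypothetical figure, not taken from any real database.

```python
def expected_chance_matches(motif_length, database_residues, alphabet_size=20):
    """Expected number of exact matches to a fixed motif arising purely by
    chance, under a uniform, independent-residue model (a crude assumption)."""
    return database_residues * (1.0 / alphabet_size) ** motif_length


HYPOTHETICAL_DB_SIZE = 10**8  # made-up number of residues

for length in (3, 6, 14):
    print(length, expected_chance_matches(length, HYPOTHETICAL_DB_SIZE))
```

Under this model the expected count falls by a factor of 20 with every added residue, so a 14-residue exact signature almost never matches by chance. This sketch says nothing about the other side of the trade-off, the false negatives caused by real substitutions in distant orthologs.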

In the early days, Amos Bairoch, the designer of SWISSPROT, and his collaborators put a lot of effort into developing generalized "signature" motifs that allow particular substitutions in particular places in the motif, in hopes of finding motifs that would have no false positives or false negatives for a given protein family. The motif database they produced is called PROSITE.

Go back to the SWISSPROT p53 page and click on the PROSITE link. Study the entry. The PROSITE entry proposes the completely conserved motif M-C-N-S-S-C-[MV]-G-G-M-N-R-R as a signature motif for the p53 family, and they tested this pattern at the time this work was done, concluding that it found no false positives or false negatives. However, the database has grown considerably since then, as has our ability to locate likely orthologs. Later we'll see how hidden Markov models (HMMs) are a better way to define characteristic "patterns" that are present in protein families, which can then be used to find new members of that family. That is what the HMMs in PFAM are used for. Feel free to explore this aspect of PFAM further; we'll return to this in later assignments.
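PROSITE-style patterns map naturally onto regular expressions. The converter below is a minimal sketch that handles only a subset of the pattern syntax (single residues, `[..]` alternatives, `{..}` exclusions, the `x` wildcard, and `(n)` or `(n,m)` repeats); it is not PROSITE's own scanning software. The test sequence is made up.

```python
import re

# One pattern element: a residue, [alternatives], {exclusions}, or x,
# optionally followed by a repeat count such as (2) or (2,4).
ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Z]|x)(\(\d+(,\d+)?\))?")


def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, repeat = match.group(1), match.group(2) or ""
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except these
        if repeat:
            repeat = "{" + repeat[1:-1] + "}"
        parts.append(core + repeat)
    return "".join(parts)


signature = prosite_to_regex("M-C-N-S-S-C-[MV]-G-G-M-N-R-R")
toy_sequence = "AAMCNSSCVGGMNRRPAA"  # hypothetical sequence for illustration
hit = re.search(signature, toy_sequence)
print(signature, hit.span() if hit else None)
```

The `[MV]` element is what lets the toy sequence match even though it has V where human p53 has M; an exact string search for the human motif would miss it.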

Now go back to the SWISSPROT record for human p53 and find the list of PDB entries. Click on the ExPASy link for 1TSR. This gives access to information about the structure of the p53 protein from PDB. PDB is the Protein Data Bank, a repository of protein structures solved by X-ray crystallography or by NMR. Each solved structure has a 4-character identifier. This is the PDB record for 1TSR. This particular structure is p53 bound to DNA. Notice that 1TSR is the structure for the core DNA-binding domain of the protein (here defined as residues 102-292) bound to a piece of DNA. p53 is a DNA-binding protein that can influence another protein by binding in front of (i.e., on the 5' side of) its gene and thereby altering the way the gene for that other protein is transcribed, in this case by causing it to make more copies of the protein. In this structure, p53 is "caught in the act", so to speak.

Follow the ExPASy PDB link for 1TSR. (If you follow the MMDB entry, followed by the PubMed link, it leads you to the 1996 Science paper that describes this structure. You can read the abstract online or go to the library to look at a copy of the paper.) Click on "Still image" to see a picture of the structure of p53.


We now move on to BLAST. Go to the BLAST homepage. Follow the link for Blasting 2 sequences (bl2seq). Choose the "blastp" program. Align the 2 proteins with Swiss-Prot IDs P53_HUMAN and P53_XENLA.

Q10: Print out the alignment output by BLAST.

Play around with the various options that BLAST offers and see how they affect the output. Go to the Standard Protein-Protein BLAST page. Cut and paste a FASTA-formatted version of p53 (gi 3041867) into the "Search" box. Then "Blast" it. When it returns, format it so that the Number of Descriptions is limited to 15, and the alignment view is "Pairwise". Study the pairwise alignment shown at the bottom of the results page.

Now let's do the regular protein BLAST. Go to the BLAST homepage. Follow the link for protein-protein BLAST (blastp). Cut and paste the FASTA version of p53 (gi 3041867 or Swiss-Prot ID P53_HUMAN) in the box denoted "Search". For "Choose database", click on swissprot. Make sure the "Low Complexity Filter" is turned on (this is the default). Study the other default parameters used by blastp. Now click on "BLAST!". This may take a few minutes to respond with an answer. Which hits are significant and why? The first hit on the results page corresponds to the original sequence itself. Pick one of the other hits and study the pairwise alignment.

Read the BLAST tutorial pages and find answers to the following questions. There is no need to write down your answers. This is merely for your benefit. Find the definitions of the following terms: Score (bits), E-Value, Identities, Positives, and Gaps. Figure out how Score and E-Values are computed (see Mount's book). Figure out how the E-Value differs from the P-Value of an alignment.
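As orientation for the reading on scores and E-Values, the standard Karlin-Altschul relations can be sketched numerically: the E-Value for a bit score S' is E = m * n * 2^(-S'), and the corresponding P-Value is P = 1 - exp(-E). In the sketch below, m and n are taken to be plain query and database lengths; actual BLAST output uses effective lengths after an edge correction, so treat the numbers as illustrative only. The function names and the lengths used are made up.

```python
import math


def evalue_from_bit_score(bit_score, query_length, database_length):
    """E = m * n * 2**(-S'): expected number of chance alignments that
    score at least S' bits in a search space of size m * n."""
    return query_length * database_length * 2.0 ** (-bit_score)


def pvalue_from_evalue(evalue):
    """P = 1 - exp(-E): probability of seeing at least one chance
    alignment that good. expm1 keeps precision when E is tiny."""
    return -math.expm1(-evalue)


# Hypothetical search space: a 393-residue query against 5 * 10**7 residues.
for bits in (30, 50, 100):
    e = evalue_from_bit_score(bits, 393, 5 * 10**7)
    print(bits, e, pvalue_from_evalue(e))
```

For small E the two values are nearly identical, which is why BLAST reports only E; they diverge once E approaches 1, since P can never exceed 1.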

Now scroll down to the bottom of the results page. You will see a summary of your search. Figure out what the information down there means.

Parts of this homework were based on a homework designed by David Haussler for his course on Bioinformatics at the University of California, Santa Cruz.