CAP 5510 / CGS 5166 Homework 1

Due: February 11, 2012.

Using Entrez, GenBank, SwissPROT, Pfam, PROSITE, and BLAST.


p53 is a tumor protein associated with the regulation of cell growth. It is found to be mutated or inactivated in 60% of hereditary cancers. In this assignment we'll use some of the key bioinformatics tools and databases on the web to explore p53.

Go to the Entrez database browser at the National Center for Biotechnology Information (NCBI). NCBI is a division of the National Library of Medicine (NLM) at the National Institutes of Health (NIH). This page will soon become our portal of choice, our default start point for any search and exploration. You may want to bookmark it so that you can go there easily. It should take you to a webpage with the title "Entrez, The Life Sciences Search Engine" at the top of the page. Click on "All Databases" to see all the databases you have access to. Study this page briefly. Click on "GenBank" on the top panel. It should take you to the webpage for GenBank. Browse through the various databases that are available from this portal.

You will search for the protein "p53". Make sure you search in the protein database by clicking on "Protein". On Jan 25, 2013, this gave me 13747 hits. (The same search two years ago had only 8699 hits.) Modify your search to look for "p53 human" and you should still get 6408 hits. Many updates, mutants and partials are part of the hits.

Now modify your search to look for "p53[Protein Name] AND Human[Organism]". This can be achieved as follows: Delete the phrase "p53 human" you typed earlier for the search. Then click on "Advanced Search", type p53, click on "Protein Name" and click on "Add to Search Box". Next type human, click on "Organism" and click on "AND" and then click on "Add to Search Box". This should enter the required phrase "(p53[Protein Name]) AND Human[Organism]" for the search. Now click on "Go" to launch the search. You should still get 21 hits.

We still need to narrow down the search even further, so we are going to try a different strategy. Go back to the Entrez database browser and click on "Gene", taking you to Entrez Gene. Entrez Gene is a searchable database of genes from RefSeq Genomes. We will continue the above search after we read up on RefSeq. On a different tab, read about RefSeq. Accession formats are well specified for the RefSeq database.
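
The field-tagged query assembled with "Advanced Search" can also be written out by hand. The following is a minimal sketch, assuming you want to see how such a query would be packaged for NCBI's E-utilities "esearch" service; the helper names entrez_term and esearch_url are made up for illustration, and the code only builds the URL without contacting NCBI.

```python
from urllib.parse import urlencode

# Base URL of NCBI's E-utilities "esearch" service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def entrez_term(clauses):
    """Join (text, field) pairs into a field-tagged Entrez query.

    Hypothetical helper: [("p53", "Protein Name"), ("Human", "Organism")]
    becomes 'p53[Protein Name] AND Human[Organism]'.
    """
    return " AND ".join(f"{text}[{field}]" for text, field in clauses)


def esearch_url(db, term, retmax=20):
    """Build (but do not send) an esearch request URL for database `db`."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return ESEARCH_URL + "?" + query


term = entrez_term([("p53", "Protein Name"), ("Human", "Organism")])
url = esearch_url("protein", term)
print(term)
print(url)
```

Changing the first argument of esearch_url from "protein" to "gene" would target Entrez Gene with the same term. Hit counts will differ from the ones quoted above, since the databases keep growing.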

Q1: Define RefSeq in a sentence or two. All RefSeq nucleotide and protein records start with two specific characters. What are they? Make sure you pay attention to the basic prefixes NC, NM, and NP.

Later you can figure out the RefSeq identifiers for p53 nucleotide and protein sequences. Going back to our search for p53, follow similar steps as before and search for "p53[Gene Name] AND Human[Organism]" at Entrez Gene. Although there are three hits, only one is for a gene called "TP53", described as "tumor protein p53 (Li-Fraumeni syndrome) [Homo sapiens]". (Oddly enough, performing the search with "p53[Protein Name] AND Human[Organism]" at Entrez Gene gives 134 hits, although the first one is the right hit. It also seems to find TP63, TP73, and a lot more.) Click on the first hit. This is the right page and we are going to study this in some detail. Note that the page has an index on the right side ("Table of Contents") and the individual components of the page can be expanded or contracted depending upon what interests you. Before you finish this assignment, you need to understand the nature of information on this page (regardless of whether or not I have explicit questions directing you). Read the summary information on this gene and make sure you understand it. Now look at the section titled "Genomic context". Look at the graphic in this section that shows genes adjacent to TP53 on the chromosome. Genes can lie on the forward strand or the reverse strand.

Q2: What is its Official Symbol and Full name? What is HGNC (which provides the symbol and name information)? What chromosome is this gene on? What are its alternative names? What strand does p53 lie on? Write down the names of the genes adjacent to it along with the strand that the neighbors are on.

From the index on the right side of the page click on "Reference sequences". The identifiers for the nucleotide and protein sequences of p53 are typical of RefSeq identifiers. The RefSeq genomic entry NG_017013.2 provides links to the RefSeq entry and can be viewed as a GenBank, FASTA, or Graphics entry. Click on "Sequence Viewer (Graphics)" for the graphic view with the intron-exon structure of this gene. Alternatively, click on "GenBank" and study the GenBank entry. The Pevsner text has explanations of its contents. A GenBank entry gives you sequence details, tells you where the database submission came from, gives information about related sequences, and cites the scientific literature. Locate its "GI" number. After some preliminary information about the entry, there are some references cited for this entry. This gene/protein has clearly been researched extensively. Many more references are available from PubMed and we will inspect them later. Find the section titled "COMMENT". This provides a summary of information on the p53 gene. This is followed by a section on "FEATURES". Study this carefully, especially the nucleotide sequence at the end of this page. In the "CDS" (coding sequence) section, you will also find the link to the amino acid sequence of the corresponding protein (e.g., protein product NP_000537). Try displaying the sequences in FASTA format. You can also try out the XML and Graphics formats. Note that the actual sequence of nucleotides is at the bottom of the entry.
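
Identifiers such as NG_017013.2 and NP_000537 can be taken apart mechanically. The following is a small sketch, assuming the common accession.version layout; the function name parse_refseq is hypothetical, and the prefix table lists only four of the RefSeq prefixes (the RefSeq documentation has the full list).

```python
import re

# A few RefSeq prefixes and the kind of molecule each one denotes.
REFSEQ_PREFIXES = {
    "NC": "complete genomic molecule (e.g., a chromosome)",
    "NG": "genomic region",
    "NM": "mRNA",
    "NP": "protein",
}

# Two letters, an underscore, digits, and an optional ".version" suffix.
ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def parse_refseq(accession):
    """Split an accession into prefix, number, version, and molecule type."""
    match = ACCESSION_RE.match(accession)
    if match is None:
        raise ValueError(f"not in the expected RefSeq layout: {accession!r}")
    prefix, number, version = match.groups()
    return {
        "prefix": prefix,
        "number": number,
        "version": int(version) if version else None,
        "molecule": REFSEQ_PREFIXES.get(prefix, "other RefSeq type"),
    }


for acc in ("NG_017013.2", "NP_000537"):
    print(acc, parse_refseq(acc))
```

The version suffix changes whenever the underlying sequence record is revised, which is why the same accession may appear with different versions in different years.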

Q3: Explain the following terms in about one sentence each: LOCUS NAME, ACCESSION NUMBER, GI NUMBER, RefSeqID. How many nucleotides does the mRNA sequence for human p53 have? What is its GI number? How many residues does the protein sequence for human p53 have? What is its GI number? What were the start and stop codons in the mRNA sequence?

Back to our TP53 page. Under "NCBI Reference Sequences (RefSeq)", there are entries that are independent of genome builds, of which there is one "Genomic" entry and 8 under "mRNA and Protein(s)", and there are three RefSeq entries as part of "builds" of primary assemblies (Annotation Release 104). Let us explore the graphic view of TP53.

Eukaryotic genes consist of exons and introns. These are best viewed in the graphic display. Additionally, there are 8 different proteins that TP53 codes for, each obtained by a different splicing of the mRNA (alternative splicing isoforms). The thick grey line shows the region of the chromosome being considered here. The thick green line represents the entire GenBank entry for this gene. The other lines show the locations of key features in this gene. There is also a slider icon next to the magnifying glass icon to adjust the "Zoom level". Play with this to see the gene at different zoom levels. The dark blue rectangles are the exons in the mRNA sequence made by this gene. The magenta rectangles are coding exons. Only a part of the mRNA of this gene is translated by the ribosome into amino acid residues. Let's focus on one of the mRNA/proteins, say NM_001126114.2 (the second one) and its corresponding protein NP_001119586.1. Identify the portion of exons 1 and 2 that is not translated. This corresponds to the 5' untranslated region (UTR) of the mRNA. Identify the portion of the last exon that is not translated (this is part of the 3' UTR). If you zoom in enough and navigate to the coding exons, they even have the amino acid sequence written underneath them.
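
The relationship between the 5' UTR, the coding sequence, and the 3' UTR can be illustrated on a toy sequence. The following is a sketch under a simplifying assumption: translation starts at the first ATG and stops at the first in-frame stop codon. Real annotations (such as the CDS feature in the GenBank entry) should be trusted over this rule, and the toy sequence below is made up, not taken from TP53.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def split_mrna(mrna):
    """Return (five_prime_utr, cds, three_prime_utr) for an mRNA in DNA letters.

    Assumes the first ATG is the start codon; the CDS includes the stop codon.
    """
    mrna = mrna.upper()
    start = mrna.find("ATG")
    if start == -1:
        raise ValueError("no start codon found")
    # Walk codon by codon from the start until an in-frame stop codon appears.
    for i in range(start, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            end = i + 3
            return mrna[:start], mrna[start:end], mrna[end:]
    raise ValueError("no in-frame stop codon found")


# Made-up toy mRNA: 6 nt of 5' UTR, a 15 nt CDS, and 6 nt of 3' UTR.
toy = "GGCACC" + "ATGGAGGAGCCGTGA" + "CTTTAA"
utr5, cds, utr3 = split_mrna(toy)
print("5' UTR:", utr5)
print("CDS:   ", cds)
print("3' UTR:", utr3)
```

In 1-based coordinates, as used in GenBank entries, the toy CDS spans positions 7..21, and its length is end - start + 1 = 15.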

Q4: How many coding exons are there in this GenBank entry for the human p53 gene? What are their coordinates and lengths? How many (mRNA) exons? Write down their coordinates and lengths. Write down the coordinates of the untranslated 5' and 3' regions of this gene. Write down the amino acid sequence produced by the first coding exon (i.e., translated part of exon 2).

On the right side of the TP53 page, you will find a long list of Links. Click on "SNP: GeneView". SNP stands for single nucleotide polymorphism. SNPs are single-nucleotide mutations or changes between different versions of the same gene. There is a table of all these SNPs and a graphic summary. Make sure you understand the color legend of the graphic summary. Read about the difference between a "synonymous" and a "non-synonymous" mutation.

Back to the TP53 page. Go to the "Bibliography" section and click on the "PubMed" link. This will take you to the PubMed page (read about PubMed) and will give you a link to each of the over 4300 publications that report on the p53 gene or protein. (There were only about 1500 three years ago.) Clicking on an article lets you read the abstract of the publication. Clicking on articles that have green or orange strips on them lets you download or read the corresponding publication. All of this indicates the critical role played by p53 and the amount of research that has gone into it. (Optional reading.) To find out more about p53, you are encouraged to look at some of the papers cited on this protein. Abstracts can be retrieved via the PubMed links in Entrez; however, there are a lot of them (thousands)!

The "GeneRIF" portion lists functions that p53 may be involved in. Under the "Phenotypes" section, you can find out how p53 is linked to a variety of diseases. Follow the link to one of them (say, Breast Cancer) and find out how p53 may be involved in the disease. Under "Pathways", you can find various processes in which p53 is involved. Go to the "Gene Ontology" section. This annotation is provided by the Gene Ontology Consortium. Later this semester, we will learn more about GO. For now, find all the ontological terms that are associated with TP53.

Read about "GEO Profiles", "OMIM", "Interactions" and "Taxonomy".


We will now explore Swiss-Prot, the best-curated protein sequence database. A related database is TrEMBL, which is the uncurated version of Swiss-Prot. Both have now been integrated into UniProt. Go to the UniProt database. Go to the Advanced search page and search for P53 for Humans. Since you already know its length, it should be easy to locate P53_HUMAN. Click on it and go to the entry for the protein. The entry informs us that p53 acts as a tumor suppressor, and its normal function is to stop cells from growing, or to make them die at the right time (apoptosis). When something goes wrong with p53, cells can grow in an uncontrolled manner, a hallmark of cancer. Under "Alternative Products", this page suggests 9 isoforms. (Investigate why this is different from the 8 isoforms suggested by the GenBank entry.) The section titled "Natural variations" shows a list of mutations of the p53 gene that cause it to make a different amino acid at some position in the protein, making the person prone to getting cancer. These are usually SNPs (single nucleotide polymorphisms) that cause a substitution of one amino acid for another. They are also referred to as "non-synonymous mutations". Find the tumor-causing substitutions of R (arginine) at position 110. If possible, read the review paper by Hollstein et al., Science 253:49-53 (1991).

Q5: What amino acid substitutions of the R at position 110 in the p53 protein are listed as involved in cancers? What SNPs might cause these?

To answer the above question, you will need to go back and find the three nucleotides in the p53 coding sequence that form the codon for the R in position 110 of p53. Then you will have to look in a table of the genetic code to see what codons code for these other amino acids. You can find the genetic code by clicking here.
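
The table lookup described above can be sketched in code. The following builds the standard genetic code and lists every single-nucleotide change of a codon together with the amino acid it produces, labelling each change as synonymous or non-synonymous. The codon CGA is an arbitrary arginine codon chosen for illustration; it is not claimed to be the codon at position 110 of p53, which you still need to look up yourself.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
# (third base varies fastest). '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def single_base_variants(codon):
    """Yield (variant_codon, amino_acid, kind) for all 9 single-base changes."""
    original = CODON_TABLE[codon]
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue  # not a change
            variant = codon[:pos] + base + codon[pos + 1:]
            aa = CODON_TABLE[variant]
            kind = "synonymous" if aa == original else "non-synonymous"
            yield variant, aa, kind


# Illustrative codon only (an assumption, not read from the p53 record).
example = "CGA"
for variant, aa, kind in single_base_variants(example):
    print(f"{example} -> {variant}: {CODON_TABLE[example]} -> {aa} ({kind})")
```

Comparing the non-synonymous amino acids in such a list against the substitutions reported in the "Natural variations" section shows which of them can arise from a single SNP.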

SWISSPROT gives extensive cross-references to other databases, including GenBank and the mirror site at EMBL (European Molecular Biology Laboratory), PIR (Protein Information Resource), and PDB, the Protein Data Bank, a database of three-dimensional protein structures. We'll look at this later in the assignment. The fact that P53 has PDB links implies that the protein structure has been determined by crystallography or NMR methods.

For now, find the PFAM entry in the p53 SWISSPROT record and click on it. PFAM is a database of multiple alignments of related protein sequences. Sets of protein sequences that have evolved from a common ancestor are very useful in understanding and predicting aspects of protein structure and function. Read the description of the P53 family. Click on "Get alignment". (The default "Colored Alignment" view works quite well. "Jalview" is a Java tool to look at the multiple alignment, if you want to explore this further.) You see a "seed" alignment of 9 protein sequences. P53_HUMAN is not on this list; the list contains similar proteins. If you look at the "full" alignment, you will find P53_HUMAN, although it shows only a small portion of it (318-359). Dashes are inserted so that the corresponding amino acids from all the aligned sequences line up in columns. Scan across, and note that some regions of the protein are more highly conserved than others. Multiple alignments will be discussed in class soon.

In the early days, Amos Bairoch, the designer of SWISSPROT, and his collaborators put a lot of effort into developing generalized "signature" motifs that allow particular substitutions in particular places in the motif, in hopes of finding motifs that would have no false positives or false negatives for a given protein family. The motif database they produced is called PROSITE.

Go back to the SWISSPROT p53 page and click on the PROSITE link. Study the entry. The PROSITE entry proposes the completely conserved motif M-C-N-S-S-C-[MV]-G-G-M-N-R-R as a signature motif for the p53 family, and they tested this pattern at the time this work was done, concluding that it found no false positives or false negatives. However, the database has grown considerably since then, as has our ability to locate likely orthologs. Later we'll see how hidden Markov models (HMMs) are a better way to define characteristic "patterns" that are present in protein families, which can then be used to find new members of that family. That is what the HMMs in PFAM are used for. Feel free to explore this aspect of PFAM further; we'll return to this in later assignments.
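
A PROSITE pattern such as the one quoted above maps quite directly onto a regular expression. The following is a minimal sketch of that translation; the function name prosite_to_regex is hypothetical, it covers only the common syntax elements named in its docstring, and the test sequences are made up rather than taken from any database.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into a Python regular expression.

    Handles '-' separators, 'x' wildcards, [..] allowed sets, {..} forbidden
    sets, (n) and (n,m) repeats, and '<' / '>' sequence-end anchors.
    Rarer constructs (e.g., '>' inside square brackets) are not supported.
    """
    regex = pattern.rstrip(".")          # PROSITE patterns may end with '.'
    regex = regex.replace("-", "")       # element separators carry no meaning
    regex = regex.replace("x", ".")      # x = any amino acid
    regex = regex.replace("{", "[^").replace("}", "]")  # forbidden residues
    regex = regex.replace("(", "{").replace(")", "}")   # repeat counts
    regex = regex.replace("<", "^").replace(">", "$")   # N-/C-terminal anchors
    return regex


P53_SIGNATURE = "M-C-N-S-S-C-[MV]-G-G-M-N-R-R"
signature_re = re.compile(prosite_to_regex(P53_SIGNATURE))

# Made-up test sequences: one contains the motif, one has A at the [MV] site.
hit = signature_re.search("AAAMCNSSCVGGMNRRKKK")
miss = signature_re.search("AAAMCNSSCAGGMNRRKKK")
print(signature_re.pattern, hit.start() if hit else None, miss)
```

A pattern like this either matches or does not, with nothing in between. That all-or-nothing behaviour is one reason profile methods such as HMMs cope better with diverged family members.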

Now go back to the SWISSPROT record for human p53 and find the list of PDB entries. Click on the ExPASy link for 1TSR. This gives access to information about the structure of the p53 protein from PDB. PDB is the Protein Data Bank, a repository of protein structures solved by X-ray crystallography or by NMR. Each solved structure has a 4-character identifier. This is the PDB record for 1TSR. This particular structure is p53 bound to DNA. Notice that 1TSR is the structure for the core DNA-binding domain of the protein (here defined as residues 102-292) bound to a piece of DNA. p53 is a DNA-binding protein that can influence another protein by binding in front of (i.e., on the 5' side of) its gene and thereby altering the way the gene for that other protein is transcribed, in this case by causing it to make more copies of the protein. In this structure, p53 is "caught in the act", so to speak.

Follow the ExPASy PDB link for 1TSR. (If you follow the MMDB entry, followed by the PubMed link, it leads you to the 1996 Science paper that describes this structure. You can read the abstract online or go to the library to look at a copy of the paper.) Click on "Still image" to see a picture of the structure of p53.


We now move on to BLAST. Go to the BLAST homepage. Follow the link for Blasting 2 sequences (bl2seq). Choose the "blastp" program. Align the 2 proteins with Swiss-Prot IDs P53_HUMAN and P53_XENLA.

Q6: Print out the alignment output by BLAST.

Play around with the various options that BLAST offers and see how they affect the output. Go to the Standard Protein-Protein Blast page. Cut and paste a FASTA formatted version of p53 (gi 3041867) into the "Search" box. Then "Blast" it. When it returns, format it so that the Number of Descriptions is limited to 15, and the alignment view is "Pairwise". Study the pairwise alignment shown at the bottom of the results page. Now let's do the regular protein BLAST. Go to the BLAST homepage. Follow the link for protein-protein BLAST (blastp). Cut and paste the FASTA version of p53 (gi 3041867 or Swiss-Prot ID P53_HUMAN) in the box denoted "Search". For "Choose database", click on swissprot. Make sure the "Low Complexity Filter" is turned on (this is the default). Study the other default parameters used by blastp. Now click on "BLAST!". This may take a few minutes to respond with an answer. Which hits are significant and why? The first hit on the results page corresponds to the original sequence itself. Pick one of the other hits and study the pairwise alignment. Read the BLAST tutorial pages and find answers to the following questions. There is no need to write down your answers. This is merely for your benefit. Find the definitions of the following terms: Score (bits), E-Value, Identities, Positives, and Gaps. Figure out how Score and E-Values are computed (see Mount's book). Figure out how E-Value is different from P-Value of an alignment.
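
As a companion to the reading on Score and E-Value, the following sketch writes out the standard Karlin-Altschul relations: the bit score S' = (lambda * S - ln K) / ln 2, the expected number of chance hits E = m * n * 2^(-S'), and P = 1 - exp(-E). The function names are made up for illustration. The values of lambda and K depend on the scoring matrix and gap penalties and are passed in by the caller, and real BLAST uses "effective" sequence lengths, so the numbers will not match a BLAST report exactly.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score S: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_len, db_len):
    """Expected number of chance hits with bit score >= bits: m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bits)


def p_value(e):
    """Probability of at least one chance hit that good: 1 - exp(-E)."""
    return -math.expm1(-e)


# Illustrative numbers only: a 50-bit hit, a 393-residue query, and a
# hypothetical database of 100 million residues.
e = e_value(50.0, 393, 100_000_000)
print(f"E = {e:.3g}, P = {p_value(e):.3g}")
```

For small E the two quantities are nearly equal, while for large E the P-value saturates at 1 and E keeps growing.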

Now scroll down to the bottom of the results page. You will see a summary of your search. Figure out what the information down there means.


Q7 (do not submit): Repeat questions 2 through 4 for the human protein "insulin". Also study some of the interesting SNPs. This is just an exercise and is not for submission.


Q8: Run the Needleman-Wunsch global sequence alignment algorithm on the sequences "VEPPLSQETFSDLWKLLPENNVLSPL" and "MDPPLSQETFEDLWSLLPDPL" using the BLOSUM62 substitution matrix, with gap open and gap extension penalties of 11 and 1 respectively. Write down the optimal alignment and the alignment score.
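
For reference, here is a minimal sketch of global alignment with affine gap penalties (Gotoh's three-matrix variant of Needleman-Wunsch). Several things are assumptions rather than part of the assignment: the function names, the convention that a gap of length k costs gap_open + k * gap_extend (some tools define the opening penalty differently), and the toy match/mismatch scoring used in the demonstration. To work on Q8 you would need to supply a score function backed by BLOSUM62; if Biopython is installed, its Bio.Align.substitution_matrices module is one possible source for that matrix.

```python
NEG = float("-inf")


def needleman_wunsch(a, b, score, gap_open, gap_extend):
    """Global alignment with affine gaps; returns (score, aligned_a, aligned_b).

    score(x, y) is the substitution score; a gap of length k costs
    gap_open + k * gap_extend (an assumed convention).
    """
    n, m = len(a), len(b)
    # M: a[i-1] aligned to b[j-1]; X: a[i-1] against a gap; Y: b[j-1] against a gap.
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -(gap_open + i * gap_extend)
    for j in range(1, m + 1):
        Y[0][j] = -(gap_open + j * gap_extend)
    first = gap_open + gap_extend  # cost of the first position of a new gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = score(a[i - 1], b[j - 1]) + max(
                M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(M[i - 1][j] - first, Y[i - 1][j] - first,
                          X[i - 1][j] - gap_extend)
            Y[i][j] = max(M[i][j - 1] - first, X[i][j - 1] - first,
                          Y[i][j - 1] - gap_extend)
    # Traceback: follow the best-scoring state back to the origin.
    mats = {"M": M, "X": X, "Y": Y}
    state = max(mats, key=lambda s: mats[s][n][m])
    best = mats[state][n][m]
    i, j, out_a, out_b = n, m, [], []
    while i > 0 or j > 0:
        if state == "M":
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
            state = max(mats, key=lambda s: mats[s][i][j])
        elif state == "X":
            out_a.append(a[i - 1])
            out_b.append("-")
            prev = {"M": M[i - 1][j] - first, "Y": Y[i - 1][j] - first,
                    "X": X[i - 1][j] - gap_extend}
            i -= 1
            state = max(prev, key=prev.get)
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            prev = {"M": M[i][j - 1] - first, "X": X[i][j - 1] - first,
                    "Y": Y[i][j - 1] - gap_extend}
            j -= 1
            state = max(prev, key=prev.get)
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


def toy_score(x, y):
    """Toy scoring for the demo only: +2 for a match, -1 for a mismatch."""
    return 2 if x == y else -1


print(needleman_wunsch("ACGT", "AGT", toy_score, gap_open=2, gap_extend=1))
```

With the toy scoring, three matches (+6) and one gap of length one (-3) give a total of 3. Working a small case like this by hand is a good check before switching to BLOSUM62 and the penalties given in Q8.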