CAP 5510/CGS 5166 Homework 1

BSC 4934 Homework 1

Due: In Class, Tuesday, June 30, 2009.

Using Entrez, GenBank, SwissPROT, Pfam, PROSITE, and BLAST.

p53 is a tumor protein associated with the regulation of cell growth. It is found to be mutated or inactivated in 60% of hereditary cancers. In this assignment we will use some of the key bioinformatics tools and databases on the web to explore p53.

Go to the Entrez database browser at the National Center for Biotechnology Information (NCBI). NCBI is part of the National Library of Medicine at the National Institutes of Health (NIH). This page will soon become our portal of choice, our default starting point for any search and exploration, so you may want to bookmark it. It should take you to a webpage with the title "Entrez, The Life Sciences Search Engine" at the top of the page. Click on "All Databases" to see all the databases you have access to. Study this page briefly. Click on "GenBank" on the top panel. It should take you to the webpage for GenBank. On the left panel, you will find a link to Entrez Nucleotide. Click on it.

You will search for the human protein "p53". Make sure you search in the protein database by clicking on "Protein". On Feb 5, 2008, this gave me 6197 hits. (The same search last year had only 3830 hits.) Modifying the search to look for "p53 human" still gave me 3047 hits. Now modify your search to look for "p53[Gene Name] AND Human[Organism]". This can be achieved as follows: Delete the phrase "p53 human" you typed earlier for the search. Then click on "Preview/Index", type p53, click on "Gene Name" and click on "AND". Next type human, click on "Organism" and click on "AND". This should enter the required phrase "p53[Gene Name] AND Human[Organism]" for the search. Now click on "Go" to launch the search. I still got 35 hits, so we need to narrow the search even further.

We are going to try a different strategy. Go back to the Entrez database browser and click on "Gene", taking you to Entrez Gene. Entrez Gene is a searchable database of genes from RefSeq Genomes.
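The fielded query built above through "Preview/Index" is just a string of value[Field] clauses joined by AND. The following is a minimal sketch of assembling it in Python. The helper names are hypothetical, and the URL uses NCBI's E-utilities `esearch` endpoint, which this assignment does not require; treat that part as an assumption about how one might run the same search programmatically.

```python
from urllib.parse import urlencode

# Assumption: NCBI's E-utilities esearch endpoint (not used in the assignment itself).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def entrez_term(*clauses):
    """Join (value, field) pairs into a fielded Entrez query string."""
    return " AND ".join(f"{value}[{field}]" for value, field in clauses)

def esearch_url(db, term):
    """Build (but do not fetch) an esearch URL for the given database and query."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})

term = entrez_term(("p53", "Gene Name"), ("Human", "Organism"))
print(term)                          # the phrase typed into the search box
print(esearch_url("protein", term))  # the same query as a URL
```

Changing the first argument of `esearch_url` from "protein" to "gene" corresponds to repeating the search in Entrez Gene.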

Q1: Define RefSeq in a sentence or two. RefSeq accession numbers can be distinguished from GenBank accessions by their distinct prefix format of 2 characters followed by an underscore character ('_'). All RefSeq nucleotide and protein records start with two specific characters. What are they?
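The prefix rule described in Q1 (two characters, then an underscore) is easy to check mechanically. Here is a small sketch; the function name is hypothetical, and the pattern additionally assumes that the two characters are uppercase letters and that the rest is a number with an optional ".version" suffix, as in the identifiers that appear later in this assignment.

```python
import re

# Two uppercase letters, an underscore, digits, and an optional ".version" suffix.
# Assumption: this covers only the identifier shapes seen in this assignment.
REFSEQ_LIKE = re.compile(r"^[A-Z]{2}_\d+(\.\d+)?$")

def looks_like_refseq(accession):
    """Return True if the accession has the RefSeq-style 'XX_digits' shape."""
    return bool(REFSEQ_LIKE.match(accession))

for acc in ["NM_000546.3", "NT_010718.15", "P53_HUMAN", "1TSR"]:
    print(acc, looks_like_refseq(acc))
```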

Later you can figure out the RefSeq identifiers for p53 nucleotide and protein sequences. Continuing on our search for p53, follow similar steps as before and search for "p53[Gene Name] AND Human[Organism]" at Entrez Gene. This gives just one hit for a gene called "TP53" for "tumor protein p53 (Li-Fraumeni syndrome) [Homo sapiens]". Click on it. Read the summary information on this gene and make sure you understand it. Now look at the section titled "Genomic context". Look at the genes adjacent to it on the chromosome. Genes can lie on the forward strand or the reverse strand.

Q2: What is its Official Symbol and Full name? What is HGNC (which provides the symbol and name information)? What chromosome is this gene on? What are its alternative names? What strand does p53 lie on? Write down the names of the genes adjacent to it along with the strand that the neighbors are on.

Next look at the section titled "Genomic regions, transcripts, and products". Notice the identifiers for the nucleotide and protein sequences of p53. As you saw earlier, these are typical of RefSeq identifiers. There is a small graphic in this section showing "coding regions" and "untranslated regions". This shows the intron-exon structure of this gene. It is small and hard to see the details of the structure clearly. We are going to navigate and study this in excruciating detail next. Before we do that, quickly scroll down this page and look at all the sections listed on this page. The different topics are summarized in the "Table of Contents" on the right side of the page. Before you finish this assignment, you need to understand the nature of information on this page (regardless of whether or not I have explicit questions directing you).

In the section titled "Genomic regions, transcripts, and products", click on "reference sequence details". This is how we access RefSeq information on p53. Clicking on "NM_000546.3" or "NP_000537.3" in the subsection titled "mRNA and Protein(s)" will take you to GenBank entries of the corresponding sequences. Study these GenBank entries. They are typical of all GenBank entries. A GenBank entry gives you sequence details, tells you where the database submission came from, gives information about related sequences, and has citations to the scientific literature. Note that the actual sequence of nucleotides is at the bottom of the report. Locate its "GI" number. After some preliminary information about the entry, there are some references cited for this entry. This gene/protein has clearly been researched extensively. Many more references are available from PubMed and we will inspect them later. Find the section titled "COMMENT". This provides a summary of information on the p53 gene. This is followed by a section on "FEATURES". Study this carefully, especially the nucleotide sequence at the end of this page. In the CDS section, you will also find the amino acid sequence of the corresponding protein (and a link to this protein product -- NP_000537). Try displaying the sequences in FASTA format. For details about the format, read page 32 of your text (JP). You can also try out the XML and Graph formats.
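A FASTA record is a header line starting with ">" followed by one or more lines of sequence. The sketch below reads such text into (header, sequence) pairs. The function name and the two toy records are made up for illustration; they are not real database entries.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Toy input (hypothetical records, not taken from GenBank).
toy = ">toy_seq1 hypothetical example\nATGGAGGAG\nCCGCAG\n>toy_seq2\nMEEPQ\n"
for name, seq in parse_fasta(toy):
    print(name, len(seq), seq)
```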

Each entry in the GenBank database carries several pieces of identifying information. See pages 26-27 of your text (JP).

Q3: Explain the following terms in about one sentence each: LOCUS NAME, ACCESSION NUMBER, GI NUMBER, RefSeqID. How many nucleotides does the mRNA sequence for human p53 have? What is its GI number? How many residues does the protein sequence for human p53 have? What is its GI number? What were the start and stop codons in the mRNA sequence?
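For the start and stop codons, the CDS feature in the GenBank entry gives 1-based, inclusive coordinates into the mRNA sequence. A minimal sketch of reading the two codons off those coordinates follows. The function name, the toy sequence, and its coordinates are invented for illustration, and the sequence is written with T rather than U, as GenBank displays it.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def cds_summary(mrna, cds_start, cds_end):
    """Return (start codon, last codon, residue count) for a CDS.

    cds_start and cds_end are 1-based and inclusive, as in a GenBank
    'CDS start..end' feature line.
    """
    cds = mrna[cds_start - 1:cds_end].upper()
    if len(cds) % 3 != 0:
        raise ValueError("CDS length is not a multiple of 3")
    start_codon, last_codon = cds[:3], cds[-3:]
    n_codons = len(cds) // 3
    # The stop codon does not encode a residue.
    n_residues = n_codons - 1 if last_codon in STOP_CODONS else n_codons
    return start_codon, last_codon, n_residues

# Toy mRNA (hypothetical): CDS spans positions 3..11, i.e. ATG GCC TAA.
print(cds_summary("GGATGGCCTAAGG", 3, 11))
```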

Back on our TP53 page, find the subsection titled "Reference assembly". There are four of them. Two are based on the Celera Assembly (do you know what that means?). Not all the links appear to show you the intron-exon structure in detail. Click on the "GenBank" link for NT_010718.15 to explore further.

Go to the "display" option in the menu at the top of the page and pull down the "Graph" option. Eukaryotic genes consist of exons and introns. These are better viewed in the graphic display under the "Graph" format. The thick light blue line shows the region of the chromosome being considered here. The grey line represents the entire GenBank entry for this gene, and the thick red line shows the region whose nucleotide sequence is shown below. The other lines show the locations of key features in this gene. There is also a navigational icon on the right hand side to adjust the "Zoom level". Play with this to see the display at different zoom levels. The dark blue rectangles are the exons in the mRNA sequence made by this gene. The magenta rectangles are coding exons. Only a part of the mRNA of this gene is translated by the ribosome into amino acid residues. Identify the portions of exons 1 and 2 that are not translated. These correspond to the 5' untranslated region (UTR) of the mRNA. Identify the portion of the last exon that is not translated (this is part of the 3' UTR). If you navigate to the coding exons, they even have the amino acid sequence written underneath them.

Q4: How many coding exons are there in this GenBank entry for the human p53 gene? What are their coordinates and lengths? How many (mRNA) exons? Write down their coordinates and lengths. Write down the coordinates of the untranslated 5' and 3' regions of this gene. Write down the amino acid sequence produced by the first coding exon (i.e., translated part of exon 2).
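Once the exon coordinates are written down, lengths and coding portions are simple interval arithmetic. The sketch below assumes 1-based inclusive coordinates listed in ascending order; for a gene on the reverse strand the display order may need to be flipped first. All names and numbers are hypothetical and are not the p53 coordinates.

```python
def exon_lengths(exons):
    """Length of each exon given (start, end) pairs, both ends inclusive."""
    return [end - start + 1 for start, end in exons]

def coding_parts(exons, cds_start, cds_end):
    """Portion of each exon that falls inside the coding region, if any."""
    parts = []
    for start, end in exons:
        lo, hi = max(start, cds_start), min(end, cds_end)
        if lo <= hi:
            parts.append((lo, hi))
    return parts

# Hypothetical three-exon gene whose coding region runs from 321 to 560.
exons = [(101, 200), (301, 350), (501, 700)]
print(exon_lengths(exons))             # lengths of the mRNA exons
print(coding_parts(exons, 321, 560))   # coding exons; the rest is UTR
```

In this toy case the first exon and the start of the second make up the 5' UTR, and the tail of the last exon is the 3' UTR.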

On the right side of the TP53 page, you will find a long list of Links. Click on "SNP: GeneView". SNP stands for single nucleotide polymorphisms. These are single nucleotide mutations or changes between different versions of the same gene. There is a table of all these SNPs and a graphic summary. Make sure you understand the color legend of the graphic summary. Read about the difference between a "synonymous" and a "non-synonymous" mutation.

Back to the TP53 page. Go to the "Bibliography" section and click on the "PubMed" link. This will take you to the PubMed page (read about PubMed) and will give you a link to each of the over 1500 publications that report on the p53 gene or protein. Clicking on an article will show you the abstract of the publication. Clicking on articles that have green or orange strips on them will let you download or read the corresponding publication. All of this indicates the critical role played by p53 and the amount of research that has gone into it. (Optional reading.) To find out more about p53, you are encouraged to look at some of the papers cited on this protein. Abstracts can be retrieved via the PubMed links in Entrez, but be warned that there are a lot of them (thousands)!

The GeneRIF portion lists functions that p53 may be involved in. Under the "Phenotypes" section, you can find out how p53 is linked to a variety of diseases. Follow the link to one of them (say, Breast Cancer) and find out how p53 may be involved in the disease. Under Pathways information, you can find various processes in which p53 is involved. Go to the "Gene Ontology" section. This annotation is provided by the Gene Ontology Consortium. Later this semester, we will learn more about GO.

We will now explore Swiss-Prot, the best-curated protein sequence database. A related database is TrEMBL, which is the uncurated version of Swiss-Prot. Go to the SWISSPROT database. Go to the Advanced search page and search for "Description" P53 under "Organism" Human. Confine your search to only Swiss-Prot and not to TrEMBL. Since you already know its length, it should be easy to locate P53_HUMAN. Click on it and go to the entry for the protein. After the list of references on the protein, the comments fields tell us, among other things, that p53 acts as a tumor suppressor, and its normal function is to stop cells from growing or to make them die at the right time (apoptosis). When something goes wrong with p53, cells can grow in an uncontrolled manner, a hallmark of cancer. Scan to near the bottom of the record, and you will find a list of many mutations of the p53 gene that cause it to encode a different amino acid at some position in the protein, making the person prone to getting cancer. These are usually SNPs (single nucleotide polymorphisms) that cause a substitution of one amino acid for another. Find the tumor-causing substitutions of R (arginine) at position 110.

Q5: What amino acid substitutions of the R at position 110 in the p53 protein are listed as involved in cancers? What SNPs might cause these?

To answer the above question, you will need to go back and find the three nucleotides in the p53 mRNA that form the codon for the R at position 110. Then you will have to look in a table of the genetic code to see which codons code for these other amino acids. You can find the genetic code by clicking here.
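The table lookup can also be done mechanically: take a codon, change one nucleotide at a time, and translate each variant. The sketch below uses the standard genetic code with bases ordered T, C, A, G. The function name is hypothetical, and the example codon is GAA (glutamate), chosen so that it does not answer Q5.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def single_nucleotide_variants(codon):
    """List (variant codon, amino acid, kind) for every one-base change."""
    codon = codon.upper()
    original = CODON_TABLE[codon]
    results = []
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            variant = codon[:pos] + base + codon[pos + 1:]
            aa = CODON_TABLE[variant]
            # A change to a stop codon is counted as non-synonymous here.
            kind = "synonymous" if aa == original else "non-synonymous"
            results.append((variant, aa, kind))
    return results

for variant, aa, kind in single_nucleotide_variants("GAA"):
    print(variant, aa, kind)
```

Each codon has nine single-nucleotide neighbors; the ones labelled synonymous leave the protein unchanged, which is the distinction drawn in the SNP GeneView table.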

SWISSPROT gives extensive cross-references to other databases, including GenBank and the mirror site at EMBL (European Molecular Biology Laboratory), PIR (Protein Information Resource), and PDB, the Protein Data Bank, a database of three-dimensional protein structures. We'll look at this later in the assignment. The fact that P53 has PDB links implies that the protein structure has been determined by crystallography or NMR methods.

For now, find the PFAM entry in the p53 SWISSPROT record and click on it. PFAM is a database of multiple alignments of related protein sequences. Sets of protein sequences that have evolved from a common ancestor are very useful in understanding and predicting aspects of protein structure and function. Read the description of the P53 family. Click on "Get alignment". (The default "Colored Alignment" view works quite well. "Jalview" is a Java tool to look at the multiple alignment, if you want to explore this further.) You see a "seed" alignment of 9 protein sequences. P53_HUMAN is not on this list; the list contains similar proteins. If you look at the "full" alignment, you will find P53_HUMAN, although it shows only a small portion of it (318-359). Dashes are inserted so that the corresponding amino acids from all the aligned sequences line up in columns. Scan across, and note that some regions of the protein are more highly conserved than others. Multiple alignments will be discussed in class soon.

In the early days, Amos Bairoch, the designer of SWISSPROT, and his collaborators put a lot of effort into developing generalized "signature" motifs that allow particular substitutions in particular places in the motif, in hopes of finding motifs that would have no false positives or false negatives for a given protein family. The motif database they produced is called PROSITE.

Go back to the SWISSPROT p53 page and click on the PROSITE link. Study the entry. The PROSITE entry proposes the completely conserved motif M-C-N-S-S-C-[MV]-G-G-M-N-R-R as a signature motif for the p53 family, and they tested this pattern at the time this work was done, concluding that it found no false positives or false negatives. However, the database has grown considerably since then, as has our ability to locate likely orthologs. Later we'll see how hidden Markov models (HMMs) are a better way to define characteristic "patterns" that are present in protein families, which can then be used to find new members of that family. That is what the HMMs in PFAM are used for. Feel free to explore this aspect of PFAM further; we'll return to this in later assignments.
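A PROSITE pattern like the one above maps almost directly onto a regular expression. The sketch below converts a limited subset of the PROSITE syntax (single residues, [..] allowed sets, {..} forbidden sets, x for any residue, and (n) or (n,m) repeat counts). The function name is hypothetical, and the test sequence is made up, not a real p53 sequence.

```python
import re

# One pattern element: x, a residue, [allowed], or {forbidden}, with optional (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip(".").split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, lo, hi = m.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except these
        if lo:
            core += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(core)
    return "".join(parts)

signature = prosite_to_regex("M-C-N-S-S-C-[MV]-G-G-M-N-R-R")
print(signature)
# Made-up sequence containing the motif, for illustration only.
print(re.search(signature, "AAAMCNSSCVGGMNRRKKK"))
```

Scanning every sequence in a database with such an expression is how false positives and false negatives for a signature would be counted.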

Now go back to the SWISSPROT record for human p53 and find the list of PDB entries. Click on the ExPASy link for 1TSR. This gives access to information about the structure of the p53 protein from PDB. PDB is the Protein Data Bank, a repository of protein structures solved by X-ray crystallography or by NMR. Each solved structure has a 4-character identifier. This is the PDB record for 1TSR. This particular structure is p53 bound to DNA. Notice that 1TSR is the structure for the core DNA-binding domain of the protein (here defined as residues 102-292) bound to a piece of DNA. p53 is a DNA-binding protein that can influence another protein by binding in front of (i.e. on the 5' side of) its gene and thereby altering the way the gene for that other protein is transcribed, in this case by causing it to make more copies of the protein. In this structure, p53 is "caught in the act", so to speak.

Follow the ExPASy PDB link for 1TSR. (If you follow the MMDB entry, followed by the PubMed link, it leads you to the 1996 Science paper that describes this structure. You can read the abstract online or go to the library to look at a copy of the paper.) Click on "Still image" to see a picture of the structure of p53.

We now move on to BLAST. Go to the BLAST homepage. Follow the link for Blasting 2 sequences (bl2seq). Choose the "blastp" program. Align the 2 proteins with Swiss-Prot IDs P53_HUMAN and P53_XENLA.

Q6: Print out the alignment output by BLAST. (Blast will be covered in class on Monday.)

Play around with the various options that BLAST offers and see how they affect the output. Go to the Standard Protein-Protein Blast page. Cut and paste a FASTA formatted version of p53 (gi 3041867) into the "Search" box. Then "Blast" it. When it returns, format it so that the Number of Descriptions is limited to 15, and the alignment view is "Pairwise". Study the pairwise alignment shown at the bottom of the results page. Now let's do the regular protein BLAST. Go to the BLAST homepage. Follow the link for protein-protein BLAST (blastp). Cut and paste the FASTA version of p53 (gi 3041867 or Swiss-Prot ID P53_HUMAN) in the box denoted "Search". For "Choose database", click on swissprot. Make sure the "Low Complexity Filter" is turned on (this is the default). Study the other default parameters used by blastp. Now click on "BLAST!". This may take a few minutes to respond with an answer. Which hits are significant and why? The first hit on the results page corresponds to the original sequence itself. Pick one of the other hits and study the pairwise alignment. Read the BLAST tutorial pages and find answers to the following questions. There is no need to write down your answers. This is merely for your benefit. Find the definitions of the following terms: Score (bits), E-Value, Identities, Positives, and Gaps. Figure out how Score and E-Values are computed (see Mount's book). Figure out how the E-Value is different from the P-Value of an alignment.
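As a starting point for the last two questions, the standard Karlin-Altschul relations are E = m * n * 2^(-S'), where S' is the bit score and m and n are the query and database lengths, and P = 1 - exp(-E). The sketch below evaluates both. The function names and the numbers are made up for illustration, and real BLAST output uses edge-corrected "effective" lengths, so values on the results page will not match this simple calculation exactly.

```python
import math

def expected_hits(bit_score, query_len, db_len):
    """E-value: expected number of chance hits scoring at least bit_score."""
    return query_len * db_len * 2.0 ** (-bit_score)

def p_value(e_value):
    """P-value: probability of at least one chance hit, P = 1 - exp(-E)."""
    return -math.expm1(-e_value)  # expm1 keeps precision when E is tiny

# Made-up numbers: a 32-residue query against a 32-residue "database".
e = expected_hits(10, 32, 32)
print(e, p_value(e))
# For small E the two values nearly coincide.
print(p_value(1e-6))
```

The E-value can exceed 1 because it is a count, while the P-value is a probability and stays below 1.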

Now scroll down to the bottom of the results page. You will see a summary of your search. Figure out what the information down there means.

Q7 (do not submit): Repeat questions 2 through 4 for the human protein "insulin". Also study some of the interesting SNPs. This is just an exercise and is not for submission.

Q8: Run the Needleman-Wunsch global sequence alignment algorithm on the sequences "VEPPLSQETFSDLWKLLPENNVLSPL" and "MDPPLSQETFEDLWSLLPDPL" using the BLOSUM62 substitution matrix, with gap open and gap extension penalties of 11 and 1 respectively. Write down the optimal alignment and the alignment score. (Material to be covered in class on Monday.)
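To see the mechanics of the algorithm before running it with the settings in Q8, here is a minimal sketch of Needleman-Wunsch with a deliberately simplified scoring scheme: +1 for a match, -1 for a mismatch, and a linear gap penalty of -1 per gap position. It does not use BLOSUM62 or separate gap open and extension penalties, so it will not reproduce the answer to Q8. The function names are hypothetical.

```python
def simple_score(x, y):
    """Toy substitution score (a stand-in for a matrix such as BLOSUM62)."""
    return 1 if x == y else -1

def needleman_wunsch(a, b, score=simple_score, gap=-1):
    """Global alignment with a linear gap penalty.

    Returns (optimal score, aligned a, aligned b).
    """
    n, m = len(a), len(b)
    # F[i][j] = best score for aligning a[:i] with b[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + score(a[i - 1], b[j - 1]),
                          F[i - 1][j] + gap,    # a[i-1] aligned to a gap
                          F[i][j - 1] + gap)    # b[j-1] aligned to a gap
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score(a[i - 1], b[j - 1]):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return F[n][m], "".join(reversed(top)), "".join(reversed(bottom))

best, row1, row2 = needleman_wunsch("ACGT", "AGT")
print(best)
print(row1)
print(row2)
```

Moving to the Q8 settings would mean replacing `simple_score` with a BLOSUM62 lookup and tracking gap opening separately from gap extension, which requires extra matrices beyond the single table used here.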