Problem Set (some questions were modified from http://www.ncbi.nlm.nih.gov/Class/FieldGuide/problem_set.html)

 

  1.  

The Entrez Nucleotide database is a collection of sequences from several sources, including GenBank, RefSeq, and PDB. The number of bases grows exponentially. As of April 2004, the database contained more than 38.9 billion bases (38,989,342,565).

 

                              (figure 1)

Use Entrez Nucleotide to retrieve the finished record AC009453 from the human genome project (figure 1).  Try different display formats (ASN.1, FASTA, XML, etc.) to see what each one looks like.  Answer the following questions:

What is the sequence?  How many times has it been updated since it first appeared?  Use “Check sequence revision history” in the left-side navigation menu to trace the history all the way back to the first version.  When did this record first appear?  How many bases were in the first version?  How many are in the current version?
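The same record can also be requested programmatically. The following is a minimal sketch, assuming NCBI's E-utilities EFetch service and its nuccore database name; the helper names efetch_url and count_bases are hypothetical and not part of the original exercise. It only builds the request URL and counts bases in FASTA text, without contacting the server.

```python
from urllib.parse import urlencode

# Assumed base URL of the NCBI E-utilities EFetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, rettype="fasta", db="nuccore"):
    """Build an EFetch URL for one record (no network access happens here)."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def count_bases(fasta_text):
    """Count sequence characters in FASTA text, skipping '>' header lines."""
    return sum(
        len(line.strip())
        for line in fasta_text.splitlines()
        if not line.startswith(">")
    )


if __name__ == "__main__":
    # Current version in FASTA format, and the GenBank flat-file format.
    print(efetch_url("AC009453"))
    print(efetch_url("AC009453", rettype="gb"))
```

Downloading the FASTA text for two versions of the record and passing each to count_bases is one way to check the base counts you read off the revision history page.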

 

 

  2.  

The protein entries in the Entrez search and retrieval system have been compiled from a variety of sources, including Swiss-Prot, PIR, PRF, PDB, and translations from annotated coding regions in GenBank and RefSeq.

BLink ("BLAST Link") displays the results of BLAST searches that have been done for every protein sequence in the Entrez Proteins data domain.

                        (figure 2)

Best Hits is a display option that shows the top hit to each organism represented in the BLAST results. The Best Hits display also shows the number of hits to each organism. Click on the number to see only the hits to that organism.

                  (figure 3)

Defects in CFTR are the cause of cystic fibrosis (CF).  CF is the most common genetic disease in the Caucasian population, with a prevalence of about 1 in 2000 live births.  Retrieve the Swiss-Prot record for the human CFTR (cystic fibrosis) protein by entering CFTR_HUMAN in the search box on the NCBI home page with the Protein database selected (figure 2).  View the record and look at the extensive annotations.

How many primary database records are linked to this record?

How many literature citations are linked?

What is the function of the protein? 

Use the FEATURES table to find the nature and location of the most common mutation in this gene in cystic fibrosis. 

Now use the BLink link to retrieve related proteins.  Click the Best Hits button and find the related protein from the fish Fundulus heteroclitus (figure 3).  Follow the PubMed link from this record to read about the biology of this protein. What is the physiological role of this CFTR homologue in this animal?

 

3.       

Proteins often contain several modules or domains, each with a distinct evolutionary origin and function. The Conserved Domain Database (CDD) and Search Service may be used to identify the conserved domains present in a protein sequence.  CDD currently contains domains derived from two popular collections, SMART and Pfam, plus contributions from colleagues at NCBI, such as COG. The source databases also provide descriptions and links to citations. Since conserved domains correspond to compact structural units, conserved domain (CD) records link to 3D structures via Cn3D whenever possible.

COGs (Clusters of Orthologous Groups) form a natural system of gene families from complete genomes.  They were delineated by comparing protein sequences encoded in 43 complete genomes, representing 30 major phylogenetic lineages.  Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain.

(figure 4)

Find the genomic scaffold AE003584 from Drosophila melanogaster using Entrez Nucleotide. Click “Links” in the upper right, then choose “Protein” to see the predicted proteins for this scaffold (figure 4). (You will need to increase the number of records displayed to see all of the proteins on one page. Then use the browser's "Find in page" function to find the protein that you want.) Identify the conserved domains present in predicted protein AAF51293 by clicking on “domains”. The conserved domain suggests a potential function for this hypothetical protein. Now search the protein against the PROSITE patterns using the ScanProsite tool (http://ca.expasy.org/tools/scanprosite/) at ExPASy.  Also try InterPro (http://www.ebi.ac.uk/InterProScan/).

 

4.      Use Entrez Nucleotide to find the full-length cDNA (mRNA) sequence for Plasmodium falciparum glyceraldehyde 3-phosphate dehydrogenase (GAPD). This time start by typing Plasmodium in the search box without limiting to any field. How many records do you retrieve? Browse through your results to find some records that are not from Plasmodium. Display a few of these to see why you retrieved them; you should find "Plasmodium" somewhere on the record. Now use the Limits tab to restrict to Plasmodium in the Organism field [Organism] (figure 5). How many nucleotide records in Entrez are from Plasmodium? Now find GAPD records by using the Preview/Index tab to add glyceraldehyde 3-phosphate dehydrogenase as a [Title] Word. How many records did you retrieve?

                  (figure 5)
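The field-limited search built with the Limits and Preview/Index tabs corresponds to a single query string. The following is a minimal sketch of how that string is assembled, assuming the Entrez convention of writing a field tag in square brackets after a term and the E-utilities ESearch endpoint; the helper names fielded, and_query and esearch_url are hypothetical. No request is sent.

```python
from urllib.parse import urlencode

# Assumed base URL of the NCBI E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def fielded(term, field=None):
    """Attach an Entrez field tag, e.g. Plasmodium[Organism]."""
    return f"{term}[{field}]" if field else term


def and_query(*clauses):
    """Join clauses with the Boolean operator AND."""
    return " AND ".join(clauses)


def esearch_url(term, db="nuccore"):
    """Build an ESearch URL for a query (no network access happens here)."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


if __name__ == "__main__":
    unrestricted = fielded("Plasmodium")
    by_organism = fielded("Plasmodium", "Organism")
    gapd = and_query(
        by_organism,
        fielded("glyceraldehyde 3-phosphate dehydrogenase", "Title"),
    )
    for query in (unrestricted, by_organism, gapd):
        print(query)
        print(esearch_url(query))
```

The three queries mirror the three steps of the exercise: the unrestricted search, the search limited to the Organism field, and the search further narrowed by a Title word.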

5.      Inositol polyphosphate phosphatases contain conserved acidic residues involved in binding metal ions. Retrieve the human INPP1 protein (INPP_HUMAN) from Entrez Protein. Follow the "Domains" link to display pre-computed CDD search results. Click on the "Details" button to display the complete results. Follow the link to the Pfam inositol_P domain and display the domain in Cn3D by clicking on the "View 3D Structure" button (figure 6). Identify the conserved residues surrounding the magnesium ions by double-clicking on them in the structure. The corresponding residues will be highlighted in the sequence alignment. You can annotate the side chains of these residues if you like. First, change the setting on the CDD page from "Virtual Bonds" to "All Atoms", then display the structure. You can then use the Style->Edit Global Style menu to turn off side chains and the Style->Annotate menu to selectively turn on the side chains for the amino acids that coordinate the magnesium ions (figure 7).

(figure 6)

 

                                (figure 7)

6.      Suppose you have the following protein sequences and you think they may share a motif.  Copy and paste the sequences into PRATT (http://ca.expasy.org/tools/pratt/).  Keep “Directly submit best pattern to ScanProsite” checked and then click “submit Query” (figure 8). 

 

>seq 1

MSTTSTPTATTAAFTDCHVRDLSLAAWGRKEMVIAETEMPGLMAIREEYAASQPLKGARIAGSLHMTIQT

AMLIETLTALGAEVRWASCNIFSTQDHAAAAIAAAGIPVFAYKGESLEEYWEFTHRIFEWHDGGTPNMIL

DDGGDATLLLHLGSDAEKDPSVVANPTCEEEQFLFAAIKKRLAEKPEWYSKTAAAIKGVTEETTTGVHRL

YQMHEKGRLKFPAINVNDSVTKSKFDNIYGCRESLVDGIKRATDVMVAGKVAVICGYGEVGKGCAQAMRG

LQAQVWVTEIDPICALQAAMEGYKVVTMEWAADKADIFVTTTGNINVITHDHMKAMKHNAIVCNIGHFDN

EIEVAALKQYQWENIKPQVDHIIFPDGKRIILLAEGRLVNLGCATGHPSYVMSSSFANQTLAQMELFCNP

GKYPVGVYMLPKELDEKVARLQLKTLGAMLTELTEEQAAYIGVPKAGPYKTDHYRY

>seq 2

MSAPAHKFKVADLSLAAFGRKEIELAENEMPGLMATRKKYAADQPLKGARIAGCLHMTIQTAVLIETLTA

LGAEVTWSSCNIFSTQDHAAAAIAAAGVPVFAWKGETEEEYQWCLEQQLIAFKDNKKLNLILDDGGDLTH

LVHTKYPEMLEDCFGVSEETTTGVHHLYRMLKEGKLLVPAINVNDSVTKSKFDNLYGCRESLVDGIKRAT

DVMIAGKIAVVAGFGDVGKGCAMALSGMGARVIVTEVDPINALQAAMAGYQVTTMEKAAPLGQIFVTTTG

CRDILVGKHFEVMPNDAIVCNIGHFDVEIDVAWLKANAASVQNIKPQVDRFLMKNGRHIILLAEGRLVNL

GCATGHSSFVMSCSFTNQVLAQIMLYKANDEAFSNKYVEFGKSGKLEKKVYVLPKILDEEVARLHLDHCN

VELTQLSDVQAEYLGLATEGPYKSDHYRY

>seq 3

MSSKPAFKVADLTLAEWGRKEIIIAEQEMPGLMAIRKKYGPQKILKGARIAGCLHMTVQTAVLIETLVEL

GAEVYGRPLNLILDDGGDLTNIVHDKFPKYLKECRGVSEETTTGVHNLYRMMKEGTLKVPAINVNDSVTK

SKFDNLYGCRESLIDGIKRATDIMIAGKVCVVAGYGDVGKGCAQSLKAFGGRVIITEIDPINALQAAMEG

YEVTTMDEASRKGHIYVTSTGCKNIITSNHFLNMPEDAIVCNIGHFDCEIDVAWLKQNAIEKVNIKPQVD

RYQLKNKRHIIVLADGRLVNLGCATGHSSFVMSNSFTNQVLAQIELWTKSKSYPIGVHMLPKKLDEEVAS

LHLDHLGVKLTKLTEDQAKYIGVPKEGPYKADHYRY 

 

(figure 8)

The best pattern found by PRATT will be automatically copied to ScanProsite (figure 9).  Select the protein databases you want to search (the default is Swiss-Prot) and click the “START THE SCAN” button.  Check the result page and see what you find.      

                        (figure 9)
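To see what scanning with a pattern involves, the following minimal sketch translates a PROSITE-style pattern into a Python regular expression and reports where it matches in FASTA-formatted sequences. It handles only the basic syntax (x, [..], {..}, repeat counts, and the < and > anchors). The pattern L-D-D-G-G-D-x-T is a hand-picked illustration based on a stretch visible in all three sequences above; it is not the pattern PRATT reports. The function names and the short excerpts in the demo are likewise assumptions made for this example.

```python
import re


def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Separate the residue part from an optional count: x(2) or x(2,4).
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core.lower() == "x":          # any residue
            core = "."
        elif core.startswith("{"):       # any residue except those listed
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def read_fasta(text):
    """Parse FASTA text into {name: sequence}, ignoring blank lines."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def scan(pattern, records):
    """Return 1-based start positions of each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return {
        name: [m.start() + 1 for m in regex.finditer(seq)]
        for name, seq in records.items()
    }


if __name__ == "__main__":
    # Short excerpts of the three sequences above, for illustration only.
    excerpts = """>seq 1
GTPNMIL

DDGGDATLLLHLG
>seq 2
KKLNLILDDGGDLTH
>seq 3
RPLNLILDDGGDLTN
"""
    print(prosite_to_regex("L-D-D-G-G-D-x-T"))
    print(scan("L-D-D-G-G-D-x-T", read_fasta(excerpts)))
```

ScanProsite does considerably more than this, such as searching whole databases and handling the full pattern syntax, so treat the sketch only as a way to read the pattern PRATT returns.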

7.      Please go to KU-Tracker (http://www.bioinformatics.ku.edu/tracker/) and register as a user.

Click “Add Search”, select “Protein” from the “Search Type” menu, then click “next”.  You can select “BlastP”, “HMMER”, or both.  Let us select both options.  Copy and paste in one of the sequences above.  Type something in “description”.  Click “Domain OR Sequence” or “Domain AND Sequence” (figure 10).  Choose Domain OR Sequence to track the resulting domain and any matching sequences (returns all results). Choose Domain AND Sequence to track only those sequences that are similar to the entered sequence and also fall within the chosen domain (eliminates sequences that do not lie in the chosen domain).

The search will take a few minutes.  The result is a list of possible domains that your sequence matches (figure 11).  Select one or more domains and click “combine OR” or “combine AND”.  If you would like to monitor results for the entered sequence only if they are also results for any one of the checked domains, press the OR button (Union). If you want only results that are shared by the sequence and also each checked domain, press the AND button (Intersection). If you are only interested in one domain, choose either button.

Now your search has been saved and you will get results once every month.   

(figure 10)

            (figure 11)
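The difference between the two combine buttons can be expressed with set operations. The following is a minimal sketch based only on the description above, not on how KU-Tracker is actually implemented; the function names and the protein and domain identifiers are hypothetical.

```python
def combine_or(sequence_hits, domain_hits):
    """Keep sequence hits that also belong to ANY checked domain."""
    any_domain = set().union(*domain_hits.values())
    return set(sequence_hits) & any_domain


def combine_and(sequence_hits, domain_hits):
    """Keep sequence hits that belong to EVERY checked domain."""
    result = set(sequence_hits)
    for hits in domain_hits.values():
        result &= set(hits)
    return result


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    sequence_hits = {"P1", "P2", "P3", "P4"}
    domain_hits = {
        "domain_A": {"P1", "P2", "P9"},
        "domain_B": {"P2", "P3"},
    }
    print(sorted(combine_or(sequence_hits, domain_hits)))
    print(sorted(combine_and(sequence_hits, domain_hits)))
```

With a single checked domain the two functions return the same set, which matches the remark that either button can be used when only one domain is of interest.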

8.      You can also keep track of the literature via KU-Tracker.  Use PubMed to construct an advanced query and click “Details” to display the query.  Copy and paste the query into KU-Tracker, and it will run the search for you periodically.   

         (figure 12)