Problem Set (some questions were modified from http://www.ncbi.nlm.nih.gov/Class/FieldGuide/problem_set.html)
1. The Entrez Nucleotides database is a collection of sequences from several sources, including GenBank, RefSeq, and PDB. The number of bases grows at an exponential rate. As of April 2004, it held more than 38,989,342,565 bases.
(figure 1)
Use Entrez Nucleotides to retrieve the finished record AC009453 from the Human Genome Project (figure 1). Try different display formats (ASN.1, FASTA, XML, etc.) to see what they look like. Answer the following questions:
What is the sequence? How many times has it been updated since it first appeared? (Use “Check sequence revision history” in the left-side navigation menu and trace the history all the way back to the first version.) When did this record first appear? How many bases were in the first version? How many are in the current version?
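The same record can also be requested from a script. The sketch below is a minimal example that assumes NCBI's E-utilities `efetch` service, which this handout does not otherwise use; the helper names `efetch_url` and `count_bases` are made up for illustration. It only builds the request URLs and counts bases in FASTA text, so it runs without network access.

```python
from urllib.parse import urlencode

# E-utilities endpoint (an assumption: the handout itself uses the web interface).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, rettype="fasta", db="nuccore"):
    """Build an efetch URL for one sequence record in the given format."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def count_bases(fasta_text):
    """Count sequence characters in FASTA text, skipping '>' header lines."""
    return sum(
        len(line.strip())
        for line in fasta_text.splitlines()
        if line and not line.startswith(">")
    )


# Print the URLs for two formats; fetching them requires network access.
for fmt in ("fasta", "gb"):
    print(efetch_url("AC009453", fmt))
```

Passing the downloaded FASTA text to `count_bases` gives the base count for the current version. A versioned accession such as AC009453.1 should be retrievable the same way, assuming the service still serves superseded versions.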
2. The protein entries in the Entrez search and retrieval system have been compiled from a variety of sources, including SwissProt, PIR, PRF, PDB, and translations from annotated coding regions in GenBank and RefSeq.
BLink ("BLAST Link") displays the results of BLAST searches that have been done for every protein sequence in the Entrez Proteins data domain.
(figure 2)
Best Hits is a display option that shows the top hit to each organism represented in the BLAST results. The Best Hits display also shows the number of hits to each organism. Click on the number to see only the hits to that organism.
(figure 3)
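The Best Hits summary described above amounts to grouping hits by organism, keeping the top-scoring hit in each group, and counting the group size. The sketch below illustrates that idea only; it is not how BLink is implemented, and the hit names and scores are made-up placeholders rather than real BLAST results.

```python
from collections import defaultdict


def best_hits(hits):
    """Map each organism to (top-scoring hit, number of hits).

    `hits` is an iterable of (organism, hit_name, score) tuples.
    """
    grouped = defaultdict(list)
    for organism, hit_name, score in hits:
        grouped[organism].append((score, hit_name))
    # max() compares scores first, so it picks the top hit per organism.
    return {org: (max(rows)[1], len(rows)) for org, rows in grouped.items()}


# Placeholder data for illustration only.
example = [
    ("Homo sapiens", "hitA", 950),
    ("Homo sapiens", "hitB", 400),
    ("Fundulus heteroclitus", "hitC", 700),
]
print(best_hits(example))
```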
Defects in CFTR are the cause of cystic fibrosis (CF). CF is the most common genetic disease in the Caucasian population, with a prevalence of about 1 in 2000 live births. Retrieve the SWISS-PROT record for the human CFTR (cystic fibrosis) protein by searching for CFTR_HUMAN in Proteins from the search box on the NCBI home page (figure 2). View the record and look at the extensive annotations.
How many primary database records are linked to this record?
How many literature citations are linked?
What is the function of the protein?
Use the FEATURES table to find the nature and location of the most common mutation in this gene in cystic fibrosis.
Now use the BLink link to retrieve related proteins. Click the Best Hits button and find the related protein from the fish Fundulus heteroclitus (figure 3). Follow the PubMed link from this record to read about the biology of this protein. What is the physiological role of this CFTR homologue in this animal?
3. Proteins often contain several modules or domains, each with a distinct evolutionary origin and function. The Conserved Domain Database (CDD) and Search Service may be used to identify the conserved domains present in a protein sequence. CDD currently contains domains derived from two popular collections, Smart and Pfam, plus contributions from colleagues at NCBI, such as COG. The source databases also provide descriptions and links to citations. Since conserved domains correspond to compact structural units, CDs contain links to 3D-structure via Cn3D whenever possible.
COGs (Clusters of Orthologous Groups) form a natural system of gene families from complete genomes. COGs were delineated by comparing protein sequences encoded in 43 complete genomes, representing 30 major phylogenetic lineages. Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain.

(figure 4)
Find the genomic scaffold AE003584 from Drosophila melanogaster using Entrez Nucleotide. Click “Links” at the upper right, then choose “protein” to see the predicted proteins for this scaffold (figure 4). (You will need to increase the number of records displayed to see all of the proteins on one page. Then use the browser's "Find in page" function to find the protein that you want.) Identify the conserved domains present in predicted protein AAF51293 by clicking on “domains”. This conserved domain suggests a potential function for this hypothetical protein. Now perform a search against the Prosite patterns using the ScanProsite tool (http://ca.expasy.org/tools/scanprosite/) at ExPASy. Also try InterPro (http://www.ebi.ac.uk/InterProScan/).
4. Use Entrez Nucleotide to find the full-length cDNA (mRNA) sequence for Plasmodium falciparum glyceraldehyde 3-phosphate dehydrogenase (GAPD). This time, start by typing Plasmodium in the search box without limiting the search to any field. How many records do you retrieve? Browse through your results to find some records that are not from Plasmodium. Display a few of these to see why you retrieved them; you should find "Plasmodium" somewhere in the record. Now use the Limits tab to restrict the search to Plasmodium in the Organism field [Organism] (figure 5). How many nucleotide records in Entrez are from Plasmodium? Now find GAPD records by using the Preview/Index tab to add glyceraldehyde 3-phosphate dehydrogenase as a [Title] Word. How many records did you retrieve?

(figure 5)
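The Limits and Preview/Index steps in question 4 build a fielded query, which can also be written by hand with field tags in square brackets. The sketch below assembles such a query string and wraps it in a search URL. It assumes the E-utilities `esearch` service, which the handout does not use, and the helper names are hypothetical. The exact string that Entrez itself generates for these steps may differ.

```python
from urllib.parse import urlencode

# E-utilities search endpoint (an assumption: the handout uses the web forms).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def entrez_term(*clauses):
    """Join (text, field) pairs with AND; a field of None searches all fields."""
    parts = []
    for text, field in clauses:
        parts.append(f"{text}[{field}]" if field else text)
    return " AND ".join(parts)


def esearch_url(term, db="nuccore"):
    """Build an esearch URL; the XML reply reports the hit total in <Count>."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


unrestricted = entrez_term(("Plasmodium", None))
by_organism = entrez_term(("Plasmodium", "Organism"))
gapd = entrez_term(
    ("Plasmodium", "Organism"),
    ("glyceraldehyde 3-phosphate dehydrogenase", "Title"),
)
for term in (unrestricted, by_organism, gapd):
    print(term)
    print(esearch_url(term))
```

Comparing the counts for the three terms mirrors the progression in the question: all fields, organism only, then organism plus title word.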
5. Inositol polyphosphate phosphatases contain conserved acidic residues involved in binding metal ions. Retrieve the human INPP1 protein (INPP_HUMAN) from Entrez Proteins. Follow the "Domains" link to display pre-computed CDD search results. Click on the "Details" button to display the complete results. Follow the link to the pfam inositol_P domain and display the domain in Cn3D by clicking on the "View 3D Structure" button (figure 6). Identify the conserved residues surrounding the magnesium ions by double-clicking on them in the structure. The corresponding residues will be highlighted in the sequence alignment. You can annotate the side chains of these residues if you like. First change the setting on the CDD page from "Virtual Bonds" to "All Atoms", then display the structure. You can then use the Style->Edit Global Style menu to turn off side chains and the Style->Annotate menu to selectively turn on the side chains for the amino acids that coordinate the magnesium ions (figure 7).
(figure 6)
(figure 7)
6. Suppose you have the following protein sequences and you think they may share a motif. Copy and paste the sequences into PRATT (http://ca.expasy.org/tools/pratt/). Keep “Directly submit best pattern to ScanProsite” checked and then click “submit Query” (figure 8).
>seq 1
MSTTSTPTATTAAFTDCHVRDLSLAAWGRKEMVIAETEMPGLMAIREEYAASQPLKGARIAGSLHMTIQT
AMLIETLTALGAEVRWASCNIFSTQDHAAAAIAAAGIPVFAYKGESLEEYWEFTHRIFEWHDGGTPNMIL
DDGGDATLLLHLGSDAEKDPSVVANPTCEEEQFLFAAIKKRLAEKPEWYSKTAAAIKGVTEETTTGVHRL
YQMHEKGRLKFPAINVNDSVTKSKFDNIYGCRESLVDGIKRATDVMVAGKVAVICGYGEVGKGCAQAMRG
LQAQVWVTEIDPICALQAAMEGYKVVTMEWAADKADIFVTTTGNINVITHDHMKAMKHNAIVCNIGHFDN
EIEVAALKQYQWENIKPQVDHIIFPDGKRIILLAEGRLVNLGCATGHPSYVMSSSFANQTLAQMELFCNP
GKYPVGVYMLPKELDEKVARLQLKTLGAMLTELTEEQAAYIGVPKAGPYKTDHYRY
>seq 2
MSAPAHKFKVADLSLAAFGRKEIELAENEMPGLMATRKKYAADQPLKGARIAGCLHMTIQTAVLIETLTA
LGAEVTWSSCNIFSTQDHAAAAIAAAGVPVFAWKGETEEEYQWCLEQQLIAFKDNKKLNLILDDGGDLTH
LVHTKYPEMLEDCFGVSEETTTGVHHLYRMLKEGKLLVPAINVNDSVTKSKFDNLYGCRESLVDGIKRAT
DVMIAGKIAVVAGFGDVGKGCAMALSGMGARVIVTEVDPINALQAAMAGYQVTTMEKAAPLGQIFVTTTG
CRDILVGKHFEVMPNDAIVCNIGHFDVEIDVAWLKANAASVQNIKPQVDRFLMKNGRHIILLAEGRLVNL
GCATGHSSFVMSCSFTNQVLAQIMLYKANDEAFSNKYVEFGKSGKLEKKVYVLPKILDEEVARLHLDHCN
VELTQLSDVQAEYLGLATEGPYKSDHYRY
>seq 3
MSSKPAFKVADLTLAEWGRKEIIIAEQEMPGLMAIRKKYGPQKILKGARIAGCLHMTVQTAVLIETLVEL
GAEVYGRPLNLILDDGGDLTNIVHDKFPKYLKECRGVSEETTTGVHNLYRMMKEGTLKVPAINVNDSVTK
SKFDNLYGCRESLIDGIKRATDIMIAGKVCVVAGYGDVGKGCAQSLKAFGGRVIITEIDPINALQAAMEG
YEVTTMDEASRKGHIYVTSTGCKNIITSNHFLNMPEDAIVCNIGHFDCEIDVAWLKQNAIEKVNIKPQVD
RYQLKNKRHIIVLADGRLVNLGCATGHSSFVMSNSFTNQVLAQIELWTKSKSYPIGVHMLPKKLDEEVAS
LHLDHLGVKLTKLTEDQAKYIGVPKEGPYKADHYRY

(figure 8)
The best pattern found by PRATT will be automatically copied to ScanProsite (figure 9). Select the protein databases you want to search (the default is Swiss-Prot) and click the “START THE SCAN” button. Check the results page and see what you find.

(figure 9)
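To see what scanning with a pattern involves, the sketch below translates a PROSITE-style pattern into a Python regular expression and searches sequences with it. It is a simplified illustration, not ScanProsite's implementation. It handles only the common syntax: `x` for any residue, `[..]` for allowed residues, `{..}` for forbidden residues, `(n)` or `(n,m)` repeats, and `<` / `>` anchors. The pattern used here was picked by eye from a stretch that the sequences in question 6 visibly share, so it is not PRATT output, and the two test strings are short fragments of seq 1 and seq 2.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")
    pieces = []
    for element in pattern.split("-"):
        at_start = element.startswith("<")   # pattern anchored at N-terminus
        at_end = element.endswith(">")       # pattern anchored at C-terminus
        element = element.strip("<>")
        # Separate the residue part from an optional repeat such as (3) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":
            core = "."                         # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"     # any residue except these
        piece = core + ("{" + repeat + "}" if repeat else "")
        pieces.append(("^" if at_start else "") + piece + ("$" if at_end else ""))
    return "".join(pieces)


def scan(pattern, sequences):
    """Return {name: [(1-based start, matched text), ...]} for each sequence."""
    regex = re.compile(prosite_to_regex(pattern))
    return {
        name: [(m.start() + 1, m.group()) for m in regex.finditer(seq)]
        for name, seq in sequences.items()
    }


# Hand-picked illustrative pattern and short fragments of seq 1 and seq 2.
pattern = "N-V-N-D-S-x(3)-S-K-F-D-N-[IL]-Y"
fragments = {
    "seq 1": "FPAINVNDSVTKSKFDNIYGCRESL",
    "seq 2": "VPAINVNDSVTKSKFDNLYGCRESL",
}
print(prosite_to_regex(pattern))
print(scan(pattern, fragments))
```

Pasting the full sequences from question 6 into `fragments` and substituting the pattern that PRATT reports gives a local check of where that pattern matches.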
7. Go to KU-Tracker (http://www.bioinformatics.ku.edu/tracker/) and register as a user. Click “Add Search”, select “Protein” from the “Search Type” menu, then click “next”. You can select “BlastP”, “HMMER”, or both; for this exercise, select both. Copy and paste in one of the sequences and type something in “description”. Then click “Domain OR Sequence” or “Domain AND Sequence” (figure 10). Choose Domain OR Sequence to track the resulting domain and any matching sequences (returns all results). Choose Domain AND Sequence to track only those sequences that are similar to the entered sequence and also fall within the chosen domain (eliminates sequences that do not lie in the chosen domain). The search will take a few minutes. The result is a list of possible domains that your sequence matches (figure 11). Select one or more domains and click “combine OR” or “combine AND”. If you would like to monitor results for the entered sequence only if they are also results for any one of the checked domains, press the OR button (Union). If you want only results that are shared by the sequence and also each checked domain, press the AND button (Intersection). If you are only interested in one domain, choose either button. Your search has now been saved, and you will receive results once every month.

(figure 10)

(figure 11)
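The OR and AND buttons described in question 7 correspond to set union and set intersection. The sketch below is one reading of that description using Python sets; how KU-Tracker implements it internally is not stated in this handout, and the identifiers `p1`, `p2`, and so on are placeholders.

```python
def combine_or(sequence_hits, domain_hits):
    """Sequence hits that are also hits for ANY checked domain (union)."""
    any_domain = set().union(*domain_hits)
    return set(sequence_hits) & any_domain


def combine_and(sequence_hits, domain_hits):
    """Sequence hits that are also hits for EVERY checked domain (intersection)."""
    result = set(sequence_hits)
    for hits in domain_hits:
        result &= set(hits)
    return result


# Placeholder identifiers for illustration only.
sequence_hits = {"p1", "p2", "p3", "p4"}
domain_a = {"p1", "p2", "p9"}
domain_b = {"p2", "p3"}

print(sorted(combine_or(sequence_hits, [domain_a, domain_b])))   # ['p1', 'p2', 'p3']
print(sorted(combine_and(sequence_hits, [domain_a, domain_b])))  # ['p2']
```

With a single checked domain, both functions return the same set, which matches the note that either button can be used in that case.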
8. You can also keep track of the literature via KU-Tracker. Use PubMed to construct an advanced query and click “Detail” to display the query. Copy and paste the query into KU-Tracker, and it will run the search for you periodically.

(figure 12)
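A periodic literature check can be approximated by rerunning a saved PubMed query restricted to recently added records. The sketch below is an illustration only and does not describe how KU-Tracker works. It assumes the E-utilities `esearch` service and its `reldate` and `datetype` parameters, and the query string is an arbitrary example.

```python
from urllib.parse import urlencode

# E-utilities search endpoint (an assumption: the handout uses KU-Tracker instead).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_recent_url(query, days=30):
    """Build a PubMed search URL limited to records added in the last `days` days."""
    params = {
        "db": "pubmed",
        "term": query,
        "reldate": days,     # look back this many days
        "datetype": "edat",  # by the date the record entered PubMed
    }
    return ESEARCH + "?" + urlencode(params)


# Arbitrary example query; paste in the query text that PubMed displays.
print(pubmed_recent_url("cystic fibrosis[Title] AND CFTR"))
```

Running this once a month with `days=30` would roughly reproduce the monthly update schedule mentioned in question 7.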