Thursday, April 19, 2007

Bioinformatics Practical - guidelines

This practical begins with an "unknown" fragment of DNA, such as a molecular biologist might have deduced from a gel. You are invited to discover the protein family to which your fragment corresponds, and its biological significance, by searching the primary and secondary sequence databases, and examining the fold classification systems.
You may, of course, run through the exercise with a sequence of your own.
Within the text, highlighted phrases link to pages and pictures that explain the practical in more detail. Pictures are marked by blue blobs (and more important ones by red blobs). If overseas connections get stuck in traffic, use the Stop button to break the connection and try again, or move to a local facility (page locations are echoed in the bottom border). To kill unwanted images, double-click on the button in the top left-hand corner. For further background to the practical, use info.

Sequence translation & identification
From now on, it is necessary to work with several windows. Retrieve your DNA sequence from the Materials frame & paste into the input box of the Translator. To discover the correct translation, click on ...go? in the fragment window & search the OWL composite database; repeat this process until you find a hit.

Commentary
The aim of the initial part of the analysis is to translate an "unknown" fragment of DNA and to identify the correct reading frame. The quickest approach is to perform rapid identity searches of a composite database, using each of the translations (see info for further information on composite databases). Clearly, some translations will contain stop codons (denoted by !), so can be ignored. Others can be pasted into the database search form, which will return exact matches to your peptide in seconds.
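The six-frame translation described above can be sketched in a few lines of Python. This is a minimal illustration, not the Translator's own code: it assumes the standard genetic code, writes stop codons as "!" to match the convention used here, and the function names (`six_frames`, `open_frames`) are invented for this example.

```python
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
# Stop codons are written as "!" to follow the convention of this practical.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY!!CC!WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Map frame labels (+1..+3 forward, -1..-3 reverse) to their translations."""
    dna = dna.upper()
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


def open_frames(dna):
    """Keep only the frames whose translation contains no stop codon."""
    return {label: pep for label, pep in six_frames(dna).items() if "!" not in pep}
```

For the made-up input ATGGCCTAA, frame +1 translates to MA! and would therefore be discarded by `open_frames`, leaving the stop-free frames as candidates for the identity search.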
Please note, output from this and all subsequent database searches will be sent to different (left- and right-hand) pages, so do not fill your screen with the browser, otherwise your results will be obscured.
A "real" example is unlikely to have returned a match at this point, so the analysis would have to be taken further. This means performing an exhaustive similarity search of a composite database, which is the next stage of the practical.

Primary database searches
You should now know the identity of your fragment. To obtain its full sequence, paste a code from your search results into the database Query form. To discover if your sequence has relatives, paste part of the sequence (~60 residues) into the PSI-BLAST input box (edit out non-amino acid characters).

Commentary
The aim here is to retrieve a full sequence from the database, of which your fragment is a part (you may need a reminder of the single-letter amino acid code). The idea is to perform an exhaustive similarity search of the composite database(s) in order to discover any homologues. The tool used here is BLAST (see info for further information).
It is usual to submit a full sequence to BLAST, but for now, to reduce the load on the server, a portion of the sequence should suffice: e.g. ~60 residues should take ~30 seconds to run. If the server is busy, please be patient. When the search is complete, note the high-scoring matches. Note that US connections from the UK may be slow. If so, other BLAST services are available, e.g. for OWL, SWISS-PROT and TrEMBL. (In some BLAST forms, remember to switch the program to blastp before proceeding.)
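Preparing the ~60-residue query by hand is error-prone, so the clean-up step can be sketched as follows. This is only an illustration under stated assumptions: it keeps just the 20 standard single-letter codes (so ambiguity codes such as X, B and Z are discarded too), it skips FASTA-style header lines beginning with ">", and the function names are hypothetical.

```python
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(raw):
    """Drop header lines, digits, spaces and any other non-amino-acid characters."""
    lines = [line for line in raw.splitlines() if not line.startswith(">")]
    return "".join(ch for ch in "".join(lines).upper() if ch in STANDARD_RESIDUES)


def query_portion(raw, length=60, start=0):
    """Return up to `length` cleaned residues, beginning at 0-based offset `start`."""
    return clean_sequence(raw)[start:start + length]
```

Pasting a numbered database entry through `query_portion` yields a plain string of residues ready for the search form; changing `start` lets you try a different stretch of the sequence.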

Exercise 1
Six-frame sequence translation
Having translated your fragment, which was the correct reading frame?
Fast sequence identity search
How many exact matches did this rapid OWL search identify? Did the result allow you to identify the precise sequence to which your fragment belongs? If not, why not? What, then, is your sequence, or, if you couldn't identify it exactly, what is its family?
Sequence similarity search
How many possible family members did the BLAST search of nrdb reveal? Does your protein belong to an extended family, or is the family quite small? Were there any borderline results that were difficult to diagnose? How did the result differ, if at all, from a BLAST of OWL, and from those of PSI-BLAST? Did you see any differences upon changing the scoring matrix?
If you didn't keep your results and can't remember the answers, repeat the relevant steps of the practical. If necessary, discuss your results with a demonstrator before going further.

Secondary database searches
To locate known sequence patterns, paste the ID code (or full sequence) into the ScanProsite form; repeat for Profiles, Pfam, eMOTIF & BLOCKS databases (to get your sequence again, supply the ID to the OWL search form, paste the result into the BLOCKS form & remove any non-sequence characters).

Commentary
The aim here is to learn about the function of your sequence by searching the secondary databases. We first search patterns in PROSITE - use the filter button in the search form, otherwise results are likely to be noisy. Judge the significance of any matches in the context of searches of the other databases.
We now examine sequence profiles in the Profile library. The form allows searches both of profiles currently released via PROSITE, and of pre-release entries that lack annotations. Next, we consider fuzzy regular expressions in the IDENTIFY resource. Results are output at different levels of stringency, allowing you to assess their significance; matches link back to PROSITE and PRINTS for their annotations.
Finally, we turn to the BLOCKS databases. In the output, scroll down to the Blksort Hits heading and note any good matches. Repeat the process for BLOCKS-format PRINTS - return to the form, switch the Select database button to Prints database, and search as before.
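To see why short patterns produce noisy results, it may help to scan a sequence with a pattern yourself. The sketch below converts PROSITE pattern syntax into a Python regular expression and reports every (possibly overlapping) match. It is a simplified illustration, not ScanProsite's implementation: the function names are invented here, and rarer syntax such as an end anchor placed inside square brackets is not handled.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE-syntax pattern into a Python regular expression string."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        anchored_start = element.startswith("<")  # "<" pins the N-terminus
        anchored_end = element.endswith(">")      # ">" pins the C-terminus
        element = element.strip("<>")
        # Split off an optional repeat count such as (3) or (2,4).
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":
            piece = "."                        # x means any residue
        elif core.startswith("{"):
            piece = "[^" + core[1:-1] + "]"    # {..} lists forbidden residues
        else:
            piece = core                       # a residue or [..] set is valid as-is
        if low is not None:
            piece += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if anchored_start else "") + piece + ("$" if anchored_end else ""))
    return "".join(parts)


def find_sites(pattern, sequence):
    """Return (1-based start, matched text) for every match, overlaps included."""
    lookahead = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in lookahead.finditer(sequence)]
```

The N-glycosylation pattern N-{P}-[ST]-{P} becomes `N[^P][ST][^P]`; because it is only four residues long, it matches many unrelated sequences by chance, which is the kind of frequently occurring pattern the filter is there to exclude.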

Exercise 2
Searching for patterns
How many matches did you retrieve from PROSITE? Did you remember to use the filter to exclude frequently occurring patterns? What difference does this make? Are the matches significant? What do the matched pattern(s) that you judge to be genuine tell you about the possible function(s) of your sequence?
Searching for profiles and HMMs
Were your searches of profiles and Pfam successful? If so, are the results consistent with the results of searching PROSITE? If you got no matches, why might this be so?
Searching for blocks
How many matches did you retrieve from BLOCKS? How many of these are significant (how do you judge significance)? How does the result compare with those from PROSITE, profiles and Pfam? Is there a difference between the locations of the matched patterns and blocks? If so, why? Is a consistent picture being built up by the different searches? Draw a diagram to indicate where the various matches occur.
If you didn't keep your results and don't know the answers, repeat the relevant steps of the practical. If necessary, talk to a demonstrator before going further.

Protein fingerprinting
To see if any fingerprints match your sequence, supply its ID code to the FingerPRINTScan form. To view a database entry, type its code into the PRINTS search form, or click on the hyperlinked output. Visualise an alignment by clicking on View alignment, which invokes an interactive alignment editor.

Commentary
The aim here is to discover if your sequence matches any known fingerprints in the PRINTS fingerprint database. Results are returned in tabular form; to visualise individual matches, click on the View graphic option (you may have to specify a filename for use with an external Postscript viewer).
Having examined a range of pattern-recognition methods, is there a consensus between them? Is there correspondence with the BLAST result? If several methods suggest similar matches, this adds weight to your diagnosis. Consider how results of searching patterns, profiles, blocks, fingerprints, etc., differ.
If you have matched a fingerprint, examine the full database entry, within which there are links to other databases (BLOCKS, PDB, scop, etc.) and analysis tools. Explore some of these links, and try retrieving the source alignment from which your fingerprint was derived (this invokes the CINEMA alignment editor, which will only work if your browser supports Java; we will return to CINEMA later).
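The difference between a complete and a partial fingerprint match can be illustrated with a toy example. The sketch below treats a fingerprint as an ordered list of motifs and looks for each in turn along the sequence. Note the assumptions: the motifs are written as simple regular expressions purely for illustration (a real fingerprint scan scores motifs more flexibly than exact pattern matching), and both the motifs and the function names are hypothetical.

```python
import re


def match_fingerprint(sequence, motifs):
    """Search for each motif in order; return its 0-based start, or None if absent."""
    positions = []
    cursor = 0  # later motifs must lie beyond the previous match
    for motif in motifs:
        hit = re.compile(motif).search(sequence, cursor)
        if hit:
            positions.append(hit.start())
            cursor = hit.end()
        else:
            positions.append(None)
    return positions


def describe_match(positions):
    """Classify a result as 'complete', 'partial' or 'none'."""
    found = sum(p is not None for p in positions)
    if found == len(positions):
        return "complete"
    return "partial" if found else "none"
```

A sequence that carries only some of the motifs, for example a fragment or a divergent family member, shows up as a partial match rather than being missed altogether, which is one reason a multi-motif fingerprint can be more informative than a single pattern.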

Exercise 3
Searching for fingerprints
How many matches did you retrieve from PRINTS? Are the matches significant (how do you judge significance)? If you have matched a fingerprint, how many motifs does it contain? How does the result compare with those from PROSITE, profiles, Pfam and BLOCKS? Does BLOCKS report the same number of conserved regions? Is there a difference between the locations of the matched motifs and those of matched patterns or blocks? If so, why might this be? Again, is this result building a consistent picture?
Information retrieval from PRINTS
To what other databases is the PRINTS entry linked? How many sequences match the fingerprint completely? Are there any partial matches? If so, why might this be? From what version of OWL was the fingerprint derived? If there is a PROSITE equivalent, how many sequences match the pattern? Is the number of matches different? If so, why might this be so? Does the PROSITE entry report any false matches? Depending on your answer, how "good" do you think the regular expression is compared, say, with the fingerprint (or profile, or blocks)? Why might the other methods perform better?
If you didn't keep your results and don't know the answers, repeat the relevant steps of the practical. If necessary, talk to a demonstrator before going further.

Sequence alignment
To view an alignment of your family, supply the PRINTS ID code to the ALIGN query form, select either Postscript or GIF output and Send Query. Also look for alignments at the iProClass or ProDom servers. Try CINEMA only if your browser supports Java; otherwise, experiment with CLUSTALW.

Commentary
The aim here is to visualise an alignment of your family, e.g. using the ALIGN compendium of PRINTS seed alignments (try both PostScript and GIF output - your browser may not succeed with both). Try also the MIPS & ProDom Web servers (these sites generally offer only static views of alignments and do not allow interactive manipulation).
A better option is to explore your alignment using an interactive editor, such as CINEMA (to learn how to operate the editor, please consult its help file). Familiarise yourself with the controls and use the program to customise your alignment: e.g., add or delete sequences, change the display colours, etc. If possible, create an alignment for your report.
Now try running some sequences through an automatic program, such as CLUSTALW. Paste your sequences into the input box (remember to use an accepted format). Compare the results with those from CINEMA. [NB: CINEMA is having problems with newer versions of Netscape and Explorer - we're trying to fix this, so please be patient.]
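When examining an alignment, conserved regions can also be located numerically rather than by eye. The sketch below scores each column by the fraction of sequences sharing its most common residue and then reports runs of highly conserved columns. It is a deliberately simple measure chosen for illustration (no substitution matrix, gaps written as "-" and counted as non-matching), and the function names are invented here.

```python
from collections import Counter


def column_conservation(alignment):
    """Return, per column, the fraction of sequences sharing the commonest residue."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]  # gaps never count as agreement
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / len(column))
    return scores


def conserved_regions(scores, threshold=1.0, min_length=3):
    """Return (start, end) column ranges, 1-based inclusive, scoring >= threshold."""
    regions, start = [], None
    for i, score in enumerate(scores + [-1.0]):  # sentinel closes a trailing run
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            if i - start >= min_length:
                regions.append((start + 1, i))
            start = None
    return regions
```

Lowering `threshold` below 1.0 admits columns with a few substitutions, which is usually closer to what the motif databases regard as a conserved region.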

Exercise 4
Multiple sequence alignment
Analyse your alignment and pinpoint the areas corresponding to the various conserved regions found by PROSITE, BLOCKS and/or PRINTS (include information from the profile library and Pfam where relevant). Annotate the alignment accordingly. Is there any structural/functional significance in the locations of these motifs?
Automatic sequence alignment
Compare your manual and automatically-generated alignments. Do the results differ? If so, in what way or ways? For example, are there features that might indicate that one of them has been generated automatically? If so, what are the clues?
If you didn't keep your results and don't know the answers, repeat the relevant steps of the practical. If necessary, talk to a demonstrator before going further.

Sequence property profiles
To create physicochemical profiles (hydropathy, solvent-accessible surface area, etc.), paste the sequence into this input box. For a secondary structure prediction profile, use the NPS@ form. Compare results with nnpredict.

Commentary
The aim here is to analyse your sequence in terms of a range of physicochemical parameters (hydropathy, flexibility, solvent-accessible surface area, etc.). The form allows comparison of three hydropathy scales, one of which is specific for transmembrane proteins. Do the different scales give similar results? Does a change of window length significantly alter your result? (Note: when using this facility, you may have to specify a filename for use with an external Postscript viewer).
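A hydropathy profile is simply a sliding-window average of per-residue values, which is why the window length matters so much. The sketch below uses the Kyte-Doolittle scale as an example; this choice is an assumption, since the scales offered by the form are not named here, and the function name is invented for illustration.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=9, scale=KYTE_DOOLITTLE):
    """Return (centre position, mean hydropathy) for each window along the sequence.

    Positions are 1-based. A residue missing from the scale raises KeyError.
    """
    seq = sequence.upper()
    if window < 1 or window > len(seq):
        raise ValueError("window must be between 1 and the sequence length")
    values = [scale[residue] for residue in seq]
    half = window // 2
    profile = []
    for start in range(len(values) - window + 1):
        mean = sum(values[start:start + window]) / window
        profile.append((start + half + 1, mean))
    return profile
```

Short windows give a noisy trace that follows individual residues, whereas long windows smooth the plot; a window of around 19 residues is often used when looking for membrane-spanning segments. Passing a different dictionary as `scale` lets you compare scales on the same sequence.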
This form also allows you to view a Garnier-Osguthorpe-Robson secondary structure prediction profile (text output is also available to make interpretation of the result a little easier). For comparison, we suggest using nnpredict, which allows you to specify the tertiary structure class of your query sequence (but note that there are many other prediction packages available). How similar are the results from these methods, and how reliable do you think they are likely to be?

Exercise 5
Plotting physicochemical profiles
Having constructed plots of a variety of physicochemical parameters (hydropathy, flexibility, solvent-accessible surface area, etc.), do you observe a significant difference between the various hydropathy scales? What conclusion can you draw about the locations of the most flexible parts of the sequence?
Secondary structure prediction
From examination of your prediction profiles, how accurate do you think they are, and why? Are there any differences between results from GOR and nnpredict? What and where are they? How might these predictions be improved? Keep these results and compare them with known structural data, which you'll find in the following pages.
Creating regular expressions
Analyse the conserved regions highlighted in your sequence alignment. Choose one of these and create your own regular expression (using a different region from that encoded by PROSITE!) - further details on how to do this are available on the regular expression search form. Are the results of your pattern search consistent with those indicated in PROSITE, BLOCKS and/or PRINTS? How and why might they differ?
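One way to draft a regular expression is to derive it column by column from the aligned motif. The sketch below is a starting point only, under stated assumptions: invariant columns become a single residue, columns with a few alternatives become a bracketed set, and anything more variable (or containing a gap) becomes a wildcard. The cut-off of three alternatives, the function name and the example rows are all made up for illustration.

```python
import re


def motif_to_regex(aligned_motif, max_alternatives=3):
    """Build a regular expression from equal-length rows of an aligned motif."""
    if len({len(row) for row in aligned_motif}) != 1:
        raise ValueError("all motif rows must have the same length")
    parts = []
    for column in zip(*aligned_motif):
        residues = sorted(set(column))
        if "-" in residues or len(residues) > max_alternatives:
            parts.append(".")              # too variable: accept any residue
        elif len(residues) == 1:
            parts.append(residues[0])      # invariant position
        else:
            parts.append("[" + "".join(residues) + "]")
    return "".join(parts)


# Hypothetical motif rows, not taken from any database entry.
rows = ["GDSGG", "GDSGS", "GNSGG"]
pattern = motif_to_regex(rows)
matches_all_rows = all(re.fullmatch(pattern, row) for row in rows)
```

A pattern built this way matches every sequence it was derived from, but it may be too strict to find new family members or too loose to exclude chance matches; adjusting `max_alternatives` shows the trade-off.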
If necessary, talk to a demonstrator before going further.

Exercise 6
Sequence analysis challenge
So far, you've worked on an easy example, which was diagnosed immediately. Often, circumstances are not so kind, and your sequence might live in the Twilight realms of similarity. To test your deductive powers further, you're now faced with a challenge.
Some mystery sequences
Below is a list of mystery sequences. Choose one and apply your analytical skills to determine what the protein might be and what conserved sites it might contain.
Sequence 1, Sequence 2, Sequence 3, Sequence 4, Sequence 5, Sequence 6

Don't expect a positive diagnosis at once. Cross-check your findings using a range of databases, starting here, and support your results with a sequence alignment. Alternatively, you might want to try a shortcut. Good luck!
Discuss your diagnosis with a demonstrator before going further.

Fold classification
If your sequence has a known structure, examine its summary file by supplying its PDB code to the Query form - click on the CATH button to go to the CATH database. Alternatively, examine the SCOP database by entering a PDB code into its Query form. Compare results from SCOP & CATH.

Commentary
The aim of the last part of this analysis is to learn about the structure of your protein, if its 3D coordinates are available. It is convenient to start by examining the PDB summary files provided by Roman Laskowski's PDBsum resource: this supplies the sequence and secondary structure of the protein, schematic and Raster3D images of its ligand (if it has one), and its protein-ligand interactions. Clicking on the picture at the top left of the structure summary provides an overview. PDBsum provides links to the CATH structure classification database, which you should explore in detail.
An alternative classification resource is SCOP. This may be queried either by supplying a PDB code directly, or by means of a text string. It is of interest to compare the particular classification offered in scop with that given by CATH to verify whether the results are similar.
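A CATH number is a dotted code whose four fields correspond to successive levels of the hierarchy: Class, Architecture, Topology and Homologous superfamily. The sketch below splits such a number into its levels. The function names are invented here, the class names listed are the four main CATH classes, and the number 3.40.50.300 is used purely as an example value.

```python
CATH_CLASSES = {
    1: "Mainly Alpha",
    2: "Mainly Beta",
    3: "Alpha Beta",
    4: "Few Secondary Structures",
}
LEVELS = ("Class", "Architecture", "Topology", "Homologous superfamily")


def parse_cath_number(cath_number):
    """Return the code at each of the four CATH levels, e.g. '3.40' for Architecture."""
    parts = cath_number.strip().split(".")
    if len(parts) != 4 or not all(part.isdigit() for part in parts):
        raise ValueError("expected four dot-separated integers, e.g. 3.40.50.300")
    return {
        level: ".".join(parts[:depth])
        for depth, level in enumerate(LEVELS, start=1)
    }


def class_name(cath_number):
    """Look up the name of the Class (first field); unknown values are reported as such."""
    first = int(cath_number.strip().split(".")[0])
    return CATH_CLASSES.get(first, "unknown class")
```

Two proteins that share the first three fields but differ in the fourth have the same fold (Topology) without being assigned to the same homologous superfamily, which is a useful distinction to bear in mind when comparing CATH with scop.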

Exercise 7
The CATH classification system
What is the CATH number that classifies your protein family? Explain what the classification means in terms of its Class, Architecture, Topology and Homology.
Visualising protein structure
Does your protein have an associated ligand? If so, which residues in the sequence interact with the ligand? Referring to your sequence alignment, are these residues conserved? Would you expect them to be? Do any of them lie in the motifs defined by PROSITE, BLOCKS or PRINTS?
The scop classification system
How is your protein classified in scop? Does the classification differ from that given by CATH? If so, why might this be so?
If you didn't keep your results and don't know the answers, repeat the relevant steps of the practical. If necessary, talk to a demonstrator before going further.
