JAMF Archive

BioCompanion as published in 1995
THIS IS THE REFERENCE CODE AS PUBLISHED.
		Doelz, R.   
		Optimal production of biological documentation: the JAM format.
		Comput. Applic. Biosci. 11, 224-226 (1995).    
		
The version you are currently viewing is the one printed and distributed via the Internet from the server of BioComputing Basel. Version 3.1 of the BioCompanion was published with version 2 of the JAMF software. The server that was indicated in the documentation has ceased to exist.

Version 3.2 of the BioCompanion was not publicly available for free but was shareware that was distributed with GCG's software release 9. For the purpose of enhanced editing, JAMF was partially rewritten, and the proprietary version 3.x of JAMF was used from 1996 onwards. The BioCompanion is available in a current version from the publisher. It has changed significantly in both software and content.


Chapter 8: How to Handle a Single Sequence


Prerequisites for all examples and instructions

The following description assumes that the setup for the GCG package has already been completed. The sequences used for input must be in GCG format. Use reformat or genmanual sequence_exchange for details on how to convert sequences into GCG format.

Options for result display

Text Output: Many programs ask for an output file. This is mostly letter-by-letter output. You can review this text with the (command line) command type/page. WPI users may use the display function in the "output manager window".

Graphic Output: To display graphics, you need to tell the GCG software whether you want to use the display (= screen) or the printer (= hardcopy). Command line users should initialise the display with the command setplot and select the option which suits them best. If you select X-Windows, you should

NOTE: A window should appear on your screen after the selection.

WPI users may use the display function in the "output manager window".

Check that the display or printer works properly when you use it for the first time by producing a test graphic with plottest.

NOTICE

This is the last time that these details are described. The following chapters of the BioCompanion assume that all these setup operations have been successfully completed.


Composition-Counting Programs

Principle

A very important prerequisite of a biological sequence is a defined alphabet which lists the allowed symbols and their meaning. The DNA alphabet looks rather simple at first glance: A, G, C, T, U, N (any). However, in order to express properties shared between nucleotides, the IUPAC has defined so-called "ambiguity symbols"; the letter S, for example, stands for either G or C.
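The idea of ambiguity symbols can be sketched in a few lines of Python. This is a minimal illustration, not part of the GCG package: the function names iupac_to_regex and matches are made up for this example, and the table lists the standard IUPAC nucleotide codes (U is treated as T).

```python
import re

# Standard IUPAC nucleotide ambiguity codes; U is treated like T here.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def iupac_to_regex(pattern):
    """Translate a pattern written in IUPAC symbols into a regular expression."""
    parts = []
    for symbol in pattern.upper():
        bases = IUPAC[symbol]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)

def matches(pattern, sequence):
    """Return the 1-based start positions at which the pattern occurs."""
    regex = re.compile("(?=" + iupac_to_regex(pattern) + ")")
    seq = sequence.upper().replace("U", "T")
    return [m.start() + 1 for m in regex.finditer(seq)]

print(iupac_to_regex("GSC"))   # G[CG]C
print(matches("S", "atgc"))    # [3, 4]
```

The symbol S matches both the G at position 3 and the C at position 4, which is exactly what "either G or C" means.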

================================= Begin Exercise 4

A small hunting exercise: Find the DNA alphabet.

In order to use biological sequences, the computer utilises a defined alphabet which assigns nucleotides or amino acids to single letters. These assignments are written in tables. The purpose of this exercise is to find the IUPAC table for nucleotide symbols. Proceed as follows:

================================= End Exercise 4

The characterisation of a biological sequence can be achieved by counting the composition. Knowing that your sequence contains a certain number of residues matters very little in itself, however, because you want to correlate this number with either other residues or other sequences. Therefore, you need to normalise the numbers. Two procedures are applied:

Detailed View on the "windows" Technique

Consider the following sequence:

 
  
    tgatggtcaagtaaactatgaagagtttgtacaaatgatgacagcaaagtgcgaagac  
  
This sequence fragment has a length of 58 base pairs. If you add the numbers for G (15) and C (7), you end up with a total of 22. Your sequence, therefore, has a G/C content of (22/58*100) = 38%.
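The arithmetic above can be checked with a short Python sketch. The function name gc_content is an assumption made for this example; it is not a GCG program.

```python
def gc_content(sequence):
    """Return the number of G's, the number of C's and the G/C content in percent."""
    seq = sequence.lower()
    g = seq.count("g")
    c = seq.count("c")
    return g, c, 100.0 * (g + c) / len(seq)

fragment = "tgatggtcaagtaaactatgaagagtttgtacaaatgatgacagcaaagtgcgaagac"
g, c, percent = gc_content(fragment)
print(len(fragment), g, c, round(percent))   # 58 15 7 38
```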

Next, let us analyse this sequence with a window of size 8. This window is symbolised as |------| in the plot below. We count the composition of the first fragment, tgatggtc: three G's and one C. This gives a total of 4, which we enter at the middle of our window of 8, i.e. at position 5 (the window start, 1, plus half the window size).

 
       ^  
no. of | 8     
G or C | 7      
found  | 6  
in 8   + 5  
       | 4     x     
       | 3         
       | 2                 
       | 1                 
       | 0   
           tgatggtcaagtaaactatgaagagtttgtacaaatgatgacagcaaagtgcgaagac      
           ----+----+----+----+----+----+----+----+----+----+----+------------->  
               5   10   15   20   25   30   35   40   45   50   55   sequence  
  
           |------|   --> moving this window of 8 along the sequence  
Our window started at position 1. We then shift our window along the sequence in increments of 4 (an increment of 1 would be possible, but we use a larger one here in order to reduce the work). This means that the window now starts at position 5 and we plot at position 5 + 8/2 = 9 or, expressed as a formula, [start of window] + [size of window] divided by 2. The second window, therefore, is ggtcaagt, which has three G's and one C. We plot this result at position (9, 4) of our graph:
 
       ^  
no. of | 8     
G or C | 7      
found  | 6  
in 8   + 5  
       | 4     x   x  
       | 3         
       | 2                 
       | 1                 
       | 0   
           tgatggtcaagtaaactatgaagagtttgtacaaatgatgacagcaaagtgcgaagac      
           ----+----+----+----+----+----+----+----+----+----+----+------------->  
               5   10   15   20   25   30   35   40   45   50   55   sequence  
  
               |------|   --> moving this window of 8 along the sequence  
Continuing, the next window starts at position 9 (5 plus the increment of 4) and has the composition aagtaaac. This time, the number of (G or C) is two and we plot at position 9 + 8/2 = 13, i.e. at (13, 2):
 
       ^  
no. of | 8     
G or C | 7      
found  | 6  
in 8   + 5  
       | 4     x   x  
       | 3         
       | 2             x       
       | 1                 
       | 0   
           tgatggtcaagtaaactatgaagagtttgtacaaatgatgacagcaaagtgcgaagac      
           ----+----+----+----+----+----+----+----+----+----+----+------------->  
               5   10   15   20   25   30   35   40   45   50   55   sequence  
  
                   |------|   --> moving this window of 8 along the sequence  
You might want to complete the plot yourself. Such a plot visualises the G/C richness of the sequence as a function of the position in the sequence, which allows conclusions on the functionality of this DNA fragment.
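If you prefer to let the computer complete the plot, the window technique described above can be sketched as follows. This is a minimal illustration and not the GCG 'window' program; the function name window_counts and its parameters are assumptions for this example. The plot position is computed as [start of window] + [size of window] divided by 2.

```python
def window_counts(sequence, size=8, step=4, symbols="gc"):
    """Count the given symbols in windows moved along the sequence.

    Returns a list of (plot position, count) pairs with 1-based positions.
    """
    seq = sequence.lower()
    points = []
    # 1-based window starts; the last window must still fit into the sequence.
    for start in range(1, len(seq) - size + 2, step):
        window = seq[start - 1:start - 1 + size]
        count = sum(window.count(s) for s in symbols)
        points.append((start + size // 2, count))
    return points

fragment = "tgatggtcaagtaaactatgaagagtttgtacaaatgatgacagcaaagtgcgaagac"
points = window_counts(fragment)
print(points[:3])   # [(5, 4), (9, 4), (13, 2)]
```

The remaining pairs in points are the values you would enter by hand to finish the plot.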

This technique is not restricted to DNA sequences. However, there are no default symbols to count in the protein alphabet, as the 20 amino acid symbols take up most of the alphabet. The trick is to change the sequence artificially; you will try this in an exercise later.

================================= Begin Exercise 5

DNA composition: Determine the G/C content of a DNA sequence as function of the sequence.

In order to determine the G/C content, follow this schedule:

================================= End Exercise 5

Programs

NOTE: Programs which produce graphics are marked with an asterisk (*).

Effect of the Window Size in the 'window' Program:

The larger the window, the more detailed the resulting curve will be, as the number of possible counts of the pattern in a window increases. E.g., a window size of 30 allows anything from 0 to 30 occurrences of "S", whereas a window size of 5 allows only six different values (0 to 5).

The smaller the window, the more precise the location of a given effect will be. Values computed for a given window are plotted at the middle of the window. A window of 30, therefore, has an uncertainty of fifteen positions.

EGCG Programs

If you have the EGCG programs installed, you might want to use the following programs:

Use the egenhelp of these programs for more details.


Reading Frame Estimation Programs

Principle

What is a reading frame? Mother nature will know. For the computer, a reading frame is any stretch of a sequence which starts with a start codon and ends with a stop codon. In between, the protein is assumed. The situation is rather trivial if we analyse cDNA samples. However, keep in mind that side effects (badly controlled library generation etc.) and sequencing errors might also be an issue in cDNA analysis.
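The computer's view of a reading frame can be sketched as follows. This is a simplified illustration and not the GCG 'frames' program: it assumes ATG as the only start codon, the three standard stop codons, and the forward strand only. The function name find_frames is made up for this example.

```python
STOP_CODONS = {"taa", "tag", "tga"}

def find_frames(sequence, start_codon="atg", min_codons=1):
    """Return (start, end, frame) tuples for start-to-stop stretches.

    Positions are 1-based and inclusive; only the forward strand is scanned.
    """
    seq = sequence.lower()
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == start_codon:
                start = i                      # remember the first start codon
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    found.append((start + 1, i + 3, frame + 1))
                start = None                   # look for the next start codon
    return found

print(find_frames("ccatgaaatgattag"))   # [(3, 11, 3)]
```

In the example, the stretch atgaaatga (positions 3 to 11, third frame) is the only one which begins with a start codon and ends with a stop codon.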

Genomic Sequence Analysis: Detection of Coding Regions

The systematic sequencing of genomes results in long sequences for which it is unknown whether they translate into a protein at all. Therefore, one of the prime targets of genomic sequence analysis will be to spot the location of splicing sites, coding regions, and intron/exon boundaries.

NOTE: It is important to realise that, due to the complexity of the matter, no computer analysis is perfect. The methods available perform a PREDICTION which may not be reliable. Results require experimental validation or other supportive data.

One approach is to analyse the regularity with which nucleotide patterns occur in the sequence. It has been shown that, in a reading frame, certain patterns will occur in periodic fashion. The detection of such patterns in a relatively large range (window of 300 bp and more) is the operational hypothesis of the 'testcode' program in the GCG package.

Other programs will be more powerful and use a defined set of patterns as a 'learning set'. Due to the restrictions of patterns, the programs apply very sophisticated methods which go far beyond pattern matching. Keep in mind, however, that most of these prediction programs will be severely restricted to the species which has been used to create the programs. You are encouraged to carefully study the documentation of the gene prediction programs to check whether these are applicable to your problem.

As prediction programs operate with statistical methods, result figures are frequently expressed as 'probability'. Unfortunately, more than a single pattern set or sequence motif is required to build the prediction, and many programs express more than one number. E.g., a given algorithm might predict a reading frame with an 80% probability, but with this probability threshold only 66% of the test cases are predicted correctly. Therefore, you are also encouraged to try the program of choice with several well-known examples which are similar to your unknown sequence in order to assess the numerical figures with better knowledge.

Programs beyond GCG are not currently widely supported but exist. Some mail and WWW servers in the Internet offer tools to predict gene structures.

The explora program will predict genetic models of yeast sequences, and the genefinder program suite allows the prediction of genetic features in human, Drosophila and nematode systems. These programs run via the Hierarchical Access System for Sequence Libraries in Europe (HASSLE) and are specifically adapted for use within the GCG package at the BioComputing facility in Basel.

Comparison of Codon Frequencies

Amino acids like methionine or tryptophan use a single triplet of bases for coding. Other amino acids, like serine, use up to six different codons, and these codons may theoretically be used equally well. Highly expressed genes, however, frequently show a preference for a certain codon, while other codons are rarely used or not utilised at all ("rare codons").

If we analyse a reading frame and detect it schematically (like in the GCG program 'frames'), it is possible to determine the codon usage within the predicted reading frame as the start and stop codons are "known". As we know the expected codon usage from other genes, we may compare the two and obtain a numeric value which is either supportive or will possibly suggest that the reading frame will not be expressed in vivo due to the unfavourable codon usage. Using a window of several codons (such as a stretch of 25 or more codons), statistics might be significant enough to even spot reading frame errors as the comparison curve for the codon usage will decline numerically at the point of error. Comparing several alternatives, a decision for a reading frame is theoretically possible. The 'codonpreference' program of the GCG package uses this approach. Refer to the explanations above for details on window techniques.
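The principle of comparing the codons of a predicted reading frame with an expected codon usage can be sketched as follows. Note that this is not the statistic computed by 'codonpreference'; it is a deliberately simple stand-in (the mean reference frequency of the codons in each window), and the names codon_counts and window_preference as well as the tiny reference table are assumptions made for this example.

```python
from collections import Counter

def codons_in_frame(sequence, frame=1):
    """Split a sequence into complete codons, starting in frame 1, 2 or 3."""
    seq = sequence.lower()
    return [seq[i:i + 3] for i in range(frame - 1, len(seq) - 2, 3)]

def codon_counts(sequence, frame=1):
    """Count how often each codon occurs in the given frame."""
    return Counter(codons_in_frame(sequence, frame))

def window_preference(sequence, reference, frame=1, window=25):
    """Mean reference frequency of the codons in each window of codons.

    `reference` maps codon -> expected relative frequency; unknown codons
    count as 0. A drop in the curve hints at unfavourable codon usage.
    """
    codons = codons_in_frame(sequence, frame)
    scores = []
    for k in range(len(codons) - window + 1):
        chunk = codons[k:k + window]
        scores.append(sum(reference.get(c, 0.0) for c in chunk) / window)
    return scores

# Hypothetical reference values for illustration only.
reference = {"ctg": 0.8, "cta": 0.2}
print(codon_counts("ctgctgcta"))
print(window_preference("ctgctgcta", reference, window=3))
```

Running the same comparison for all three frames and plotting the scores against the window position gives curves which can be compared with each other, as described above.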

A more pragmatic approach to determining possible reading frames is to compute the usage of G or C in the third base of the predicted codons. The value, expressed as GC bias, is meaningful in a similar fashion to the comparison curve for the codon preferences, and is also plotted by the GCG program 'codonpreference'.
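The third-position G/C usage is easy to compute by hand or with a short sketch like the following. The function name gc_third_position is an assumption for this example, and the value is computed for a whole frame rather than per window as 'codonpreference' would plot it.

```python
def gc_third_position(sequence, frame=1):
    """Percentage of codons in the given frame whose third base is G or C."""
    seq = sequence.lower()
    third = [seq[i + 2] for i in range(frame - 1, len(seq) - 2, 3)]
    if not third:
        return 0.0
    return 100.0 * sum(1 for base in third if base in "gc") / len(third)

# Codons in frame 2 of this sequence are tgg and cca.
print(gc_third_position("atggccaaa", frame=2))   # 50.0
```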

NOTE: Predictions based on the comparison of codon usage are not applicable or at least negatively affected if

1) no codon bias is observed at all (weakly expressed genes)

2) reading frame errors occur repeatedly

3) introns are not removed by sequence editing before analysis

Programs

NOTE: It is assumed that you have completed all the setup operations.

The expected codon usage, obviously, must be known for the purpose of reading frame prediction based on codon usage with the 'codonpreference' program. The GCG package provides tables for various organisms:

 
  
organism      table name  
----------------------------------------------------------------------  
human         genmoredata:human_high.cod  
fruit fly     genmoredata:drosophila_high.cod  
yeast         genmoredata:yeast_high.cod  
plants        genmoredata:maize_high.cod  
E.coli        genrundata:eco_high.cod   (default: highly expressed genes)  
  
You can compile your own codon frequency table with the program codonfrequency (see above). If you want to see just the codon preference, or use monochrome terminals only, use the command

$ codonpreference/nobias

The GCG package program 'findpatterns' will use patterns from any database of patterns, or even typed-in patterns, in order to locate these in your sequence. One application of this program is to search the transcription factor database from D. Ghosh. This database will be used if you type

$ findpatterns /dat=genmoredata:tfsites.dat

NOTE: The application of patterns in DNA analysis is, due to the complexity of the matter, very restricted. Transcription factors as listed in the database above are examples and not really exploited patterns. This will result in many "false positives". See below for a discussion.

================================= Begin Exercise 6

DNA reading frame analysis: Determine the reading frame of the DNA sequence GENEMBL:M19311 and compare the result with the annotation of this database sequence.

To solve this problem, follow this schedule:

================================= End Exercise 6


Restriction Enzyme Mapping Programs

The analysis of a DNA sequence to estimate composition or coding regions was based on little auxiliary data. If we want to detect possible cleavage sites in a biological sequence, we need to have the known sites listed in a database. In contrast to the codon usage tables, which are systematic and complete, restriction enzyme tables need to consider different sites, including variances.

Principle of Patterns

In order to define a pattern in the nucleotide alphabet, the use of ambiguity symbols is a good way to allow several different symbols to be used at one position. Proteins, however, will need a different mechanism. The definition and properties of patterns are described in a later section of the BioCompanion. Briefly, the restriction enzyme cleavage sites are described in a format called a pattern with the following properties:

NOTE: This type of program assumes that cleavage and binding site are extremely close to each other. The programs using patterns to describe restriction enzymes are NOT usable for other purposes unless explicitly mentioned.

Limitations of the Pattern Approach in DNA Analysis

Patterns in DNA are mostly known by example. Very little is known about detailed properties (such as promoter requirements). Look at the following example. The description of a simple promoter, such as

 
   
TATA box, about 30 to 300 less important base pairs, and the start codon  
  
will read in a pattern language as
 
   
TATA(N){30,300}ATG  
  
However, the ATG required in the pattern must not be just any methionine codon, but the start codon. Therefore, it depends very much on the input sequence which is used for comparison in the pattern analysis whether the result of this comparison is of use or not. Most genetically important elements are, unfortunately, only known by example. Therefore, if a general pattern is derived from these examples, we risk that many comparisons of the pattern to an input sequence are computationally correct but biologically irrelevant. In short, the straightforward application of patterns is valid for restriction mapping, but will be problematic for genetic motifs.
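The problem of "computationally correct but biologically irrelevant" matches can be demonstrated with a small sketch. It lists every pair of a TATA box and an ATG which are separated by 30 to 300 arbitrary bases, as in the promoter description above. The function name promoter_candidates is made up for this example, and the code is not related to the 'findpatterns' implementation.

```python
import re

def promoter_candidates(sequence, min_gap=30, max_gap=300):
    """List every (TATA position, ATG position) pair allowed by the pattern.

    Positions are 1-based; the gap is the number of bases between the
    end of the TATA box and the beginning of the ATG.
    """
    seq = sequence.upper()
    tata = [m.start() for m in re.finditer("(?=TATA)", seq)]
    atg = [m.start() for m in re.finditer("(?=ATG)", seq)]
    pairs = []
    for t in tata:
        for a in atg:
            gap = a - (t + 4)
            if min_gap <= gap <= max_gap:
                pairs.append((t + 1, a + 1))
    return pairs

# One TATA box followed by two ATG triplets within the allowed distance.
test_sequence = "TATA" + "C" * 30 + "ATG" + "CC" + "ATG"
print(promoter_candidates(test_sequence))   # [(1, 35), (1, 40)]
```

Both pairs satisfy the pattern, but at most one of the two ATG triplets can be the real start codon. The pattern alone cannot tell you which one.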

Using the 'prime' Program to Predict Primers in a Pattern Approach

The program prime can predict "good" primers from a given nucleotide sequence. Note that this program only suggests multiple primers; the user has to evaluate suitable positions from the output. The program 'prime' produces a text output and a graphic overview which is suitable for identifying regions of good primers, as the top hits are usually located in only two or three regions rather than being evenly dispersed over the entire region of interest. The 'prime' program has some limitations: it should not be used to predict primers for a target of more than a certain length, and there is a certain maximum length for each region.

Other software packages, specifically, PC-type applications, might be worth considering if you use primer predictions frequently.

Principle of Restriction Enzyme Mapping in a Pattern Approach

Restriction enzymes will cleave DNA sequences at certain positions. A program which analyses such cleavage sites will, therefore, compare the entire DNA input sequence with a database of enzymes and locate matches between the DNA sequence and the binding site of the enzyme as described in the database. The output of the programs will show the location of the cleavage sites either schematically (an overview plot, as graphics), or analytically (printed sequence and restriction enzyme cleavage sites). The latter is the most detailed view; however, it is overloaded with information and occasionally too crowded. Therefore, it is possible to exclude enzymes from the display even if they would theoretically match. The criteria for this exclusion can be the following:

If the size of the fragment matters, programs are available which will display the fragments sorted by size rather than by cleavage position. For this purpose, it matters whether the sequence is circular or not (such as plasmids: One cut will not result in a size difference).
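The difference between linear and circular sequences can be illustrated with a minimal sketch. It uses the EcoRI site G^AATTC (cut after the first base of the site) as an example; the function names cut_positions and fragment_sizes are assumptions for this example and do not correspond to GCG programs. For simplicity, a site which spans the origin of a circular sequence is not detected.

```python
def cut_positions(sequence, site="gaattc", offset=1):
    """Return the 1-based positions after which the enzyme cuts.

    `offset` is the number of bases of the site which lie before the cut.
    """
    seq = sequence.lower()
    cuts = []
    i = seq.find(site)
    while i != -1:
        cuts.append(i + offset)
        i = seq.find(site, i + 1)
    return cuts

def fragment_sizes(length, cuts, circular=False):
    """Return the fragment sizes, largest first."""
    if not cuts:
        return [length]
    cuts = sorted(cuts)
    sizes = [b - a for a, b in zip(cuts, cuts[1:])]
    if circular:
        # The piece running through the origin closes the circle.
        sizes.append(length - cuts[-1] + cuts[0])
    else:
        sizes = [cuts[0]] + sizes + [length - cuts[-1]]
    return sorted(sizes, reverse=True)

sequence = "aaaa" + "gaattc" + "a" * 10
cuts = cut_positions(sequence)
print(cuts)                                                  # [5]
print(fragment_sizes(len(sequence), cuts))                   # [15, 5]
print(fragment_sizes(len(sequence), cuts, circular=True))    # [20]
```

A single cut produces two fragments from a linear sequence but only linearises a plasmid, so the fragment size equals the plasmid size.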

Last but not least, the GCG program package provides a function for drawing plasmids with their cleavage sites.

Programs

Useful options to 'map', 'mapsort', 'mapplot':

 
  
/once                 only 1 cut in entire sequence  
/sixbase              only sixbase cutters  
/exclude=200,500      do not consider enzymes cutting between 200 and 500  
/mincut=2             exclude enzymes cutting once or not at all  
/maxcut=1             select only enzymes cutting once  
  
mapplot only:  
/noplot/out=my.txt    suppresses plot and creates text file my.txt instead  
/double               doubles height of characters in graphics mode  
  
mapsort only:  
/plasmid              to create  *.tick file as input to plasmidmap  
  
Plasmid Drawing

The program plasmidmap (*) reads a *.tick file generated by the mapsort program used with the plasmid option. To get started, you might want to fetch the example files and try it with these:

$ fetch pgamma.*

$ plasmidmap @pgamma.fil

Further information is available in printed form, and it is highly recommended to review this documentation before you spend extended time periods with the programs. Alternatively, on-line help is available. To get started, use the command genhelp plasmidmap description .

NOTE: The GCG software package graphics cannot be easily transferred into PC type of graphics in version 8.x of the package. Encapsulated postscript will be an option for high-quality prints in combination with manual reinking if required. Public domain and commercial software packages might suit the purpose better than the 'plasmidmap' program from GCG. Before you investigate these alternatives, however, please make sure that it is worth the effort.


Translation

DNA to Protein

Two programs should be run, one after the other. The first is needed to determine the reading frame. If you know it already or if you ran the corresponding analysis programs (frames or similar), you can immediately proceed to run the second program

$ translate

Note that you might want to reverse the sequence before translation. The second option is to use the program map with the corresponding translation options, and afterwards extract the corresponding peptides from the output with

$ extractpeptide
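What a translation program does in principle can be sketched as follows. This is an illustration using the standard genetic code, not the GCG 'translate' program; the function names are assumptions for this example, and "reversing" is interpreted here as taking the reverse complement of the sequence.

```python
# Standard genetic code, codons enumerated in the base order t, c, a, g.
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {a + b + c: AMINO[16 * i + 4 * j + k]
            for i, a in enumerate(BASES)
            for j, b in enumerate(BASES)
            for k, c in enumerate(BASES)}

def reverse_complement(sequence):
    """Return the reverse complement (lower case; U is read as T)."""
    complement = str.maketrans("acgtu", "tgcaa")
    return sequence.lower().translate(complement)[::-1]

def translate(sequence, frame=1, table=STANDARD):
    """Translate complete codons in frame 1, 2 or 3; unknown codons give X."""
    seq = sequence.lower().replace("u", "t")
    codons = (seq[i:i + 3] for i in range(frame - 1, len(seq) - 2, 3))
    return "".join(table.get(codon, "X") for codon in codons)

print(translate("atggcctaa"))        # MA*
print(reverse_complement("atgc"))    # gcat
```

To translate the opposite strand, pass reverse_complement(sequence) to translate and again try frames 1 to 3.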

Translation of Genomic Sequences

The translation of genomic sequences requires that, before running the program translate , you know the intron/exon borders. Without this knowledge, erroneous sequences will be the result. Unfortunately, the availability of programs to detect these genetically relevant sites is very limited and, if possible at all, limited by the reliability of the predictions of computational models. The GCG program package does not currently support this type of prediction.

Translation of Database Sequences

In the DNA sequence databases, entries of genetic origin will frequently cross-reference the protein sequence. This saves you a translation as you may use the protein sequence directly.

If this is not true or if you do not have the protein sequence database available locally, DNA sequences of genetic origin occasionally show CDS features which describe the position of reading frames and the corresponding intron/exon boundaries. The translate program allows you to translate these one after the other. Alternatively, the WWW browser of the SRS system allows you to click on the peptide feature and translate the sequence automatically. In order to get this sequence into GCG format, you might use the mouse and highlight the sequence (and only the sequence). Next, copy the sequence into the paste buffer (use the pull-down of the <Edit> menu). Then, on the command line, you give the command (as an example, for the sequence my.seq)

$ create my.seq

and, subsequently, you paste the contents into the sequence (again, by using the <Edit> pull-down). What you have done is to open a file with the create command and you have appended the text into this file. Therefore, after the paste, the file is still open. You need to close it accordingly by typing <CTRL><Z>.

Next, you need to reformat the file to GCG format. As the file is plain text, reformat may complain about a missing ".." divider, but this should not matter.

NOTE:

1) You need to be sure that you copy only the sequence.

2) The WPI interface is not useful for this trick.

3) Apply manual checking whether you succeeded (is M at position ?)

4) Make sure that no stop codons (indicated by "*") are present in your sequence.

Translation of Mitochondrial Sequences

Be aware that translation requires a table which contains the amino acid symbols resolved to the individual codons. Some sequences might have other translation patterns. The GCG software offers these different tables. Refer to the genhelp section on the translate program.
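The effect of a different translation table can be shown with a small sketch. It compares the standard genetic code with the vertebrate mitochondrial code, in which TGA codes for tryptophan, ATA for methionine, and AGA and AGG are stop codons. The names used here are assumptions for this example; the tables offered by GCG are described in its own documentation.

```python
# Standard genetic code, codons enumerated in the base order t, c, a, g.
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {a + b + c: AMINO[16 * i + 4 * j + k]
            for i, a in enumerate(BASES)
            for j, b in enumerate(BASES)
            for k, c in enumerate(BASES)}

# Vertebrate mitochondrial code: only the differences are overridden.
VERTEBRATE_MITO = dict(STANDARD, tga="W", ata="M", aga="*", agg="*")

def translate(sequence, table):
    """Translate complete codons from the first base; unknown codons give X."""
    seq = sequence.lower().replace("u", "t")
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(table.get(codon, "X") for codon in codons)

print(translate("atgtgaata", STANDARD))          # M*I
print(translate("atgtgaata", VERTEBRATE_MITO))   # MWM
```

The same nine bases give a premature stop with the standard table and a tripeptide with the mitochondrial table, which is why the choice of table matters.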

Protein to DNA

The translation from amino acid code to DNA requires a correct codon usage table. The default table might not be suited for detailed analysis. To get an organism-specific codon usage table, refer to the corresponding section of the BioCompanion, or compile your own one from an existing (set of) sequence(s) with the program

$ codonfrequency

To use a specific table to translate protein into DNA, use

$ backtranslate your.seq codon.file

e.g.,

 
  
backtranslate hp7764.pep drosophila_high.cod  
  
The second file name will be assumed to be the codon file. Examine the result using the methods described in the file handling section.
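Why the codon usage table matters for the reverse direction can be seen in the following sketch. It uses one simple strategy, namely to pick the most frequent codon for each amino acid; this is an assumption for illustration and not necessarily what the GCG 'backtranslate' program does. The usage values below are hypothetical and are not taken from any GCG table.

```python
def best_codons(usage):
    """Pick the most frequent codon for each amino acid.

    `usage` maps codon -> (amino acid, relative frequency).
    """
    best = {}
    for codon, (amino, freq) in usage.items():
        if amino not in best or freq > best[amino][1]:
            best[amino] = (codon, freq)
    return {amino: codon for amino, (codon, _) in best.items()}

def backtranslate(peptide, usage):
    """Return one possible DNA sequence coding for the peptide."""
    choice = best_codons(usage)
    return "".join(choice[residue] for residue in peptide.upper())

# Hypothetical usage values for illustration only.
USAGE = {
    "atg": ("M", 1.00),
    "aaa": ("K", 0.25), "aag": ("K", 0.75),
    "gat": ("D", 0.40), "gac": ("D", 0.60),
}
print(backtranslate("MKD", USAGE))   # atgaaggac
```

With a different table, e.g. one in which aaa is the preferred lysine codon, the same peptide yields a different DNA sequence.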

DNA to RNA and Vice Versa

The change of T to U and U to T can be done with the reformat program:

$ reformat/DNA

or

$ reformat/RNA

Similarly, the case of sequence characters can be changed with the reformat program by using the options tolower and toupper, respectively.
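For comparison, the same conversions are one-liners in a scripting language. The function names below are assumptions for this sketch and are unrelated to the options of the reformat program.

```python
def to_rna(sequence):
    """Replace T by U, keeping the case of each character."""
    return sequence.replace("t", "u").replace("T", "U")

def to_dna(sequence):
    """Replace U by T, keeping the case of each character."""
    return sequence.replace("u", "t").replace("U", "T")

print(to_rna("ATGt"))            # AUGu
print(to_dna("auggcc").upper())  # ATGGCC
```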

If problems occur because of a wrong sequence type assignment, you need to reformat the sequence specifically with type 'NUCLEIC' or 'PROTEIN', respectively.


Protein Tools

There are various tools available which allow you to analyse single protein sequences.

Secondary Structure Prediction

Principle

The desire to predict a secondary or even tertiary structure from the amino acid sequence is known as the folding problem. Unfortunately, there is no solution available at this point of time. Two approaches are in use:

Programs for secondary structure prediction

Remember that the prediction of secondary structure without a reasonable homology to three-dimensional data is rather unsafe. Programs which employ three-dimensional modelling techniques require special hardware (powerful computers) and dedicated software, and are hence beyond the scope of the BioCompanion.

The programs available to you in the desktop environment will typically be restricted to secondary structure prediction from scratch. In order to display the secondary structure plots, you need to have a computer screen which is capable of displaying graphics. It is recommended that you have access to a colour graphics device if you want to run these programs. Remember to set the graphics environment correctly with setplot if you work with GCG locally. X-Windows setups must have the DISPLAY environment set correctly.

To display several measures of secondary structure, use

$ pepplot

To generate a table of several measures (with a comparison of Garnier-Robson and Chou-Fasman predictions), use

$ peptidestructure

The generated output file can be plotted "two-dimensionally", but for serious inspection the one-dimensional plotting is recommended (use the corresponding menu option):

$ plotstructure

EGCG Programs

If you have the EGCG programs installed, you might want to use sigcleave, helixturnhelix and antigenic for the analysis of peptide sequences. Use the egenhelp of these programs for more details.

Visualisation of Secondary Structure

Given the assumption that the protein fragment adopts a helical structure, the program helicalwheel can be used.

The program moments plots a three-dimensional map which displays moments of hydropathy as a function of the sequence and the rotational angle of the peptide bond (90 - 110 degrees is OK for helices, 0 or 180 degrees indicates chances for a beta sheet).

EGCG Programs

If you have the EGCG programs installed, you might want to use pepcoil and pepne for the analysis of peptide sequences with aliphatic edges, and pepwheel for analysis similar to "helicalwheel" as described above. Use the egenhelp of these programs for more details.

Fragmentation

The programs peptidemap and peptidesort work like their DNA counterparts.

Isoelectric Point

The isoelectric point of the denatured protein can be determined from the titration curve plotted by the program isoelectric.

Simplification of Protein Sequences

Frequently, you might want to know where "acidic" or other regions of your protein sequence are located. As ambiguity symbols are not defined in the single-letter peptide alphabet, you might rewrite your sequence with the simplify program and then use the window program in order to plot the result with statplot. The data for the simplify program are located in a file which you can get from the GCG program database with the command

$ FETCH SIMPLIFY.TXT

This file has a self-describing format, and basically will replace each amino acid listed in the second column with an amino acid listed in the first column:

 
  
 D   DEQN  
  
will convert all D, E, Q and N symbols to D. This might look biologically irrelevant, but it is a good approach to get all acidic amino acids to read "D", as these can now be plotted with the 'window/statplot' programs.
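The replacement itself can be sketched in a few lines. The sketch assumes that each rule consists of two whitespace-separated columns as in the example line above; the actual layout of SIMPLIFY.TXT may contain more than that, and the function names are made up for this example.

```python
def build_mapping(lines):
    """Read rules of the form '<replacement> <members>' into a dictionary."""
    mapping = {}
    for line in lines:
        fields = line.split()
        if len(fields) == 2:
            target, members = fields
            for residue in members:
                mapping[residue.upper()] = target.upper()
    return mapping

def simplify(peptide, mapping):
    """Replace each residue according to the mapping; others stay unchanged."""
    return "".join(mapping.get(residue, residue) for residue in peptide.upper())

mapping = build_mapping([" D   DEQN"])
print(simplify("MDEQNK", mapping))   # MDDDDK
```

The simplified sequence can then be analysed for the single symbol D with the window technique described for DNA sequences earlier in this chapter.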

================================= Begin Exercise 7

Summary of single-sequence tools: Translate the sequence GENEMBL:M19311 in the determined reading frame, perform a secondary structure prediction from scratch, and plot the acidic amino acids as function of the sequence.

To use amino acid sequences, the computer needs a defined reading frame in the DNA sequence which allows the translation into a peptide sequence. The translated amino acids are written into a peptide sequence. The purpose of this exercise is to create the sequence M19311.pep and predict its secondary structure. Proceed as follows:

================================= End Exercise 7


Hints on additional software

Many additional single sequence analysis programs are available as public domain programs. Others come as shareware, which means that you are free to try them out but should register (and pay a fee) if you use the programs regularly. This BioCompanion is shareware, too. Keep in mind that you should verify the license status of a software program before installing or using it. Software piracy is illegal and punishable by law.

The availability of these programs might vary. Some authors distribute the software via floppy disks or the Internet, some use electronic mail.

CAUTION

Be sure that you are aware of the ramifications if you request or install programs which are possibly of dubious origin. Because of viruses, trojan horses or other security-related issues, any such activities are PROHIBITED unless the installation has been allowed by the system manager and/or the person responsible for software maintenance at your site.

IN COMMERCIAL ENVIRONMENTS, YOU ARE USUALLY NOT ALLOWED TO INSTALL SOFTWARE YOURSELF.

Benefits of additional software

Disadvantages

Data transfer and formatting

In order to utilise the additional software, you will need to convert your sequence data to STADEN format, i.e., you will strip all non-sequence information from the sequence and transfer the data to the local computer with 'ftp' or a similar file transfer protocol. This is particularly useful if you use the WWW interface to access programs which will execute jobs remotely. The program readseq is very useful for interconverting all kinds of sequence formats. Alternatively, try one of the programs of the GCG package. To get information about GCG's reformatting programs, use

$ genmanual sequence_exchange

