JAMF Archive
BioCompanion as published in 1995
THIS IS THE REFERENCE CODE AS PUBLISHED. Doelz, R. Optimal production of biological documentation: the JAM format. Comput. Applic. Biosci. 11, 224-226 (1995).
The version you are currently viewing is the one printed and distributed via the Internet from the server of BioComputing Basel. Version 3.1 of the BioCompanion was published with version 2 of the JAMF software. The server that was indicated in the documentation has ceased to exist. Version 3.2 of the BioCompanion was not publicly available for free but was shareware that was distributed with GCG's software release 9. For the purpose of enhanced editing, JAMF was partially rewritten and the proprietary version 3.x of JAMF was used from 1996 onwards. The BioCompanion is available in a current version from the publisher. It has significantly changed both in software and content.
The following description assumes that the
setup of the GCG package has already been completed. The
sequences used for input must be in
GCG format. Use reformat or
genmanual sequence_exchange for details on
how to convert sequences into GCG format.
Text Output: Many
programs ask for an output file. This is mostly plain character (text)
output. You can review this text with the (command line) command
type/page. WPI users may
use the display
function in the "output manager window".
Graphic Output:
To display graphics, you need to tell the GCG software whether you
want to use a display (= screen) or a printer (= hardcopy).
Command line users must make sure that the graphics setup works.
To achieve this, initialise the display with the command
setplot and select the option which suits you best.
NOTE: If you select X-Windows, a window should appear on your screen
after the selection.
WPI users may use the
display function in the "output manager window".
Check that the display or printer
works properly when you use it for the first time by producing a
test graphic with plottest.
This is the last time that these details are described.
The following chapters of the BioCompanion assume that all these
setup operations
have been successfully completed.
A very important prerequisite of biological
sequence analysis is a defined alphabet
which lists the allowed symbols and
their meaning. The DNA alphabet is rather simple at first
glance: A, G, C, T, U, N (any). However, in order to express
properties shared between nucleotides, the IUPAC has defined so-called
"ambiguity symbols"; for example, the letter S stands for
either G or C.
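The ambiguity symbols can be written down as a small lookup table. The sketch below lists the standard IUPAC nucleotide codes and shows how a symbol such as S expands to its concrete bases. The names IUPAC, matches and expand are chosen here for illustration only and are not part of the GCG package.

```python
from itertools import product

# IUPAC nucleotide symbols and the concrete bases each one stands for.
# U is treated as T so that DNA and RNA can share one table.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def matches(symbol, base):
    """Return True if the concrete base is allowed by the given symbol."""
    base = base.upper().replace("U", "T")
    return base in IUPAC[symbol.upper()]

def expand(pattern):
    """List every concrete sequence described by a pattern of IUPAC symbols."""
    choices = [IUPAC[s] for s in pattern.upper()]
    return ["".join(p) for p in product(*choices)]

print(matches("S", "g"))   # S covers G
print(expand("AS"))        # the two sequences described by "AS"
```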
================================= Begin Exercise 4
A small hunting exercise: Find the DNA alphabet.
In order to use biological sequences, the computer utilises
a defined alphabet which assigns nucleotides
or amino acids to single letters. These assignments are written
in tables. The purpose of this exercise is to
find the IUPAC table for nucleotide symbols. Proceed as follows:
================================= End Exercise 4
A biological sequence can be characterised
by counting its composition. Knowing that your sequence contains a
certain number of residues is, however, of little use on its own,
as you want to correlate
this with either other residues or other sequences. Therefore,
you need to normalise the numbers. Two procedures
are applied:
The data are expressed as percent (%) of the whole sequence.
Basically, you normalise the length of the entire sequence to
100 and you determine the (fictitious) composition of this sequence.
With these numbers, you can easily compare sequences without
knowing how many residues/base pairs your protein or
DNA sequence has. E.g., if a protein has 33% glycine, this is a very high
number and might be significant for a given class of proteins
(e.g., collagens).
Sequences will be of different length, or contain several domains.
Therefore, in order to compare fragments, you consider only
a part of the sequence, which has a shorter length than the
entire sequence. This is an essential concept in biocomputing
methods and is called a window.
In this window, you determine only the desired composition figure
and plot it against the position in the sequence.
Consider the following sequence:
Next, let us analyse this sequence with a window of
the size 8. This window is symbolised as |------| in
the plot below. We count the composition in the first fragment
- tgatggtc - three G's and one C. This corresponds
to a total value of 4, and we enter this in the middle of our
window of 8, which is at position 4.
This technique is not restricted to DNA sequences. However, the
protein alphabet has no default ambiguity symbols,
as the 20 amino acid symbols use up most of the alphabet. The
trick is to change the sequence artificially; you will try this
in an exercise later.
================================= Begin Exercise 5
DNA composition: Determine the G/C content of a DNA sequence
as function of the sequence.
In order to determine the G/C content, follow this schedule:
================================= End Exercise 5
NOTE: Programs which produce graphics are marked with
an asterisk (*).
The larger the window, the more finely graded the resulting curve
will be, as the number of possible counts of a pattern in the
window increases. E.g., a window size of 30 allows anything from 0
to 30 occurrences of "S", whereas a window size of 5 can only take
the values 0 to 5.
The smaller the window, the more precise will
be the location of a given effect. Values computed for a given
window are plotted at the middle of the window, so a window
of 30 has an uncertainty of fifteen positions.
If you have the
EGCG programs installed, you might want to use the following
programs:
Use the egenhelp of these programs for
more details.
What is a reading frame? Mother nature will know. For the
computer, a reading frame is any stretch of a sequence which
starts with a start codon and ends with a stop
codon. In between, the protein is assumed.
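This computational definition can be sketched in a few lines. The function below is an illustration only, not the algorithm of any GCG program: it assumes ATG as the only start codon, the three standard stop codons, and it scans the forward strand only. The name find_orfs and the parameter min_codons are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}  # standard stop codons (assumption: standard code)

def find_orfs(seq, min_codons=1):
    """Return (start, end, frame) for every stretch that begins with ATG and
    runs to the first in-frame stop codon. Positions are 1-based and the end
    position includes the stop codon; frame is 1, 2 or 3."""
    seq = seq.upper().replace("U", "T")
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # remember the first start codon
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start + 1, i + 3, frame + 1))
                start = None                   # look for the next start codon
    return orfs

print(find_orfs("ccATGaaatttTAAgg"))
```

To cover the complementary strand as well, the same function can be applied to the reverse complement of the sequence.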
The situation is rather trivial if we analyse cDNA samples.
However, keep in mind that side effects (badly controlled library
generation etc.) and sequencing errors might also be an issue
in cDNA analysis.
Genomic Sequence Analysis: Detection of Coding Regions
The systematic sequencing of genomes will
result in long sequences for which it is unknown whether they translate
into a protein at all. Therefore, one of the
prime targets of genomic sequence analysis will be to spot the
location of splicing sites, coding regions, and intron/exon boundaries.
NOTE: It is important to realise that, due to the complexity
of the matter, no computer analysis is perfect. The methods available
perform a PREDICTION which may not be reliable. Results require
experimental validation or other supportive data.
One approach is to analyse the regularity with which
nucleotide patterns occur in the sequence. It has been
shown that, in a reading frame, certain patterns will occur in
periodic fashion. The detection of such patterns in a relatively
large range (window of 300 bp and more) is the operational hypothesis
of the 'testcode' program in the GCG package.
Other programs are more powerful and use a defined set of
patterns as a 'learning set'. Due to the
restrictions of patterns, these programs apply very sophisticated
methods which go far beyond pattern matching. Keep in mind, however,
that most of these prediction programs are severely restricted
to the species which was used to create them. You
are encouraged to carefully study the documentation of the gene
prediction programs to check whether they are applicable to
your problem.
As prediction programs operate with statistical methods, results
are frequently expressed as a 'probability'. Unfortunately,
more than a single pattern set or sequence motif is required
to build the prediction, and many programs report more than
one number. E.g., a given algorithm might predict a reading frame
with an 80% probability, but with this probability threshold only
66% of the test cases are predicted correctly. Therefore, you
are also encouraged to try the program of choice with several
well-known examples which are similar to your unknown sequence
in order to assess the numerical figures with better knowledge.
Programs beyond GCG exist but are currently not
widely supported. Some mail and WWW servers on the
Internet offer tools to predict gene structures.
The explora program will predict genetic models
of yeast sequences, and the genefinder program
suite allows the prediction of genetic features in
human, drosophila and nematode systems. These programs run via
the Hierarchical Access System for Sequence
Libraries in Europe (HASSLE) and are specifically adapted for
use within the GCG package at the BioComputing facility in Basel.
Comparison of Codon Frequencies
Amino acids like methionine or tryptophan use a single triplet
of bases for coding. Other amino acids, like serine, use up to
six different codons, and these codons may theoretically be used
equally well. Highly expressed genes, however, frequently show
a preference for a certain codon, while other
codons are rarely used or not utilised at all ("rare codons").
If we analyse a reading frame and detect it schematically (like
in the GCG program 'frames'), it is possible to determine the
codon usage within the predicted reading frame as the start and
stop codons are "known". As we know the expected codon
usage from other genes, we may compare the two and obtain a
numeric value which is either supportive or will possibly suggest
that the reading frame will not be expressed in vivo due to the
unfavourable codon usage. Using a window of
several codons (such as a stretch of 25 or more codons), statistics
might be significant enough to even spot reading frame errors
as the comparison curve for the codon usage will decline numerically
at the point of error. Comparing several alternatives, a decision
for a reading frame is theoretically possible. The 'codonpreference'
program of the GCG package uses this approach. Refer to the
explanations above for details
on window techniques.
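Counting the codon usage within a predicted reading frame is the first step of such a comparison. The sketch below only illustrates the counting and the normalisation among synonymous codons; it does not reproduce the scoring used by 'codonpreference'. The function names and the example reading frame are hypothetical.

```python
from collections import Counter

def codon_usage(orf):
    """Count the codons of a reading frame given from its first base.
    One or two trailing bases that do not form a codon are ignored."""
    orf = orf.upper().replace("U", "T")
    return Counter(orf[i:i + 3] for i in range(0, len(orf) - 2, 3))

def relative_usage(counts, synonymous):
    """Fraction of each codon within a group of synonymous codons."""
    total = sum(counts[c] for c in synonymous)
    return {c: (counts[c] / total if total else 0.0) for c in synonymous}

# The six serine codons of the standard genetic code.
SERINE = ["TCT", "TCC", "TCA", "TCG", "AGT", "AGC"]

counts = codon_usage("ATGTCTTCTAGCTAA")      # made-up reading frame
print(relative_usage(counts, SERINE))
```

Comparing such fractions with the table of a highly expressed reference gene set is what turns the raw counts into a preference value.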
A more pragmatic approach to determining possible reading
frames is to compute the usage of G or C
in the third base of the predicted codons. The value,
expressed as GC bias, is meaningful in a similar
fashion to the comparison curve for the codon preferences, and
is also plotted by the GCG program 'codonpreference'.
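The third-position G/C usage can be computed with the same window idea used for composition plots. The following is a minimal sketch under the assumption that the window is counted in codons and moved by one codon at a time; the scaling used by 'codonpreference' may differ, and the name gc3_profile is hypothetical.

```python
def gc3_profile(seq, frame=0, window=25):
    """For a given frame (0, 1 or 2), return a list of
    (index of first codon in window, fraction of codons ending in G or C)."""
    seq = seq.upper()
    thirds = [seq[i + 2] for i in range(frame, len(seq) - 2, 3)]
    flags = [1 if base in "GC" else 0 for base in thirds]
    return [(k, sum(flags[k:k + window]) / window)
            for k in range(len(flags) - window + 1)]

# Tiny made-up example with a window of two codons.
print(gc3_profile("ATGGCCAAAGGG", frame=0, window=2))
```

Running the function for all three frames and comparing the curves mirrors the comparison of alternatives described above.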
NOTE: Predictions based on the comparison of codon usage
are not applicable, or are at least negatively affected, if
1) no codon bias is observed at all (weakly expressed genes)
2) reading frame errors occur repeatedly
3) introns are not removed by sequence editing before analysis
NOTE: It is assumed that you have completed all the setup operations.
The codon usage, obviously, must be
known for the purpose of reading frame prediction based on codon
usage with the 'codonpreference' program. The GCG package provides
tables for various organisms:
$ codonpreference/nobias
The GCG package program
'findpatterns' will use patterns from any database of patterns
or even typed-in patterns in order to locate these in your sequence.
One application of this program is to search the transcription
factor database from D. Ghosh. This database will be used
if you type
$ findpatterns /dat=genmoredata:tfsites.dat
NOTE: The application of patterns in DNA analysis is,
due to the complexity of the matter, very restricted. Transcription
factor sites as listed in the database above are examples rather than
thoroughly characterised patterns. This will result in many "false positives".
See below for
a discussion.
================================= Begin Exercise 6
DNA reading frame analysis: Determine the reading frame
of the DNA sequence GENEMBL:M19311 and compare the result with
the annotation of this database sequence.
To solve this problem, follow this schedule:
================================= End Exercise 6
The analysis of a DNA sequence to estimate composition or
coding regions was based on little auxiliary data. If we want to
detect possible cleavage sites in a biological sequence we need
to have the known sites listed in a database. In
contrast to the codon usage tables, which are systematic
and complete, restriction enzyme tables need to consider
different sites, including variants.
In order to define a pattern in the
nucleotide alphabet, the use of ambiguity symbols is a good way
to allow several different symbols to be used at one position.
Proteins, however, will need a different mechanism. The definition
and properties of patterns are described in a
later section of the BioCompanion. Briefly, the restriction
enzyme cleavage sites are described in a format called a pattern
with the following properties:
NOTE: This type of program assumes that cleavage and binding
site are extremely close to each other. The programs using patterns
to describe restriction enzymes are NOT usable for other purposes
unless explicitly mentioned.
Limitations of the Pattern Approach in DNA Analysis
Patterns in DNA are mostly known by example. Very little is known
about detailed properties (such as promoter requirements). Look
at the following example. The pattern language for a simple promoter,
such as
The program prime
can predict "good" primers from a given nucleotide
sequence. Note that this program only suggests
multiple primers; the user has to evaluate suitable positions
from the output. The program 'prime' computes a text output and
a graphic overview which is suitable to identify regions of good
primers, as the top hits are usually located in only two
or three regions rather than being equally dispersed over the entire
region of interest. The 'prime' program has some limitations,
as it should not be used to predict primers with a target of
more than a certain length and a certain maximum length for each
region.
Other software packages, specifically, PC-type applications,
might be worth considering if you use primer predictions frequently.
Restriction enzymes cleave DNA sequences at certain positions.
A program which analyses such cleavage sites will, therefore,
compare the entire DNA input sequence against a database
of enzymes and locate matches between the
DNA sequence and the binding site of the enzyme as described
in the database. The output of the programs shows the location
of the cleavage sites either schematically (an overview plot,
as graphics), or analytically (printed sequence and restriction
enzyme cleavage sites). The latter is the most
detailed view but is overloaded with information and occasionally
too crowded. Therefore, it is possible to exclude enzymes from the display
even if they would theoretically match. The criteria for this
exclusion can be the following:
If the size of the
fragment matters, programs are available which will display the
fragments sorted by size rather than by cleavage position. For
this purpose, it matters whether the sequence is circular or
not (in a circular sequence such as a plasmid, a single cut does not
produce two fragments).
Last but not least, the GCG program package provides functionality
for drawing plasmids with their cleavage sites.
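The difference between linear and circular sequences can be made concrete with a small sketch. It uses EcoRI (recognition site GAATTC, cut after the first G) as the only enzyme and does not look for sites spanning the origin of a circular sequence; the function names and the test sequence are made up for illustration and do not correspond to the GCG 'map' or 'mapsort' programs.

```python
import re

def cut_positions(seq, site="GAATTC", offset=1):
    """1-based positions after which the enzyme cuts the top strand.
    Defaults describe EcoRI (G^AATTC)."""
    seq = seq.upper()
    # The lookahead finds overlapping occurrences of the site as well.
    return [m.start() + offset for m in re.finditer("(?=%s)" % site, seq)]

def fragment_sizes(length, cuts, circular=False):
    """Fragment lengths, largest first, for a linear or circular molecule."""
    cuts = sorted(cuts)
    if not cuts:
        return [length]
    if circular:
        sizes = [b - a for a, b in zip(cuts, cuts[1:])]
        sizes.append(length - cuts[-1] + cuts[0])   # fragment across the origin
    else:
        bounds = [0] + cuts + [length]
        sizes = [b - a for a, b in zip(bounds, bounds[1:])]
    return sorted(sizes, reverse=True)

seq = "AAGAATTCAAAAGAATTCAA"
cuts = cut_positions(seq)
print(cuts)
print(fragment_sizes(len(seq), cuts))                  # linear
print(fragment_sizes(len(seq), cuts, circular=True))   # circular
```

With a single cut, the circular case returns one fragment of full length, which is the plasmid situation mentioned above.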
Useful options to 'map', 'mapsort', 'mapplot':
The program plasmidmap (*) reads a *.tick
file generated by the mapsort program
used with the plasmid option. To get started,
you might want to fetch the example files
and try it with these:
$ fetch pgamma.*
$ plasmidmap @pgamma.fil
Further information is available in printed form, and it is
highly recommended to review this documentation before you spend
extended periods of time with the programs. Alternatively, on-line
help is available. To get started, use the command genhelp
plasmidmap description.
NOTE: In version 8.x of the package, GCG graphics cannot easily be
converted into PC graphics formats.
Encapsulated PostScript is an option for high-quality prints,
in combination with manual re-inking if required. Public domain
and commercial software packages might suit the purpose better
than the 'plasmidmap' program from GCG. Before you investigate
these alternatives, however, please make sure that it is worth
the effort.
Two programs should be run, one after the other. The first is
needed to determine the reading frame. If you know it already
or if you ran the corresponding analysis programs (frames
or similar), you can immediately proceed to run the second program
% translate
Note that you might want to reverse the
sequence before translation.
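What the translation step does can be sketched as follows. The example uses the standard genetic code only and is not the GCG 'translate' program; the names CODON_TABLE, reverse_complement and translate are chosen here for illustration. Codons containing symbols outside A, C, G, T are written as X, which is an assumption of this sketch.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(seq):
    """Reverse the sequence and exchange A<->T and C<->G (U is read as T)."""
    return seq.upper().translate(str.maketrans("ACGTU", "TGCAA"))[::-1]

def translate(seq, frame=0):
    """Translate from the given frame (0, 1 or 2); '*' marks a stop codon."""
    seq = seq.upper().replace("U", "T")
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(frame, len(seq) - 2, 3))

print(translate("ATGGCCTAA"))
print(translate(reverse_complement("TTAGGCCAT")))
```

The second call shows the case where the coding strand is the complement of the sequence at hand, so the sequence is reversed before translation.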
The second option is
to use the program map with the corresponding
translation options, and afterwards extract the corresponding
peptides from the output with
% extractpeptide
Translation of Genomic Sequences
The translation of genomic sequences requires that, before running
the program translate, you know the intron/exon
borders. Without this knowledge, erroneous sequences will be
the result. Unfortunately, the availability of programs to detect
these genetically relevant sites is very limited and, where such
programs exist, their use is limited by the reliability of the
predictions of computational models. The GCG program package
does not currently support this type of prediction.
Translation of Database Sequences
In the DNA sequence databases, entries of genetic origin will
frequently cross-reference the protein sequence. This saves
you a translation as you may use the protein sequence directly.
If there is no cross-reference, or if you do not have the protein sequence
database available locally, DNA sequences of genetic origin
occasionally show CDS features which describe
the position of reading frames and the corresponding intron/exon
boundaries. The translate program will allow you
to translate these one after the other. Alternatively, the WWW browser
of the SRS system will allow you to click on the peptide
feature and translate the sequence automatically.
In order to get this sequence
into GCG format, you might use the mouse and highlight the sequence
(and only the sequence). Next, copy the sequence
into the paste buffer (use the pull-down of the <Edit>
menu). Then, on the command line, you give the
command (as an example, for the sequence my.seq)
$ create my.seq
and, subsequently, you paste the contents
into the sequence (again, by using the <Edit> pull-down).
What you have done is to open a file with the create
command and to append the text
to this file. Therefore, after the paste, the
file is still open. You need to close it by typing
<CTRL><Z>.
Next, you need to reformat the file to GCG
format. As it is plain text, the program may complain about a missing
".." divider, but this should not matter.
NOTE:
1) You need to be sure that you copy only the sequence.
2) The WPI interface is not useful for this trick.
3) Check manually whether you succeeded (is M at position
?)
4) Make sure that no stop codons (indicated by "*") are present
in your sequence.
Translation of Mitochondrial Sequences
Be aware that translation requires a table which contains the
amino acid symbols resolved to the individual codons. Some sequences
might have other translation patterns. The GCG software offers
these different tables. Refer to the genhelp section
on the translate program.
The translation from amino acid code to DNA requires a correct
codon usage table. The default table might not
be suited for detailed analysis. To get an organism-specific
codon usage table, refer to the corresponding
section of the BioCompanion, or compile your own from
an existing (set of) sequence(s) with the program
$ codonfrequency
To use a specific table to translate protein into DNA, use
$ backtranslate your.seq codon.file
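The role of the codon usage table in back-translation can be illustrated with a small sketch that always picks the most frequent codon for each amino acid. This is only one possible strategy and is not claimed to be what the GCG 'backtranslate' program does; the codon counts and the names preferred_codons and back_translate are hypothetical.

```python
def preferred_codons(codon_counts, codon_table):
    """For each amino acid, pick the codon with the highest count.
    codon_table maps codon -> amino acid (single-letter code)."""
    best = {}
    for codon, n in codon_counts.items():
        aa = codon_table[codon]
        if aa not in best or n > codon_counts[best[aa]]:
            best[aa] = codon
    return best

def back_translate(protein, preferred):
    """Write the protein as DNA using one preferred codon per amino acid."""
    return "".join(preferred[aa] for aa in protein.upper())

# Made-up counts for a handful of codons; a real table covers all 64 codons.
table = {"ATG": "M", "GCC": "A", "GCT": "A", "AAA": "K", "AAG": "K"}
counts = {"ATG": 5, "GCC": 7, "GCT": 2, "AAA": 1, "AAG": 4}

print(back_translate("MAK", preferred_codons(counts, table)))
```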
e.g.,
The change of T to U and U to T can be done with the
reformat program:
$ reformat/DNA
or
$ reformat/RNA
Similarly, the case of
sequence characters can be changed with the reformat
program by using the options tolower and
toupper, respectively.
If problems occur because of a wrong sequence
type assignment, you need to reformat the sequence specifically
with type 'NUCLEIC' or 'PROTEIN', respectively.
There are various tools available which allow you to analyse
single protein sequences.
Principle
The desire to predict a secondary or even tertiary structure
from the amino acid sequence is known as the folding
problem. Unfortunately, there is no solution available
at this point in time. Two approaches are in use:
In order to predict the structures of peptides or even proteins
with as yet unknown homology to known proteins, it is necessary to
use methods which assign parameters to amino acids and thereby
allow an estimate of a possible secondary structure.
The programs in use frequently make three-state (or
four-state) predictions:
All methods work on an assumption similar to the following:
Based on an analysis of known protein structures, amino acids
are classified into "classes" which will most frequently occur
in one of the three (or four) states as described above. Statistical
methods are applied to evaluate the variance of these occurrences
in all positions of all proteins, and numerical values are assigned
to each amino acid for its probability of being found in one of
the three (or four) states. A window is applied
(see the section on window analysis
earlier in this chapter) in order to calculate a value which
is significant for the given position and amino acid in the peptide
which is to be predicted. By plotting the curves of the three (or
four) states, it is possible to derive a prediction for the whole
protein. Note that, due to this way of calculation, the states
are not mutually exclusive and therefore a considerable uncertainty
is implied.
Chou and Fasman pioneered this approach
many years ago and used the tertiary and secondary structures
of proteins available at that time. The applicability of their
method, therefore, is constrained to the set of proteins available
at that time (globular, soluble proteins).
The precision is, on average, estimated to be 60-65%
if the Chou and Fasman method is used. Robson
and Garnier have improved precision to close
to 65% on average by using a matrix of
neighbour values rather than a single window approach. Eisenberg
has used additional parameters such as "hydropathy".
Summarising, the use of secondary structure prediction from
scratch is highly speculative and should not be the only method
to reach conclusions. Averages on aligned sequences as described
in later chapters of the BioCompanion will give the best results
with an expectation to be close to 70% accurate.
Based on a sequence
homology with an existing, structurally known protein
fragment, it is possible to use advanced computer graphics displays
to build the peptide fragment covering the homology according
to the known structure
of the protein found in the database. The minimum sequence homology
required to do this kind of model building depends strongly on
the individual peptide but should definitely be better than
30%, which means that at least 10 amino acids must be identical
in a sequence of 30 amino acids. Building the model
will be achieved in three steps:
Initially, the sequence to be modelled is aligned to
the sequence with the known structure coordinates. This procedure
is described later
and shall make you familiar with the kind of replacements which
will need to be done in the model. Be careful to avoid the "impossible"
- glycine residues frequently are required to adopt an unusual
configuration, proline residues are either required in turn structures,
or their introduction will possibly break secondary structure
elements, and disulphide bridges are extremely important in protein
structure.
Secondly, an advanced computer graphics program is used. Famous
representatives include, but are not limited to, the Insight
II program package from Biosym
Technologies, the Cerius
program suite from Molecular Simulations, and the 'Whatif'
program as created and used at
EMBL, Heidelberg. The first two examples
are high-end commercial software packages (since fall 1995, available
from a single vendor) and require significant investment, whereas
the 'Whatif' package is available for a negotiable price to academic
researchers. All of these packages usually require a Silicon
Graphics computer system with high-end graphics and
a large monitor. If possible, a stereo viewing capability should
be available. Be aware of the complexity of these programs -
this is not rocket science any longer. Replacing amino acid side
chains might be an easy procedure on the display but requires
considerable thinking to get
a reasonable, biologically relevant model.
Last, once the initial model is completed, mathematical procedures
will need to be applied to refine this model.
This will require that so-called potentials are
assigned to the atoms of your peptide - which means that the
groups and residues are classified by chemical topology to be
of a certain class of arrangement.
E.g., a peptide group will be semi-planar due to the hybridisation
of the participating atoms. Subsequently, a semi-empirical force
field is applied, and the structure is fit to "ideal"
angles, atom distances and bond geometry with techniques known
as molecular dynamics and structure
refinement.
Molecular dynamics will initiate movements of atoms by introducing
a "temperature" which allows the molecule to be "shaken" into a more
ideal configuration. To do this step reliably, it is
essential to know of constraints which limit some of
the three-dimensional properties of the protein or peptide. These
constraints will be data from X-ray crystallography or NMR structure
analysis. Without these constraints, model building is
highly speculative. You should not start a
study involving molecular dynamics refinement unless you have
at least some constraints such as disulphides,
maximal cross-section or other data. A very serious and as yet
unsolved problem is to compensate for interactions of your model
peptide with the environment. This environment is frequently
"vacuum" - at least this is computationally the easiest approach.
Be aware that this is not a very satisfying approximation. Water
or membrane molecules are much better suited as interaction partners.
However, supercomputing performance will be needed to run this
analysis, and the results will be only of limited relevance if
you did not use any constraints.
Summarising, it must be stated that the molecular modelling
approach for secondary structure prediction may be very
useful and suggestive if small changes of a
peptide sequence to an already known peptide
structure are to be applied. The larger the deviation, i.e.,
the lower the similarity to a known structure, the less relevant
will be the results. Experimental structure evaluation
with X-ray and/or NMR techniques might be required
for satisfactory models.
Programs for secondary structure prediction
Remember that the prediction
of secondary structure without a reasonable homology to three-dimensional
data is rather unsafe. Programs which employ three-dimensional
modelling techniques require special hardware (powerful computers)
and dedicated software and hence are beyond the scope of the BioCompanion.
The programs available to you in the desktop environment will
typically be restricted to secondary structure prediction from
scratch. In order to display the secondary structure plots, you
need to have a computer
screen which is capable of displaying graphics. It is recommended
that you have access to a colour graphics device if you want
to run these programs.
Remember to have set the graphics environment correctly with
setplot if you work with GCG locally.
X-Windows setups must have set the
DISPLAY environment correctly.
To display several
measures of secondary structure, use
$ pepplot
To generate a table of several measures (with a comparison
of Garnier-Robson and Chou-Fasman predictions), use
$ peptidestructure
The generated output file can be plotted "two-dimensionally",
but for serious inspection the one-dimensional plotting is recommended
(use the corresponding menu option):
$ plotstructure
EGCG Programs
If you have the
EGCG programs installed, you might want to use
sigcleave, helixturnhelix and
antigenic for the analysis of peptide sequences. Use
the egenhelp of these programs for more details.
Given the assumption that the protein fragment adopts a helical
structure, the program helicalwheel can be
used.
The program moments plots a three-dimensional
map which displays moments
of hydropathy as a function of the position in the sequence and the rotational
angle of the peptide bond (90 - 110 degrees is typical for helices,
0 or 180 degrees indicates a chance of beta sheet).
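The quantity behind such a map is the hydrophobic moment: each residue contributes its hydropathy value as a vector that is rotated by a fixed angle per residue, and the length of the vector sum is reported. The sketch below assumes the Kyte-Doolittle hydropathy scale; which scale the 'moments' program uses is not stated here, and the function name is hypothetical.

```python
import math

# Kyte-Doolittle hydropathy values (assumed scale for this sketch).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydrophobic_moment(peptide, angle_deg=100.0):
    """Length of the summed hydropathy vectors when successive residues
    are rotated by angle_deg (about 100 degrees for an alpha helix)."""
    delta = math.radians(angle_deg)
    peptide = peptide.upper()
    s = sum(KD[aa] * math.sin(n * delta) for n, aa in enumerate(peptide))
    c = sum(KD[aa] * math.cos(n * delta) for n, aa in enumerate(peptide))
    return math.hypot(s, c)

# Moment of a made-up 11-residue window at two rotation angles.
for angle in (100.0, 180.0):
    print(angle, round(hydrophobic_moment("LKELLKKLLEE", angle), 2))
```

Evaluating the function for every window position and for a range of angles gives the kind of map described above.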
EGCG Programs
If you have the
EGCG programs installed, you might want to use
pepcoil and pepne for the analysis
of peptide sequences with aliphatic edges, and pepwheel
for analysis similar to "helicalwheel" as described
above. Use the egenhelp of these programs
for more details.
The programs peptidemap
and peptidesort work like their DNA counterparts.
The isoelectric point of the denatured protein can be determined
from the titration curve plotted by the program isoelectric.
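How an isoelectric point follows from a titration curve can be sketched as below: the net charge is computed as a function of pH, and the pH at which it crosses zero is located by bisection. The pKa values are one commonly used approximate set and are an assumption of this sketch (published sets differ, and the values used by the 'isoelectric' program are not given here); the function names are hypothetical.

```python
# Approximate side-chain and terminal pKa values (assumed for this sketch).
PKA_POS = {"K": 10.8, "R": 12.5, "H": 6.5}            # carry +1 when protonated
PKA_NEG = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}   # carry -1 when deprotonated
PKA_NTERM, PKA_CTERM = 8.6, 3.6

def net_charge(peptide, ph):
    """Net charge of the unfolded peptide at the given pH."""
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_NTERM))      # free amino terminus
    charge -= 1.0 / (1.0 + 10 ** (PKA_CTERM - ph))     # free carboxy terminus
    for aa in peptide.upper():
        if aa in PKA_POS:
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POS[aa]))
        elif aa in PKA_NEG:
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEG[aa] - ph))
    return charge

def isoelectric_point(peptide, tol=1e-4):
    """pH at which the net charge is zero, found by bisection on 0..14."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if net_charge(peptide, mid) > 0:
            lo = mid      # still positive: the zero crossing is at higher pH
        else:
            hi = mid
    return (lo + hi) / 2

print(round(isoelectric_point("MDEKRA"), 2))
```

Bisection works here because the net charge can only decrease as the pH rises.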
Frequently, you might want to know where "acidic" or other
regions of your protein sequence are located. As ambiguity symbols
are not defined in the single-letter peptide alphabet, you might
rewrite your sequence with the simplify program and use the window
program in order to plot the result with statplot.
The data for the simplify program
are located in a file which you can get from the GCG program
database with the command
$ FETCH SIMPLIFY.TXT
This file has a self-describing format. Basically, each amino
acid listed in the second column will be replaced with the amino
acid listed in the first column:
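The replacement step and the subsequent window count can be sketched as follows. The class table used here (acidic residues written as D, basic residues as K) is a hypothetical example and is not the content of SIMPLIFY.TXT; the function names are likewise chosen for illustration.

```python
def simplify(peptide, classes):
    """Replace every residue listed as a class member by the class symbol.
    classes maps the replacement symbol to the residues it stands for."""
    lookup = {aa: rep for rep, members in classes.items() for aa in members}
    return "".join(lookup.get(aa, aa) for aa in peptide.upper())

def window_count(seq, symbol, window=10):
    """Number of occurrences of symbol in each window, moved by one residue."""
    return [seq[s:s + window].count(symbol)
            for s in range(len(seq) - window + 1)]

# Hypothetical classes: D stands for acidic (D, E), K for basic (K, R, H).
CLASSES = {"D": "DE", "K": "KRH"}

simple = simplify("MDEKRA", CLASSES)
print(simple)
print(window_count(simple, "D", window=3))
```

Plotting the counts for the symbol D against the window position then shows where acidic regions are located.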
================================= Begin Exercise 7
Summary of single-sequence tools: Translate the sequence
GENEMBL:M19311 in the determined reading frame, perform a secondary
structure prediction from scratch, and plot the acidic amino
acids as function of the sequence.
To use amino acid sequences, the computer needs a defined
reading frame in the
DNA sequence which allows the translation into a peptide
sequence. The translated amino acids are written into a peptide
sequence. The purpose of this exercise is to create
the sequence M19311.pep and predict its secondary
structure. Proceed as follows:
================================= End Exercise 7
Many additional
single sequence analysis programs are available as public
domain programs. Others come as shareware,
which means that you are free to try them out but should register
(and pay a fee) if you use the programs regularly. This BioCompanion
is shareware, too. Remember to verify the
license status of a software program before installing or
using it. Software piracy is illegal and punishable by law.
The availability of these programs might vary. Some authors distribute
the software via floppy disks or the Internet, some use electronic
mail.
CAUTION
Be sure that you are aware of the ramifications if you request or
install programs which are possibly of dubious origin. Because of viruses,
trojan horses and other security-related issues, any such
activity is PROHIBITED unless the installation has been allowed
by the system manager and/or the person responsible for software
maintenance at your site.
IN COMMERCIAL ENVIRONMENTS, YOU ARE USUALLY NOT ALLOWED TO INSTALL
SOFTWARE YOURSELF.
In order to utilise the additional software, you will need
to convert your sequence data to STADEN format,
i.e., you will strip all non-sequence information from the sequence
and transfer the data to the local computer with 'ftp'
or a similar file transfer protocol.
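Stripping the non-sequence information from a GCG-format file can be sketched as below. The sketch assumes the usual layout: free text, then a line containing the ".." divider, then sequence lines with position numbers. The header line in the example is illustrative only (its checksum is not a real value), and the function name is hypothetical.

```python
def gcg_to_plain(text):
    """Return only the sequence symbols of a GCG-format entry.
    Everything up to and including the first line containing '..' is
    treated as header; digits and blanks in the remainder are dropped."""
    lines = text.splitlines()
    start = 0
    for i, line in enumerate(lines):
        if ".." in line:
            start = i + 1
            break
    body = "".join(lines[start:])
    return "".join(ch for ch in body if ch.isalpha() or ch == "*")

example = (
    "example sequence\n"
    "\n"
    "my.seq  Length: 12  Type: N  Check: 0  ..\n"
    "\n"
    "       1  tgatggtcaa gt\n"
)
print(gcg_to_plain(example))
```

If no divider line is found, the whole text is treated as sequence, so such input should be checked by hand.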
This is particularly useful if you use
the WWW interface to access
programs which will execute jobs remotely. The program
readseq is very useful to
interconvert all kinds of sequence formats. Alternatively, try
one of the programs of the GCG package. To get information about
GCG's reformatting programs, use
$ genmanual sequence_exchange
Prerequisites for all examples and instructions
Options for result display
NOTICE
Composition-Counting Programs
Principle
Detailed View on the "windows" Technique
tgatggtcaagtaaactatgaagagtttgtacaaatgatgacagcaaagtgcgaagac
This sequence fragment has a length of 58 base pairs.
If you add the numbers for G (15) and C (7), you end up with
a total of 22. Your sequence, therefore, has a G/C content
of (22/58*100) = 38%.
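The arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the function name `gc_content` is an assumption made for illustration and is not part of the GCG package.

```python
def gc_content(seq):
    """Return the percentage of G and C symbols in a nucleotide sequence."""
    seq = seq.lower()
    gc = seq.count("g") + seq.count("c")
    return 100.0 * gc / len(seq)

fragment = "tgatggtcaagtaaactatgaagagtttgtacaaatgatgacagcaaagtgcgaagac"
print(len(fragment), fragment.count("g"), fragment.count("c"))  # 58 15 7
print(round(gc_content(fragment)))  # 38
```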
          ^
no. of  8 |
G or C  7 |
found   6 |
in 8    5 +
        4 |    x
        3 |
        2 |
        1 |
        0 |
           tgatggtcaagtaaactatgaagagtttgtacaaatgatgacagcaaagtgcgaagac
           ----+----+----+----+----+----+----+----+----+----+----+------------->
               5   10   15   20   25   30   35   40   45   50   55   sequence
           |------| --> moving this window of 8 along the sequence
Our window started at position 1. We then shift our window
along the sequence in increments of 4 (an increment of 1 would be
possible, but we use a larger increment here in order to reduce work). This
means that the window now starts at position 5 and we will plot at position
5 + 8/2 = 9, or, expressed as a formula, [start of window] + [size
of window] divided by 2. The second window, therefore, is ggtcaagt,
which has three G's and one C. We plot this result at
position (9, 4) of our graph:
          ^
no. of  8 |
G or C  7 |
found   6 |
in 8    5 +
        4 |    x   x
        3 |
        2 |
        1 |
        0 |
           tgatggtcaagtaaactatgaagagtttgtacaaatgatgacagcaaagtgcgaagac
           ----+----+----+----+----+----+----+----+----+----+----+------------->
               5   10   15   20   25   30   35   40   45   50   55   sequence
           |------| --> moving this window of 8 along the sequence
Continuing, the next window starts at position 9 (5 plus
the increment of 4) and has the composition aagtaaac.
This time, the number of (G or C) is two and we plot
at position 9 + 8/2 = 13, i.e. at (13, 2):
          ^
no. of  8 |
G or C  7 |
found   6 |
in 8    5 +
        4 |    x   x
        3 |
        2 |            x
        1 |
        0 |
           tgatggtcaagtaaactatgaagagtttgtacaaatgatgacagcaaagtgcgaagac
           ----+----+----+----+----+----+----+----+----+----+----+------------->
               5   10   15   20   25   30   35   40   45   50   55   sequence
           |------| --> moving this window of 8 along the sequence
You might want to complete the plot yourself. Such a plot
visualises the G/C
richness of the sequence as a function of the position in the
sequence, which allows conclusions to be drawn about the function
of this DNA fragment.
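The window walk described above can be completed with a short Python sketch. It is not the GCG 'window' program; the function name `gc_window_counts` is an assumption made for illustration. It uses a window of 8, an increment of 4, and plots each count at [start of window] + [size of window] divided by 2, with positions counted from 1.

```python
def gc_window_counts(seq, size=8, step=4):
    """Return (plot position, number of G or C) for each window along seq."""
    seq = seq.lower()
    points = []
    # 1-based window starts: 1, 1 + step, ... while the window still fits.
    for start in range(1, len(seq) - size + 2, step):
        window = seq[start - 1:start - 1 + size]
        count = window.count("g") + window.count("c")
        points.append((start + size // 2, count))
    return points

fragment = "tgatggtcaagtaaactatgaagagtttgtacaaatgatgacagcaaagtgcgaagac"
points = gc_window_counts(fragment)
print(points[:3])  # [(5, 4), (9, 4), (13, 2)]
```

Calling `gc_window_counts(fragment, step=1)` gives the finer plot that an increment of 1 would produce.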
Programs
Effect of the Window Size in the 'window' Program:
EGCG Programs
Reading Frame Estimation Programs
Principle
Programs
organism     table name
---------------------------------------------
human        genmoredata:human_high.cod
fruit fly    genmoredata:drosophila_high.cod
yeast        genmoredata:yeast_high.cod
plants       genmoredata:maize_high.cod
(default:    genrundata:eco_high.cod
             E.coli highly expressed genes)
You can compile your own codon frequency table with the
program codonfrequency (see
above). If you want to see
just the codon preference, or if you use a monochrome terminal,
use the command
Restriction Enzyme Mapping Programs
Principle of Patterns
TATA box, about 30 to 300 less important base pairs, and the start codon
will read in a pattern language as
TATA(N){30,300}ATG
However, the ATG required by the pattern must not be
just any methionine codon; it must be the start codon. Therefore,
whether the result of a pattern analysis is of use depends very
much on the input sequence that is compared with the pattern.
Most genetically important elements
are, unfortunately, known only from examples. If a general
pattern is derived from these examples, we risk that many matches
of the pattern to an input sequence are computationally correct
but biologically irrelevant. The straightforward application
of patterns is therefore valid for restriction mapping, but will be
problematic for genetic motifs.
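The risk described above can be made concrete with a small Python sketch. It does not use the GCG pattern programs. It assumes that the TATA box is written as the literal string TATA and that the 30 to 300 less important bases may be any symbols; the function name `tata_atg_hits` and the test sequence are hypothetical.

```python
import re

def tata_atg_hits(seq, min_gap=30, max_gap=300):
    """List (TATA position, ATG position) pairs separated by min_gap..max_gap bases.

    Positions are 1-based. Every ATG in range is reported, because the
    pattern alone cannot tell a start codon from any other ATG.
    """
    seq = seq.upper()
    hits = []
    for tata in re.finditer(r"(?=TATA)", seq):
        gap_start = tata.start() + 4  # first base after the TATA box
        for atg in re.finditer(r"(?=ATG)", seq):
            gap = atg.start() - gap_start
            if min_gap <= gap <= max_gap:
                hits.append((tata.start() + 1, atg.start() + 1))
    return hits

# Made-up sequence: one TATA box followed by two ATG triplets.
test_seq = "TATA" + "C" * 30 + "ATG" + "C" * 5 + "ATG"
print(tata_atg_hits(test_seq))  # [(1, 35), (1, 43)]
```

Both hits are computationally correct matches, yet at most one of the two ATG triplets could be the start codon.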
Using the 'prime' Program to Predict Primers in a Pattern Approach
Principle of Restriction Enzyme Mapping in a Pattern Approach
Programs
/once                 only 1 cut in entire sequence
/sixbase              only sixbase cutters
/exclude=200,500      do not consider enzymes cutting between 200 and 500
/mincut=2             exclude enzymes cutting once or not at all
/maxcut=1             select only enzymes cutting once
mapplot only:
/noplot/out=my.txt    suppresses plot and creates text file my.txt instead
/double               doubles height of characters in graphics mode
mapsort only:
/plasmid              to create *.tick file as input to plasmidmap
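The effect of the selection options listed above can be imitated in a Python sketch. It is not how the GCG mapping programs are implemented. The function names `site_positions` and `select_enzymes` and the test sequence are hypothetical, only five well-known enzymes are included, and the reported positions are the starts of the recognition sites rather than the exact cut positions.

```python
import re

# Recognition sites (5'->3') of a few well-known enzymes; all are palindromic,
# so searching one strand is sufficient.
ENZYMES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
    "AluI": "AGCT",
    "HaeIII": "GGCC",
}

def site_positions(seq, site):
    """Return the 1-based start positions of all occurrences of site in seq."""
    return [m.start() + 1 for m in re.finditer(f"(?={site})", seq.upper())]

def select_enzymes(seq, mincut=1, maxcut=None, sixbase=False, exclude=None):
    """Filter enzymes in the spirit of /mincut, /maxcut, /sixbase and /exclude."""
    chosen = {}
    for name, site in ENZYMES.items():
        if sixbase and len(site) != 6:
            continue
        positions = site_positions(seq, site)
        if len(positions) < mincut:
            continue
        if maxcut is not None and len(positions) > maxcut:
            continue
        if exclude and any(exclude[0] <= p <= exclude[1] for p in positions):
            continue
        chosen[name] = positions
    return chosen

# Made-up test sequence with two EcoRI sites and one HindIII site.
seq = "CC" + "GAATTC" + "CC" + "AAGCTT" + "CC" + "GAATTC" + "CC"
print(select_enzymes(seq, maxcut=1))        # like /once
print(select_enzymes(seq, mincut=2))        # like /mincut=2
print(select_enzymes(seq, exclude=(10, 15)))  # like /exclude=10,15
```

Note that AluI is reported inside the HindIII site, because AAGCTT contains AGCT.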
Plasmid Drawing
Translation
DNA to Protein
Protein to DNA
backtranslate hp7764.pep drosophila_high.cod
The second file name is assumed to be the codon file.
Examine the result using the methods described in the
file handling section.
DNA to RNA and Vice Versa
Protein Tools
Secondary Structure Prediction
Visualisation of Secondary Structure
Fragmentation
Isoelectric Point
Simplification of Protein Sequences
D DEQN
will convert all D, E, Q and N symbols
to D. This might look biologically irrelevant,
but it is a good way to make all acidic amino acids and their amides read "D",
as these can then be plotted with the 'window/statplot' programs.
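The effect of such a simplification rule can be sketched in Python. This is an illustration only, not the GCG program; the function name `simplify`, the dictionary form of the rule, and the example peptide are assumptions.

```python
def simplify(seq, rules):
    """Replace symbols according to rules such as {"D": "DEQN"}.

    Each key is the symbol to write; its value lists the symbols it replaces.
    Symbols not mentioned in any rule are kept unchanged.
    """
    table = {}
    for target, members in rules.items():
        for symbol in members:
            table[symbol] = target
    return "".join(table.get(aa, aa) for aa in seq.upper())

print(simplify("MDEQNKA", {"D": "DEQN"}))  # MDDDDKA
```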
Hints on additional software
Benefits of additional software
Disadvantages
Data transfer and formatting