What is a reading frame? Mother nature will know. For the computer,
a reading frame is any stretch of a sequence which starts with a start
codon and ends with a stop codon. The sequence in between is assumed to
code for the protein. The situation is rather trivial if we analyse cDNA
samples. However, keep in mind that side effects (badly controlled
library generation etc.) and sequencing errors might also be an issue in
cDNA analysis.
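The computer's view of a reading frame can be sketched in a few lines of Python. This is only an illustration of the definition given above, not the code of any GCG program; the function name find_orfs and its parameters are made up for this example, and only the strand as given is scanned.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=1):
    """Return (frame, start, end) for every ATG...stop stretch on the given strand.

    start is the 0-based index of the A of the ATG, end is the index just
    past the stop codon. min_codons counts the codons from the ATG up to,
    but not including, the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Walk through the sequence codon by codon in this frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((frame, start, i + 3))
                start = None
    return orfs

print(find_orfs("CCATGAAATTTTAGGATGCCCTGA"))
```

To cover the opposite strand as well, the same function can be applied to the reverse complement of the sequence.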
Genomic Sequence Analysis: Detection of Coding Regions
The systematic sequencing of genomes will result in long sequences for
which it is unknown whether they are translated into a protein at all.
Therefore, one of the prime targets of genomic sequence analysis will be
to locate splice sites, coding regions, and intron/exon boundaries.
NOTE: It is important to realise that, due to the complexity of the matter,
no computer analysis is perfect. The methods available perform a PREDICTION
which may not be reliable. Results require experimental validation or other
supportive data.
One approach is to analyse the regularity with which nucleotide patterns
occur in the sequence. It has been shown that, in a reading frame, certain
patterns occur in a periodic fashion. The detection of such patterns over
a relatively large range (a window of 300 bp or more) is the operational
hypothesis of the 'testcode' program in the GCG package.
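To get a feeling for what "periodic occurrence" means, the following Python sketch counts each base separately at the three codon positions of a window and reports how unevenly it is distributed. It is a deliberately simplified illustration and not the algorithm used by 'testcode'; the function names and the choice of measure (highest count divided by lowest count plus one) are assumptions made for this example.

```python
def position_asymmetry(window):
    """For each base, ratio of its highest to its lowest count over the
    three codon positions of the window (lowest count plus one avoids
    division by zero)."""
    window = window.upper()
    result = {}
    for base in "ACGT":
        # window[pos::3] picks every third base, starting at position pos.
        counts = [window[pos::3].count(base) for pos in range(3)]
        result[base] = max(counts) / (min(counts) + 1)
    return result

def scan(seq, window=300, step=3):
    """Yield (start, asymmetry dict) for successive windows along seq."""
    for start in range(0, len(seq) - window + 1, step):
        yield start, position_asymmetry(seq[start:start + window])
```

In a stretch without any periodicity the values stay close to 1, while a base that prefers one codon position gives a clearly higher value. Windows much shorter than 300 bp contain too few bases for the counts to be meaningful.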
Other programs are more powerful and use a defined set of patterns as a
'learning set'. Because simple patterns are of limited use, these programs
apply very sophisticated methods which go far beyond pattern matching.
Keep in mind, however, that most of these prediction programs are severely
restricted to the species whose sequences were used to create them.
You are encouraged to study the documentation of the gene prediction
programs carefully to check whether they are applicable to your problem.
As prediction programs operate with statistical methods, results are
frequently expressed as a 'probability'. Unfortunately, more than a single
pattern set or sequence motif is required to build the prediction, and many
programs report more than one number. E.g., a given algorithm might
predict a reading frame with an 80% probability, but at this probability
threshold only 66% of the test cases are predicted correctly. Therefore,
you are also encouraged to try the program of choice with several
well-known examples which are similar to your unknown sequence, in order
to assess the numerical figures with better knowledge.
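Trying a program on well-known examples amounts to a small bookkeeping exercise, which could be sketched in Python as follows. The scores in the list are invented for illustration only and do not come from any real program; the function name accuracy_at_threshold is likewise an assumption of this example.

```python
def accuracy_at_threshold(cases, threshold):
    """Fraction of known test cases classified correctly at a threshold.

    cases is a list of (score, is_coding) pairs: score is the 'probability'
    reported by the program, is_coding is the known truth.
    """
    correct = sum((score >= threshold) == is_coding
                  for score, is_coding in cases)
    return correct / len(cases)

# Invented scores for six well-characterised sequences (illustration only).
known = [(0.92, True), (0.85, True), (0.81, False),
         (0.60, True), (0.40, False), (0.30, False)]

print(accuracy_at_threshold(known, 0.80))
```

With these made-up numbers, four of the six cases are classified correctly at a threshold of 0.80, i.e. about 66%, although every accepted prediction carries a 'probability' above 80%.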
Programs beyond GCG exist but are currently not widely supported. Some mail
and WWW servers on the Internet offer tools to predict gene structures.
The explora program will predict genetic models of yeast sequences, and
the genefinder program suite allows the prediction of genetic features in
human, Drosophila and nematode systems. These programs run via the
Hierarchical Access System for Sequence Libraries in Europe (HASSLE) and
are specifically adapted for use within the GCG package at the
BioComputing facility in Basel.
Comparison of Codon Frequencies
Amino acids like methionine or tryptophan are encoded by a single triplet
of bases. Other amino acids, like serine, are encoded by up to six
different codons, and in theory these codons could be used equally well.
Highly expressed genes, however, frequently show a preference for certain
codons, while other codons are rarely used or not utilised at all
("rare codons").
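Codon usage within a known reading frame is easy to tabulate. The Python sketch below counts the codons of an in-frame sequence and reports the relative usage of the six serine codons; it is an illustration only, and the function names are made up for this example.

```python
from collections import Counter

# The six codons of the standard genetic code that encode serine.
SERINE_CODONS = ("TCT", "TCC", "TCA", "TCG", "AGT", "AGC")

def codon_counts(orf):
    """Count the codons of an in-frame sequence (incomplete last codon dropped)."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3
    return Counter(orf[i:i + 3] for i in range(0, usable, 3))

def synonymous_fractions(orf, codons=SERINE_CODONS):
    """Relative usage of each codon within one family of synonymous codons."""
    counts = codon_counts(orf)
    total = sum(counts[c] for c in codons)
    return {c: (counts[c] / total if total else 0.0) for c in codons}
```

For the short sequence ATGTCTTCTTCTAGCTAA, for instance, serine is encoded three times by TCT and once by AGC, so the fractions are 0.75 and 0.25, and the other four serine codons are not used at all.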
If we detect a reading frame schematically (as in the GCG program
'frames'), it is possible to determine the codon usage within the
predicted reading frame, as the start and stop codons are "known". As we
know the expected codon usage from other genes, we may compare the two and
obtain a numeric value which either supports the reading frame or suggests
that it will not be expressed in vivo due to unfavourable codon usage.
Using a window of several codons (such as a stretch of 25 or more codons),
the statistics may be significant enough to even spot reading frame
errors, as the comparison curve for the codon usage will drop at the point
of error. By comparing several alternatives, a decision for a reading
frame is theoretically possible. The 'codonpreference' program of the GCG
package uses this approach. Refer to the explanations above for details on
window techniques.
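The window comparison can be sketched in Python as follows. This is a simplified stand-in and not the statistic computed by 'codonpreference': here each codon is scored by the logarithm of its frequency in a reference table relative to a uniform expectation of 1/64, and the scores are averaged over a window of codons. The function names, the pseudocount and the scoring formula are all assumptions made for this example.

```python
import math
from collections import Counter
from itertools import product

ALL_CODONS = ["".join(p) for p in product("ACGT", repeat=3)]

def frequency_table(reference_orfs, pseudocount=1.0):
    """Codon frequencies from in-frame reference sequences.

    The pseudocount keeps unseen codons from getting a frequency of zero.
    """
    counts = Counter()
    for orf in reference_orfs:
        orf = orf.upper()
        counts.update(orf[i:i + 3] for i in range(0, len(orf) - 2, 3))
    total = sum(counts[c] for c in ALL_CODONS) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}

def preference_curve(seq, table, frame, window=25):
    """Mean log-odds score (table frequency vs. uniform 1/64) per window."""
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    # Codons with ambiguous bases are not in the table and score zero.
    scores = [math.log(table[c] * 64) if c in table else 0.0 for c in codons]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]
```

Computing the curve for frames 0, 1 and 2 and comparing them shows the idea: the frame whose codons resemble the reference table stays above zero, and a frame shift caused by a sequencing error would show up as a drop in that curve with a rise in one of the other two.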
A more pragmatic approach to determining possible reading frames is to
compute the usage of G or C in the third base of the predicted codons.
The value, expressed as GC bias, is meaningful in a similar fashion to the
comparison curve for the codon preferences, and is also plotted by the GCG
program 'codonpreference'.
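The third-position G/C usage is simple to compute by hand. The Python sketch below returns the fraction of G or C at the third codon position for a chosen frame; the function name gc3 is made up for this example, and the way the GCG program scales and plots the value may differ.

```python
def gc3(seq, frame=0):
    """Fraction of G or C at the third codon position for frame 0, 1 or 2."""
    seq = seq.upper()
    # Third positions of complete codons: frame + 2, frame + 5, ...
    third = seq[frame + 2::3]
    if not third:
        return 0.0
    return sum(base in "GC" for base in third) / len(third)
```

For the sequence ATGGCCAAG, the third positions in frame 0 are G, C and G, so the function returns 1.0, while frames 1 and 2 both give 0.5.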
NOTE: Predictions based on the comparison of codon usage are not
applicable, or are at least negatively affected, if
1) no codon bias is observed at all (weakly expressed genes)
2) reading frame errors occur repeatedly
3) introns are not removed by sequence editing before analysis
NOTE: It is assumed that you have completed all the setup operations.
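The expected codon usage, obviously, must be known for the purpose of
reading frame prediction based on codon usage with the 'codonpreference'
program. The GCG package provides codon frequency tables for various
organisms; they are listed with their table names in the table of
organisms in this chapter. An example call: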
% codonpreference -nobias
The GCG program 'findpatterns' uses patterns from any database of
patterns, or even typed-in patterns, and locates them in your sequence.
One application of this program is to search the transcription factor
database from D. Ghosh. This database will be used if you type
% findpatterns -dat=genmoredata:tfsites.dat
NOTE: The application of patterns in DNA analysis is, due to the complexity
of the matter, very restricted. The transcription factor sites listed in
the database above are examples rather than thoroughly characterised
patterns, so a search will result in many "false positives". See below
for a discussion.
================================= Begin Exercise 6
DNA reading frame analysis: Determine the reading frame of the DNA sequence
GENEMBL:M19311 and compare the result with the annotation of this database
sequence.
To solve this problem, follow this schedule:
================================= End Exercise 6
Codon frequency tables provided with the GCG package:

organism     table name
---------------------------------------------
human        genmoredata:human_high.cod
fruit fly    genmoredata:drosophila_high.cod
yeast        genmoredata:yeast_high.cod
plants       genmoredata:maize_high.cod
E.coli       genrundata:eco_high.cod   (default; highly expressed genes)
You can compile your own codon frequency table with the program
'codonfrequency' (see above). If you want to see just the codon
preference, or work on a monochrome terminal, use the
'codonpreference -nobias' command shown earlier in this chapter.