Section 8-3: Reading Frame Estimation Programs



Subsection 8.3.1

Principle

What is a reading frame? Mother nature will know. For the computer, a reading frame is any stretch of a sequence which starts with a start codon and ends with a stop codon; the stretch in between is assumed to code for the protein. The situation is rather trivial if we analyse cDNA samples. However, keep in mind that side effects (badly controlled library generation etc.) and sequencing errors might also be an issue in cDNA analysis.
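The computer's view of a reading frame can be sketched in a few lines of Python. This is only an illustration of the definition above, not the method of any GCG program: it assumes ATG as the only start codon, the three standard stop codons, and the forward strand only. The function name `find_orfs` and the example sequence are made up for this sketch.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"  # assumption: alternative start codons are ignored


def find_orfs(seq, min_codons=1):
    """Return (start, end, frame) for every ATG...stop stretch on the forward strand.

    start is the 0-based index of the A in ATG, end is the index just past
    the stop codon, frame is 0, 1 or 2.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i                      # first ATG opens the frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ATG
    return orfs


example = "CCATGAAATGGTAACC"
orfs = find_orfs(example)   # [(2, 14, 2)]: ATG AAA TGG TAA in frame 2
```

A real analysis would also scan the reverse complement and usually demands a minimum length, since short ATG...stop stretches occur by chance in any sequence.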

Genomic Sequence Analysis: Detection of Coding Regions

The systematic sequencing of genomes will result in long sequences for which it is unknown whether they are translated into a protein at all. Therefore, one of the prime targets of genomic sequence analysis will be to spot the location of splice sites, coding regions, and intron/exon boundaries.

NOTE: It is important to realise that, due to the complexity of the matter, no computer analysis is perfect. The methods available perform a PREDICTION which may not be reliable. Results require experimental validation or other supportive data.

One approach is to analyse the sequence for the regularity with which nucleotide patterns occur. It has been shown that, in a reading frame, certain patterns will occur in periodic fashion. The detection of such patterns in a relatively large range (window of 300 bp and more) is the operational hypothesis of the 'testcode' program in the GCG package.
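One simple way to make "periodic occurrence" measurable is to ask, for each base, how unevenly it is spread over the three positions modulo 3 within a window. The sketch below does just that. It is not a reimplementation of 'testcode'; the function names, the asymmetry formula max/(min + 1), and the step size are assumptions chosen for illustration, and only the window of 300 bp is taken from the text.

```python
def position_asymmetry(window):
    """For each base, compare its counts on positions 0, 1 and 2 modulo 3.

    Returns a dict base -> max_count / (min_count + 1). Values well above 1
    mean the base prefers one of the three positions (a period-3 signal).
    """
    window = window.upper()
    result = {}
    for base in "ACGT":
        counts = [0, 0, 0]
        for i, b in enumerate(window):
            if b == base:
                counts[i % 3] += 1
        result[base] = max(counts) / (min(counts) + 1)
    return result


def scan(seq, window=300, step=3):
    """Yield (offset, asymmetry dict) for successive windows along seq."""
    last = max(len(seq) - window, 0)
    for offset in range(0, last + 1, step):
        yield offset, position_asymmetry(seq[offset:offset + window])
```

Plotting the per-window values along the sequence gives a curve in which coding stretches would be expected to stand out, provided the window is long enough for the counts to be meaningful.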

Other programs are more powerful and use a defined set of patterns as a 'learning set'. Due to the restrictions of patterns, these programs apply very sophisticated methods which go far beyond pattern matching. Keep in mind, however, that most of these prediction programs are severely restricted to the species whose sequences were used to create them. You are encouraged to carefully study the documentation of the gene prediction programs to check whether they are applicable to your problem.

As prediction programs operate with statistical methods, results are frequently expressed as a 'probability'. Unfortunately, more than a single pattern set or sequence motif is required to build the prediction, and many programs report more than one number. E.g., a given algorithm might predict a reading frame with an 80% probability, but with this probability threshold only 66% of the test cases are predicted correctly. Therefore, you are also encouraged to try the program of choice with several well-known examples which are similar to your unknown sequence in order to assess the numerical figures with better knowledge.
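The difference between a reported probability and the share of correct predictions can be checked on known examples with a few lines of Python. The data below are invented purely to illustrate the arithmetic; they do not come from any real program, and the function name `fraction_correct` is hypothetical.

```python
def fraction_correct(predictions, threshold):
    """predictions: list of (reported_probability, is_really_coding) pairs.

    Returns the share of predictions at or above the threshold that were
    correct, or None if no prediction reaches the threshold.
    """
    selected = [correct for prob, correct in predictions if prob >= threshold]
    if not selected:
        return None
    return sum(selected) / len(selected)


# Invented test cases: the reported probability and whether the annotation
# of the well-known sequence confirms the prediction.
known_cases = [
    (0.95, True), (0.90, True), (0.85, False), (0.82, True),
    (0.81, True), (0.80, False), (0.60, False), (0.40, True),
]

share = fraction_correct(known_cases, 0.80)   # 4 of 6, i.e. about 0.67
```

Running the program of choice on sequences whose annotation you trust, and tabulating the outcome in this way, shows how far a reported "80%" can be relied upon for your kind of sequence.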

Programs beyond GCG exist but are not currently widely supported. Some mail and WWW servers on the Internet offer tools to predict gene structures.

The explora program will predict genetic models of yeast sequences, and the genefinder program suite allows the prediction of genetic features in human, Drosophila and nematode systems. These programs run via the Hierarchical Access system for Sequence Libraries in Europe (HASSLE) and are specifically adapted for use within the GCG package at the BioComputing facility in Basel.

Comparison of Codon Frequencies

Amino acids like methionine or tryptophan use a single triplet of bases for coding. Other amino acids, like serine, use up to six different codons, and these codons may theoretically be used equally well. Highly expressed genes, however, frequently show a preference for a certain codon, while other codons are rarely used or not utilised at all ("rare codons").
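The number of codons per amino acid is easy to verify from the standard genetic code. The following Python sketch builds the standard table from its usual compact notation (bases in the order T, C, A, G; '*' marks a stop codon) and groups the codons by amino acid. The variable names are chosen for this example only.

```python
from collections import defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, in the order TTT, TTC, TTA, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

# Group the codons by the amino acid they encode.
codons_for = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    codons_for[aa].append(codon)

print(codons_for["M"])          # ['ATG']
print(codons_for["W"])          # ['TGG']
print(sorted(codons_for["S"]))  # ['AGC', 'AGT', 'TCA', 'TCC', 'TCG', 'TCT']
```

Serine, leucine and arginine are the three amino acids with six codons each, which is why they contribute most to any measure of codon bias.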

If we analyse a reading frame and detect it schematically (like in the GCG program 'frames'), it is possible to determine the codon usage within the predicted reading frame as the start and stop codons are "known". As we know the expected codon usage from other genes, we may compare the two and obtain a numeric value which is either supportive or will possibly suggest that the reading frame will not be expressed in vivo due to the unfavourable codon usage. Using a window of several codons (such as a stretch of 25 or more codons), statistics might be significant enough to even spot reading frame errors as the comparison curve for the codon usage will decline numerically at the point of error. Comparing several alternatives, a decision for a reading frame is theoretically possible. The 'codonpreference' program of the GCG package uses this approach. Refer to the explanations above for details on window techniques.
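The window comparison can be sketched as follows. Note that this is not the statistic used by 'codonpreference': as a simplifying assumption, each codon is scored by the logarithm of its frequency in a set of reference genes relative to a uniform expectation of 1/64, and the scores are averaged over a window of 25 codons. The reference table is compiled from sequences you supply; the function names and the pseudocount are hypothetical.

```python
import math
from collections import Counter
from itertools import product

ALL_CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def codon_frequencies(reference_genes, pseudocount=1.0):
    """Relative frequency of each codon in in-frame reference coding sequences."""
    counts = Counter()
    for gene in reference_genes:
        gene = gene.upper()
        for i in range(0, len(gene) - 2, 3):
            counts[gene[i:i + 3]] += 1
    # The pseudocount keeps unseen codons from getting a frequency of zero.
    total = sum(counts[c] for c in ALL_CODONS) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}


def window_scores(seq, freqs, frame, window=25):
    """Mean log-odds (reference usage vs. uniform 1/64) per window of codons."""
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    # Codons missing from the table (e.g. containing N) score as neutral (0).
    logodds = [math.log(freqs.get(c, 1 / 64) * 64) for c in codons]
    return [sum(logodds[i:i + window]) / window
            for i in range(len(logodds) - window + 1)]
```

Computing `window_scores` for frames 0, 1 and 2 and comparing the three curves mirrors the idea described above: the frame whose codons resemble the reference usage scores highest, and a frameshift error would show up as a point where the best-scoring curve changes.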

A more pragmatic approach to determining possible reading frames is to compute the usage of G or C in the third base of the predicted codons. The value, expressed as GC bias, is meaningful in similar fashion as the comparison curve for the codon preferences, and is also plotted by the GCG program 'codonpreference'.
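The third-position GC usage is simple to compute per window. The sketch below is a minimal illustration of the idea and makes no claim to reproduce the curve plotted by 'codonpreference'; the function name and the window of 25 codons are assumptions.

```python
def gc3_profile(seq, frame, window=25):
    """Fraction of codons with G or C in the third position, per window.

    frame is 0, 1 or 2; the window is measured in codons.
    """
    seq = seq.upper()
    third_is_gc = [seq[i + 2] in "GC" for i in range(frame, len(seq) - 2, 3)]
    return [sum(third_is_gc[i:i + window]) / window
            for i in range(len(third_is_gc) - window + 1)]
```

For example, `gc3_profile("AAG" * 30, 0)` yields 1.0 in every window because each codon ends in G, while frames 1 and 2 of the same sequence yield 0.0. In organisms with a GC preference in the third position, the true frame would be expected to give the highest curve.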

NOTE: Predictions based on the comparison of codon usage are not applicable or at least negatively affected if

1) no codon bias is observed at all (weakly expressed genes)

2) reading frame errors occur repeatedly

3) introns are not removed by sequence editing before analysis


Subsection 8.3.2

Programs

NOTE: It is assumed that you have completed all the setup operations.

The expected codon usage, obviously, must be known for the purpose of reading frame prediction based on codon usage with the 'codonpreference' program. The GCG package provides tables for various organisms:

 

  
organism      table name
---------------------------------------------
human         genmoredata:human_high.cod
fruit fly     genmoredata:drosophila_high.cod
yeast         genmoredata:yeast_high.cod
plants        genmoredata:maize_high.cod
(default)     genrundata:eco_high.cod     (E. coli highly expressed genes)
  

  
You can compile your own codon frequency table with the program codonfrequency (see above). If you want to see just the codon preference, or use monochrome terminals only, use the command

% codonpreference -nobias

The GCG package program 'findpatterns' will use patterns from any database of patterns or even typed-in patterns in order to locate these in your sequence. One application of this program is to search the transcription factor database from D. Ghosh. This database will be used if you type

% findpatterns -dat=genmoredata:tfsites.dat

NOTE: The application of patterns in DNA analysis is, due to the complexity of the matter, very restricted. Transcription factors as listed in the database above are examples and not really exploited patterns. This will result in many "false positives". See below for a discussion.
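To see why short patterns produce many hits, it helps to look at what a pattern search does. The following Python sketch is not the 'findpatterns' implementation; it merely translates a pattern written with IUPAC ambiguity codes into a regular expression and reports every (possibly overlapping) match on the forward strand. The function name and the example pattern TATAWA are illustrative assumptions and are not taken from the database.

```python
import re

# IUPAC nucleotide ambiguity codes expressed as regular-expression classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def find_pattern(seq, pattern):
    """Return (position, matched text) for every match of an IUPAC pattern.

    Positions are 0-based; a lookahead is used so overlapping matches
    are reported as well.
    """
    body = "".join(IUPAC[ch] for ch in pattern.upper())
    regex = re.compile("(?=(" + body + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(seq.upper())]


hits = find_pattern("GGTATAAAAGG", "TATAWA")   # [(2, 'TATAAA')]
```

A six-base pattern with one ambiguous position is expected to match by chance roughly once every two thousand bases of random sequence, which is the source of the "false positives" mentioned in the note above.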

================================= Begin Exercise 6

DNA reading frame analysis: Determine the reading frame of the DNA sequence GENEMBL:M19311 and compare the result with the annotation of this database sequence.

To solve this problem, follow this schedule:

================================= End Exercise 6

