Section 8-2: Composition-Counting Programs



Subsection 8.2.1

Principle

A very important prerequisite for working with biological sequences is a defined alphabet, which lists the allowed symbols and their meanings. The DNA alphabet looks rather simple at first glance: A, G, C, T, U, and N (any). However, to express properties shared by several nucleotides, the IUPAC has defined so-called "ambiguity symbols"; for example, the letter S stands for either G or C.
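As a minimal sketch of how such an alphabet can be represented in a program, the Python fragment below maps a few symbols to the sets of nucleotides they stand for. Only a subset of the IUPAC table is shown (finding the complete table is the subject of the exercise below), and the names IUPAC_SUBSET and matches are illustrative assumptions, not part of any program discussed here.

```python
# Partial IUPAC nucleotide table (a subset, for illustration only).
# The dictionary name and the helper function are hypothetical.
IUPAC_SUBSET = {
    "A": {"A"},
    "C": {"C"},
    "G": {"G"},
    "T": {"T"},
    "S": {"G", "C"},            # "strong": G or C
    "W": {"A", "T"},            # "weak": A or T
    "R": {"A", "G"},            # purine
    "Y": {"C", "T"},            # pyrimidine
    "N": {"A", "C", "G", "T"},  # any nucleotide
}


def matches(symbol: str, base: str) -> bool:
    """Return True if `base` is one of the nucleotides `symbol` stands for."""
    return base.upper() in IUPAC_SUBSET[symbol.upper()]


print(matches("S", "g"))  # True
print(matches("S", "a"))  # False
```

Storing each symbol as a set makes the test "does this base count as S?" a simple membership check, which is all that composition counting needs.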

================================= Begin Exercise 4

A small hunting exercise: Find the DNA alphabet.

In order to use biological sequences, the computer utilises a defined alphabet which assigns nucleotides or amino acids to single letters. These assignments are written in tables. The purpose of this exercise is to find the IUPAC table for nucleotide symbols. Proceed as follows:

================================= End Exercise 4

A biological sequence can be characterised by counting its composition. Knowing that your sequence contains a certain number of residues matters very little by itself, however, because you want to relate this number to other residues or to other sequences. Therefore, you need to normalise the numbers. Two procedures are applied:


Subsection 8.2.2

Detailed View on the "windows" Technique

Consider the following sequence:

 

  
    tgatggtcaagtaaactatgaagagtttgtacaaatgatgacagcaaagtgcgaagac
  

  
This sequence fragment has a length of 58 base pairs. If you add the numbers for G (15) and C (7), you end up with a total of 22. Your sequence, therefore, has a G/C content of (22/58*100) = 38%.
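The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name gc_content is an assumption made for illustration, and the sequence is assumed to contain plain lower- or upper-case nucleotide letters.

```python
def gc_content(seq: str) -> float:
    """Return the G/C content of `seq` as a percentage of its length."""
    seq = seq.lower()
    gc = seq.count("g") + seq.count("c")
    return 100.0 * gc / len(seq)


SEQ = "tgatggtcaagtaaactatgaagagtttgtacaaatgatgacagcaaagtgcgaagac"

print(len(SEQ), SEQ.count("g"), SEQ.count("c"))  # 58 15 7
print(round(gc_content(SEQ)))                    # 38
```

Dividing by the sequence length is what makes the value comparable between sequences of different lengths.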

Next, let us analyse this sequence with a window of size 8. This window is symbolised as |------| in the plot below. We count the composition of the first fragment, tgatggtc: three G's and one C. This gives a total of 4, which we enter at the middle of our window of 8, at position 1 + 8/2 = 5.

 
       ^
no. of | 8
G or C | 7
found  | 6
in 8   + 5
       | 4     x
       | 3
       | 2
       | 1
       | 0
           tgatggtcaagtaaactatgaagagtttgtacaaatgatgacagcaaagtgcgaagac
           ----+----+----+----+----+----+----+----+----+----+----+------------->
               5   10   15   20   25   30   35   40   45   50   55   sequence

           |------|   --> moving this window of 8 along the sequence
  
Our window started at position 1. We then shift the window along the sequence in increments of 4 (an increment of 1 would be possible, but we use a larger one here to reduce the work). This means that the window now starts at position 5, and we plot at position 5 + 8/2 = 9 or, expressed as a formula, [start of window] + [size of window] / 2. The second window is therefore ggtcaagt, which has three G's and one C. We plot this result at position (9, 4) of our graph:
 
       ^
no. of | 8
G or C | 7
found  | 6
in 8   + 5
       | 4     x   x
       | 3
       | 2
       | 1
       | 0
           tgatggtcaagtaaactatgaagagtttgtacaaatgatgacagcaaagtgcgaagac
           ----+----+----+----+----+----+----+----+----+----+----+------------->
               5   10   15   20   25   30   35   40   45   50   55   sequence

               |------|   --> moving this window of 8 along the sequence
  
Continuing, the next window starts at position 9 (5 plus the increment of 4 selected above) and has the composition aagtaaac. This time, the number of (G or C) is two, and we plot at position 9 + 8/2 = 13, that is, at (13, 2):
 
       ^
no. of | 8
G or C | 7
found  | 6
in 8   + 5
       | 4     x   x
       | 3
       | 2             x
       | 1
       | 0
           tgatggtcaagtaaactatgaagagtttgtacaaatgatgacagcaaagtgcgaagac
           ----+----+----+----+----+----+----+----+----+----+----+------------->
               5   10   15   20   25   30   35   40   45   50   55   sequence

                   |------|   --> moving this window of 8 along the sequence
  
You might want to complete the plot yourself. Such a plot visualises the G/C richness of the sequence as a function of position along the sequence, which allows conclusions about the functionality of this DNA fragment.
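If you would like to check your hand-drawn plot, the procedure can be sketched in Python as follows. This is a minimal illustration and not the implementation used by any particular program; the function name gc_window_profile is an assumption. It uses 1-based window starts and the plotting rule [start of window] + [size of window] / 2 from the text, and it stops as soon as a full window no longer fits into the sequence.

```python
def gc_window_profile(seq: str, size: int = 8, step: int = 4):
    """Return (plot position, G+C count) pairs for a sliding window.

    Window starts are 1-based; each value is assigned to the position
    start + size // 2, i.e. the middle of the window.
    """
    seq = seq.lower()
    points = []
    # Last possible 1-based start is len(seq) - size + 1.
    for start in range(1, len(seq) - size + 2, step):
        window = seq[start - 1:start - 1 + size]
        count = window.count("g") + window.count("c")
        points.append((start + size // 2, count))
    return points


SEQ = "tgatggtcaagtaaactatgaagagtttgtacaaatgatgacagcaaagtgcgaagac"

profile = gc_window_profile(SEQ)
print(profile[:3])  # [(5, 4), (9, 4), (13, 2)]

# Crude text rendering: one line per window, one 'x' per G or C found.
for position, count in profile:
    print(f"{position:3d} {'x' * count}")
```

Setting step=1 gives one point per sequence position at the cost of more computation, which is negligible for a program but tedious by hand.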

This technique is not restricted to DNA sequences. However, the protein alphabet offers no ready-made group symbols comparable to S, because the 20 amino acid symbols already use up most of the letters of the alphabet. The trick is to change the sequence artificially; you will try this in an exercise later.
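One possible reading of "changing the sequence artificially" is to recode the protein so that every residue of interest becomes a single marker symbol, which can then be counted in windows exactly like G or C. The Python sketch below follows that assumption; the choice of charged residues (D, E, K, R), the marker characters, the function names, and the example peptide are all hypothetical and serve only as illustration.

```python
# Assumption: we are interested in charged residues (Asp, Glu, Lys, Arg).
CHARGED = set("DEKR")


def recode(protein: str, group=CHARGED, mark: str = "x", other: str = "-") -> str:
    """Replace residues in `group` by `mark` and all others by `other`."""
    return "".join(mark if aa in group else other for aa in protein.upper())


def count_marks(recoded: str, size: int, step: int, mark: str = "x"):
    """Count `mark` in sliding windows; return (plot position, count) pairs."""
    return [
        (start + size // 2, recoded[start - 1:start - 1 + size].count(mark))
        for start in range(1, len(recoded) - size + 2, step)
    ]


peptide = "MKDERLAG"          # hypothetical example peptide
recoded = recode(peptide)
print(recoded)                # -xxxx---
print(count_marks(recoded, size=4, step=2))
```

After recoding, the windowing step no longer needs to know anything about amino acids; it only counts one symbol.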

================================= Begin Exercise 5

DNA composition: Determine the G/C content of a DNA sequence as a function of position along the sequence.

In order to determine the G/C content, follow this schedule:

================================= End Exercise 5


Subsection 8.2.3

Programs

NOTE: Programs which produce graphics are marked with an asterisk (*).


Subsection 8.2.4

Effect of the Window Size

Windows are a general concept that is not specific to the 'window' program or to GCG programs in general. The use of averaging techniques, such as windows, is essential in BioComputing and will also be used in secondary structure prediction of proteins and in reading frame determination.

The larger the window, the finer the resolution of the curve, because the number of possible values for the count increases. E.g., a window size of 30 allows up to 30 occurrences of "S" (31 possible values, 0 to 30), whereas a window size of 5 allows only six different values (0 to 5).

The smaller the window, the more precisely a given effect can be located. Values computed for a window are plotted at the middle of that window, so a window of 30 carries a positional uncertainty of about fifteen positions to either side.
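The two opposing effects can be put into numbers with a short Python sketch. The function name window_tradeoff is an assumption made for illustration. It relies only on two facts: a window of a given size can contain between 0 and that many matching symbols, and the plotted midpoint lies about half a window away from the window's edges.

```python
def window_tradeoff(size: int):
    """Return (number of possible count values, half-width of the window)."""
    levels = size + 1        # counts 0, 1, ..., size
    half_width = size // 2   # distance from the plotted midpoint to the edge
    return levels, half_width


for size in (5, 30):
    print(size, window_tradeoff(size))
# 5 (6, 2)
# 30 (31, 15)
```

A larger window therefore buys a finer scale of values at the price of a coarser localisation, and the right choice depends on which of the two matters more for the question at hand.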

