A very important prerequisite of a biological sequence is a defined alphabet, which lists the allowed symbols and their meaning. The DNA alphabet looks rather simple at first glance: A, G, C, T, U, N (any). However, in order to express properties that nucleotides have in common, the IUPAC has defined so-called "ambiguity symbols". The letter S, for example, stands for either a G or a C character.
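The idea of an ambiguity symbol can be sketched in a few lines of Python. This is only an illustration: the table below is a deliberately partial subset of the IUPAC nucleotide symbols (the complete table is the subject of the exercise that follows), and the names `IUPAC_SUBSET`, `matches` and `count_symbol` are made up for this sketch.

```python
# Partial table of IUPAC nucleotide symbols (illustrative subset only).
IUPAC_SUBSET = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "S": "GC",    # either G or C
    "W": "AT",    # either A or T
    "R": "AG",    # purine: A or G
    "Y": "CT",    # pyrimidine: C or T
    "N": "ACGT",  # any base
}


def matches(symbol, base):
    """Return True if the concrete base is covered by the given symbol."""
    return base.upper() in IUPAC_SUBSET[symbol.upper()]


def count_symbol(sequence, symbol):
    """Count the bases in the sequence that are covered by the symbol."""
    return sum(1 for base in sequence if matches(symbol, base))


print(matches("S", "g"))              # True
print(matches("S", "a"))              # False
print(count_symbol("tgatggtc", "S"))  # 4
```

Counting "S" in a fragment is therefore the same as counting G and C together.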
================================= Begin Exercise 4
A small hunting exercise: Find the DNA alphabet.
In order to use biological sequences, the computer utilises a defined alphabet which assigns nucleotides or amino acids to single letters. These assignments are written in tables.
The purpose of this exercise is to find the IUPAC table for nucleotide symbols. Proceed as follows:
================================= End Exercise 4
A biological sequence can be characterised by counting its composition. However, knowing that your sequence contains a certain number of residues matters very little by itself, because you want to correlate this number with either other residues or other sequences. Therefore, you need to normalise the numbers. Two procedures are applied:
The data are expressed as percent (%) of the whole sequence. Basically, you normalise the length of the entire sequence to 100 and determine the (fictitious) composition of this sequence. You can then compare sequences easily by these numbers without knowing how many residues or base pairs your protein or DNA sequence has. E.g., if a protein has 33% glycine, this is a very high number and might be significant for a given class of proteins (e.g., collagens).
Sequences will be of different length, or contain several domains. Therefore, in order to compare fragments, you consider only a part of the sequence, which is shorter than the entire sequence. This is an essential concept in biocomputing methods and is called a window. You determine the desired figure of composition in this window only, and plot it against the position in the entire sequence.
Consider the following sequence:
Next, let us analyse this sequence with a window of size 8. This window is symbolised as |------| in the plot below. We count the composition in the first fragment (tgatggtc): three G's and one C. This corresponds to a total value of 4, and we enter this value in the middle of our window of 8, which is at position 4.
This technique is not restricted to DNA sequences. However, the protein alphabet offers no default symbols for groups of residues, as the 20 amino acid symbols already take up most of the alphabet. The trick is to change the sequence artificially; you will try this in an exercise later.
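One possible reading of "changing the sequence artificially" is to recode the protein so that every residue of interest becomes a single marker symbol, which can then be counted in windows just like "S" in DNA. The Python sketch below rests entirely on that assumption and is not necessarily the procedure used in the later exercise; the residue group, the marker characters and the example peptide are all made up for illustration.

```python
# Hypothetical residue group, chosen only for illustration.
HYDROPHOBIC = set("AVLIMFW")


def recode(protein, group=HYDROPHOBIC, marker="x", other="-"):
    """Replace every residue in `group` by `marker` and all others by `other`."""
    return "".join(marker if residue.upper() in group else other
                   for residue in protein)


def window_counts(recoded, size, marker="x"):
    """Count the marker in every window of `size`, shifting by one position."""
    return [recoded[i:i + size].count(marker)
            for i in range(len(recoded) - size + 1)]


example = "MKVLAAGIWDEK"           # made-up peptide, not from the text
recoded = recode(example)
print(recoded)                     # x-xxxx-xx---
print(window_counts(recoded, 5))   # [4, 4, 4, 4, 4, 3, 2, 2]
```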
================================= Begin Exercise 5
DNA composition: Determine the G/C content of a DNA sequence as a function of the position in the sequence.
In order to determine the G/C content, follow this schedule:
================================= End Exercise 5
NOTE: Programs which produce graphics are marked with an asterisk (*).
Windows are a general concept which is not specific to the 'window' program or to GCG programs in general. The use of averaging techniques, such as windows, is essential in BioComputing and will also be used in secondary structure prediction of proteins, or in reading frame determination.
The larger the window, the finer the resolution of the resulting curve, as the number of patterns that can be found or not found within one window increases. E.g., a window size of 30 allows anything from 0 to 30 occurrences of "S", whereas a window size of 5 can only yield the six values 0 to 5.
The smaller the window, the more precise will be the location of a given effect. Values computed for a given window are plotted at the middle of the window. A window of 30 therefore has a positional uncertainty of fifteen residues in either direction.
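The trade-off between resolution and localisation can be made concrete with a small Python sketch. It uses the example sequence of this chapter and shifts the window by one position at a time; the function name `gc_counts` is an assumption of this sketch, not part of any GCG program.

```python
SEQUENCE = "tgatggtcaagtaaactatgaagagtttgtacaaatgatgacagcaaagtgcgaagac"


def gc_counts(sequence, size):
    """G+C count for every full window of `size`, shifted by one position."""
    return [sum(1 for base in sequence[i:i + size].lower() if base in "gc")
            for i in range(len(sequence) - size + 1)]


for size in (5, 30):
    counts = gc_counts(SEQUENCE, size)
    print(f"window {size}: {size + 1} possible values (0..{size}), "
          f"{len(counts)} windows, "
          f"plotted point up to {size // 2} positions from the window edge")
```

A window of 30 can distinguish 31 levels but smears each value over 30 positions, whereas a window of 5 pins the position down more precisely but offers only six levels.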
Subsection 8.2.1 Principle
Subsection 8.2.2 Detailed View on the "windows" Technique
tgatggtcaagtaaactatgaagagtttgtacaaatgatgacagcaaagtgcgaagac
This sequence fragment has a length of 58 base pairs. If you add the numbers for G (15) and C (7), you end up with a total of 22. Your sequence, therefore, has a G/C content of (22/58*100) = 38%.
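The arithmetic above can be checked with a short Python sketch. The helper names `percent_composition` and `gc_percent` are made up for this illustration, and the sketch assumes a non-empty sequence.

```python
from collections import Counter

SEQUENCE = "tgatggtcaagtaaactatgaagagtttgtacaaatgatgacagcaaagtgcgaagac"


def percent_composition(sequence):
    """Express the count of each symbol as a percentage of the sequence length."""
    counts = Counter(sequence.lower())
    total = len(sequence)
    return {symbol: 100.0 * n / total for symbol, n in counts.items()}


def gc_percent(sequence):
    """Percentage of positions that are G or C."""
    counts = Counter(sequence.lower())
    return 100.0 * (counts["g"] + counts["c"]) / len(sequence)


counts = Counter(SEQUENCE)
print(len(SEQUENCE), counts["g"], counts["c"])  # 58 15 7
print(round(gc_percent(SEQUENCE)))              # 38
```

Because the result is a percentage, it can be compared directly with the value obtained for a sequence of any other length.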
         ^
no. of   | 8
G or C   | 7
found    | 6
in 8     + 5
         | 4    x
         | 3
         | 2
         | 1
         | 0
             tgatggtcaagtaaactatgaagagtttgtacaaatgatgacagcaaagtgcgaagac
             ----+----+----+----+----+----+----+----+----+----+----+------------->
                 5   10   15   20   25   30   35   40   45   50   55   sequence
             |------| --> moving this window of 8 along the sequence
Our window started at position 1. We then shift the window along the sequence in increments of 4 (an increment of 1 would be possible, but we use a larger one here in order to reduce work). This means that the window now starts at position 5, and we plot at position 5 + 8/2 = 9 or, expressed as a formula, [start of window] + [size of window] / 2. The second window, therefore, is ggtcaagt, which has three G's and one C. We plot this result at the point (9, 4) of our graph:
         ^
no. of   | 8
G or C   | 7
found    | 6
in 8     + 5
         | 4    x    x
         | 3
         | 2
         | 1
         | 0
             tgatggtcaagtaaactatgaagagtttgtacaaatgatgacagcaaagtgcgaagac
             ----+----+----+----+----+----+----+----+----+----+----+------------->
                 5   10   15   20   25   30   35   40   45   50   55   sequence
                 |------| --> moving this window of 8 along the sequence
Continuing, the next window starts at position 9 (5 plus the increment of 4, which we selected above) and has the composition aagtaaac. This time, the number of (G or C) is two, and we plot at position 9 + 8/2 = 13, i.e. at the point (13, 2):
         ^
no. of   | 8
G or C   | 7
found    | 6
in 8     + 5
         | 4    x    x
         | 3
         | 2             x
         | 1
         | 0
             tgatggtcaagtaaactatgaagagtttgtacaaatgatgacagcaaagtgcgaagac
             ----+----+----+----+----+----+----+----+----+----+----+------------->
                 5   10   15   20   25   30   35   40   45   50   55   sequence
                     |------| --> moving this window of 8 along the sequence
You might want to complete the plot yourself. Such a plot visualises the G/C richness of the sequence as a function of the position in the sequence, which allows conclusions on the functionality of this DNA fragment.
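The complete walk along the sequence can be sketched in Python as follows. This is a minimal illustration, not the GCG implementation: the function name `window_profile` and its parameters are assumptions, positions are counted from 1, and the plot position follows the formula [start of window] + [size of window] / 2 given above. For the first window this formula yields 5, whereas the walk-through enters the first value at position 4; the true midpoint of positions 1 to 8 lies between the two.

```python
SEQUENCE = "tgatggtcaagtaaactatgaagagtttgtacaaatgatgacagcaaagtgcgaagac"


def window_profile(sequence, size=8, step=4, symbols="gc"):
    """Return (plot position, count) pairs for every full window.

    Window starts and plot positions are 1-based, as in the text.
    """
    points = []
    for start in range(1, len(sequence) - size + 2, step):
        fragment = sequence[start - 1:start - 1 + size].lower()
        count = sum(1 for base in fragment if base in symbols)
        points.append((start + size // 2, count))
    return points


profile = window_profile(SEQUENCE)
print(profile[:3])   # [(5, 4), (9, 4), (13, 2)]
print(len(profile))  # 13
```

Setting `step=1` gives the finest curve at the cost of more windows to evaluate.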
Subsection 8.2.3 Programs
Subsection 8.2.4 Effect of the Window Size