JAMF Archive: BioCompanion as published in 1995. THIS IS THE REFERENCE CODE AS PUBLISHED. Doelz, R. Optimal production of biological documentation: the JAM format. Comput. Applic. Biosci. 11, 224-226 (1995). The version you are currently viewing is the one printed and distributed via the Internet from the server of BioComputing Basel. Version 3.1 of the BioCompanion was published with version 2 of the JAMF software. The server that was indicated in the documentation has ceased to exist. Version 3.2 of the BioCompanion was not publicly available for free but was shareware that was distributed with GCG's software release 9. For the purpose of enhanced editing, JAMF was partially rewritten and the proprietary version 3.x of JAMF was used from 1996 onwards. The BioCompanion is available in a current version from the publisher. It has significantly changed both in software and content.
Once a sequence search is completed, the question arises whether
the sequences found are also similar to one another.
This can be determined in either automatic or manual fashion by
using programs which
align the sequences of interest.
If you painted a map from the result of your sequence search
as described
earlier, it might be obvious that sequences usually
share similarity only in parts. This leaves the ends or overhanging
parts of two sequences badly aligned due to low similarity. Therefore,
before alignments are attempted, it is good practice to create
sequence fragments of approximately the same length, which
allows programs to operate more easily.
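Cutting a fragment out of a sequence can be pictured with a small sketch. It is written in Python, which is not part of the GCG package, and the function name fragment is made up for illustration. Positions are counted from 1 and are inclusive, as in the Begin and End prompts of the GCG programs.

```python
def fragment(sequence, begin, end):
    """Return residues begin..end, counted from 1 and inclusive."""
    if not 1 <= begin <= end <= len(sequence):
        raise ValueError("range outside the sequence")
    return sequence[begin - 1:end]

# First ten residues of the mchu example used later in this chapter.
seq = "MADQLTEEQI"
print(fragment(seq, 3, 7))  # prints DQLTE
```

Within GCG itself the same effect is obtained by answering the Begin and End prompts with the desired positions.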
If sequences are not specifically tailored for multiple
sequence alignment, programs might fail or report alignments
unreliably.
The approach used for automatic sequence alignment can be
described as "clustering" of the most similar sequences. In a
first step, the program needs to find the sequence pair(s)
which share(s) the most obvious similarity. To achieve this,
each sequence is compared to every other, which results in roughly (n*n)/2
comparisons (exactly n*(n-1)/2) if we have n sequences
to compare. As in rigorous sequence searching, a comparison is
made using sequence comparison tables to compute the best possible
alignment and score it appropriately. (Note that the scores
will not be as desired if the sequences have not been tailored
as mentioned above).
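The each-to-each comparison can be enumerated with a short sketch. It is written in Python, which is not part of the GCG package. The function name all_pairs is made up for illustration, and the four names are those of the pileup example in this chapter.

```python
from itertools import combinations

def all_pairs(names):
    """Return every unordered pair of sequence names (each compared to each)."""
    return list(combinations(names, 2))

names = ["MCHU", "MCRB", "MCRT", "MCBO"]
pairs = all_pairs(names)
n = len(names)
# n sequences give n*(n-1)/2 pairs, which approaches (n*n)/2 for large n.
print(len(pairs), n * (n - 1) // 2)  # prints 6 6
```

Four sequences therefore need six comparisons, while forty sequences already need 780, which is why the pairwise step dominates the run time for large sets.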
Once the comparison for each possible sequence pair has been
completed, the "best" candidates serve as nuclei, and additional
sequences are aligned to the already existing alignment. This
works well with similar proteins, but too many gaps, in particular
at the DNA level, will most probably not yield the desired result.
The largest errors occur if regions with low similarity
are used as the "closest" set, as these make it difficult for additional
sequences to be matched.
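The order in which sequences join a growing alignment can be sketched as follows. This is a deliberately simplified greedy scheme in Python and is not claimed to be the clustering used by any GCG program. The function name join_order and all scores are invented for illustration.

```python
def join_order(scores):
    """Return the order in which sequences would join a growing alignment.

    scores maps frozenset({a, b}) to a pairwise similarity score.
    """
    names = set().union(*scores)
    # The best-scoring pair serves as the nucleus.
    best = max(scores, key=scores.get)
    order = sorted(best)
    remaining = names - best
    while remaining:
        # Add the sequence most similar to any sequence already aligned.
        nxt = max(
            sorted(remaining),
            key=lambda s: max(scores[frozenset((s, m))] for m in order),
        )
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Hypothetical scores, invented for illustration only.
scores = {
    frozenset(("A", "B")): 1.50,
    frozenset(("A", "C")): 0.90,
    frozenset(("A", "D")): 0.40,
    frozenset(("B", "C")): 1.10,
    frozenset(("B", "D")): 0.45,
    frozenset(("C", "D")): 0.60,
}
print(join_order(scores))  # prints ['A', 'B', 'C', 'D']
```

If the nucleus pair had been chosen from a poorly scoring region, every later sequence would be fitted to a weak starting alignment, which is the error source described above.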
If problems are encountered because similarity cannot
be determined well enough automatically, either manual alignment
is required or the selection of sequences must be improved by
tailoring or omission of very remotely related fragments.
The result of a multiple sequence alignment is a block
of sequences displayed on top of each other. Programs
exist which plot the degree of similarity along the sequence
coordinate. Other programs allow you to print or plot the output
nicely. The GCG programs also produce a figure which schematically
displays the level of similarity
as a dendrogram. As outlined below, the dendrogram, which
illustrates sequence similarity, must not be mistaken
for a phylogenetic tree; it can, however, be used to verify
that the alignment proceeded as expected.
Multiple Sequence Alignment is NOT the tool for you if you
are working on fragment assembly or shotgun sequencing. In order
to align multiple sequences reliably, the similarity amongst
the members of the alignment should be extensive along the entire
length rather than only overlapping fragments.
If you start from scratch, use the command
$ lineup
The screen will ask
Add a New Sequence
Move to the command line with <CTRL><Z>,
give the command 'new', type the new sequence name.
Move to the command line with <CTRL><Z>,
give the 'get' command, and type the new sequence name. The
sequence given is either a sequence in your own directory as
created with commands from the GCG package or a sequence from
the database.
The
'lineup' editor works similarly to the
'seqed' program discussed earlier for single sequence input.
However, as multiple sequences are shown on several lines, the
<CURSOR-UP> and <CURSOR-DOWN> keys are used to
jump between different sequences in the alignment. The period
(.) key is used to insert gaps.
CAUTION: If you have a key mapping file in the current
directory, such as one used for sophisticated use of the 'seqed'
program, the period might be missing from it and will then not work
in 'lineup'. SOLUTION: Delete the file or add the period accordingly.
Consensus Calculation
One of the sequences (the one at line 0) is
special: it may hold a consensus sequence which is automatically
updated upon gap insertion or sequence
shifting. To activate this mechanism,
move to the command line (<CTRL><Z>) and type auto.
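What a consensus line expresses can be illustrated with a minimal sketch in Python. The rule used here, taking the most frequent non-gap symbol of each column, is an assumption made for illustration; the rule applied by 'lineup' is not stated in this text. The function name consensus and the toy alignment are invented.

```python
from collections import Counter

def consensus(aligned, gap="."):
    """Return the most frequent non-gap symbol of each alignment column."""
    result = []
    for column in zip(*aligned):
        counts = Counter(sym for sym in column if sym != gap)
        # A column consisting only of gaps stays a gap in the consensus.
        result.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(result)

# Toy alignment, invented for illustration; '.' marks a gap.
aligned = ["MADQLT", "MAD.LT", "MSDQLT"]
print(consensus(aligned))  # prints MADQLT
```

Inserting a gap or shifting a sequence changes the columns, so the consensus has to be recomputed, which is what the automatic update does for you.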
Get Help
Move to the command line (<CTRL><Z>),
and type help.
Exit
Move to the command line (<CTRL><Z>),
and type exit.
If you used the lineup program earlier, it
will have created a so-called file of sequence names (FOSN, extension
*.fil) and numerous fragments which represent your sequences
in their lined-up form. For example, to reload an alignment of
the group eco, call the lineup program with
$ lineup eco
If you used another
program to produce a multisequence alignment (e.g., the program
pileup), the result might be in the multiple sequence
format (MSF, extension *.msf). To use 'lineup' on a file called
eco.msf, call it as
$ lineup/msf eco
You can convert each of the formats
into the other with the command reformat. Use the corresponding section of
genhelp to learn how to convert MSF
to FOSN format and FOSN to MSF format.
If you need to name more
than one sequence, you can use asterisks (*) as "wildcards".
(See the section on file handling
for the use of wildcards in file naming conventions). The GCG
programs, however, also allow you to write a file which contains only
filenames rather than the files themselves. To create such a list
file, call the system editor
(see the section on editing)
and enter a line with two periods ("..") as the
first line, followed by the file names of all sequences you are interested
in. You can mix your own sequences with names
from the database. Refer to the section on
Lists for details on GCG list files.
As described above, you can use several GCG and GCG-like programs
to produce a file of sequence names (also known as
Lists). Remember that the file should contain only sequence
names after the two periods. (This is taken care of by the GCG
programs automatically if applied correctly).
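The layout of such a file can be sketched in Python, which is not part of the GCG package. The function name list_file_text and the two entries are hypothetical; only the layout, a line with two periods followed by sequence names, is taken from the description above.

```python
def list_file_text(names):
    """Build list-file text: a '..' line, then one sequence name per line."""
    return "\n".join([".."] + list(names)) + "\n"

# Hypothetical entries: one local file and one database name.
text = list_file_text(["my.seq", "protein:mchu"])
print(text, end="")
```

Saved under a name such as my.fil, the text could then be passed to a GCG program as @my.fil.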
The GCG program pileup can align many sequences
by specifying them either as single files using wildcards (e.g.,
*.seq) or by using a file of sequence names and specifying these
as @my.fil (if the file my.fil contains the sequence names).
(This is usually called automated multisequence alignment).
As an example, the result of pileup utilising data from a
findpatterns run is shown:
$ pileup
The program pileup generates an output file
which shows the results of the clustering process.
NOTE: This visual representation of sequence similarity
must not be used as a phylogenetic tree because the length and
ordering of the branches are based on sequence similarity and not
on phylogenetic algorithms. For coarse reviewing of sequence
relationships, however, the dendrogram could be considered.
Otherwise, use the programs distances
and growtree as described below. To visualise
the dendrogram, remember that you need to define the plotting
environment with setplot if you did not do
this earlier or if you do not work with the Wisconsin Package Interface (WPI).
If necessary, define the
X-Windows environment correctly. Next, issue the command
$ figure pileup.figure
pretty generates an output file which shows
the results of the automatic sequence alignment letter by letter.
If you happen to have the
extensions to the GCG programs installed, you can also use
the command prettyplot (written by Peter Rice
at EMBL), which uses graphics (use setplot if needed).
'prettyplot' takes the same command options as pretty. Additionally,
you might try prettybox to shade the symbols
(PostScript printer required). To visualise the alignment, you
can use a variety of special command line parameters. To list them, use the
check option:
$ pretty /check
It is important that you specify the multiple
sequence alignment correctly, e.g.,
$ pretty/cons/diff="-" pileup.msf{*}
Improving 'pretty' Output
Sometimes, the file name descriptors
of the pretty output file are not needed. In this case, the
replace program can be used to have the file name
replaced by spaces. To accomplish this, create a text file (see
the editing section for
help on how to edit files) and write two periods, as well as
the string to be replaced and the replacement string. If you use all default settings of pileup,
such a file would be named my.replace. The command to be issued for replacement is
$ replace pretty.pretty my.replace
new.pretty
Improving 'prettyplot' Output
Occasionally, the file name descriptors of the 'prettyplot' output
file are not needed. To accomplish the removal, you can either
proceed as above, or specify
$ prettyplot/cons/diff="-"/shortname pileup.msf{*}
The program plotsimilarity
uses a window to slide across the sequence alignment
and plots the similarity of the sequences. To learn more about
the options of plotsimilarity, use the check
option. This program requires graphics and should only be used
after the plotting environment has been defined (setplot).
The
sequence alignment should be specified as an msf file, e.g.,
pileup.msf{*}.
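The sliding-window idea can be sketched in Python. The scoring used here, the fraction of identical residue pairs per column, is an assumption chosen for simplicity and is not the scoring of plotsimilarity; the function names are made up for illustration.

```python
def column_identity(column):
    """Fraction of identical symbol pairs in one alignment column."""
    pairs = [(a, b) for i, a in enumerate(column) for b in column[i + 1:]]
    # Two gaps ('.') facing each other are not counted as identical.
    same = sum(1 for a, b in pairs if a == b and a != ".")
    return same / len(pairs)

def window_similarity(aligned, window):
    """Average column identity for each window position along the alignment."""
    per_column = [column_identity(col) for col in zip(*aligned)]
    return [
        sum(per_column[i:i + window]) / window
        for i in range(len(per_column) - window + 1)
    ]

# Toy alignment of two sequences, invented for illustration.
print(window_similarity(["AAAA", "AAAT"], 2))  # prints [1.0, 1.0, 0.5]
```

A larger window smooths the curve and hides short dissimilar stretches, while a smaller window shows them at the price of a noisier plot.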
The program 'distances'
writes a text file with a matrix showing each-to-each comparison
scores. It has been changed in version 8 of the GCG software
and is described below.
The program pileup (see
above) is used to create a multiple sequence alignment.
Then, the program distances is used to compute
the distances between the aligned sequences, and
growtree can draw a figure which shows the sequences
arranged in a tree which represents the distances as computed
from the alignment.
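How a distance matrix follows from an alignment can be sketched in Python. The measure used here, the uncorrected fraction of differing positions, is an assumption for illustration and is not claimed to be what the 'distances' program computes; the function names and the toy alignment are invented.

```python
def p_distance(a, b, gap="."):
    """Fraction of differing positions, ignoring columns with a gap."""
    compared = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not compared:
        raise ValueError("no ungapped columns in common")
    return sum(1 for x, y in compared if x != y) / len(compared)

def distance_matrix(aligned):
    """Return {(name1, name2): distance} for every ordered pair of names."""
    names = sorted(aligned)
    return {(p, q): p_distance(aligned[p], aligned[q]) for p in names for q in names}

# Toy alignment, invented for illustration; '.' marks a gap.
aligned = {"s1": "MADQLT", "s2": "MAD.LT", "s3": "MSDQLA"}
matrix = distance_matrix(aligned)
print(matrix[("s1", "s2")], matrix[("s2", "s3")])  # prints 0.0 0.4
```

Every value depends on which residues the alignment has placed in the same column, so a misplaced gap changes the distances and, with them, the tree.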
NOTE: This way of tree creation depends entirely on the
alignment. If the similarity of the sequences considered is low,
the alignment produced by 'pileup' might be faulty and require
manual refinement. The resulting tree must then be considered to be
of significantly reduced value, as errors in the alignment will
strongly affect it.
At the time of writing,
the programs paupsearch and paupdisplay
were unfortunately not included in the GCG distribution.
The
EGCG program tophylip allows conversion
to the format required by the 'Phylip' program suite. Phylip
(by J. Felsenstein) and other packages are also available
on personal computers and Macintosh computers.
Various public domain and shareware
programs might be useful to you. Refer to the
discussion on additional software earlier in the BioCompanion
for benefits and problems of adding software.
The programs for pairwise comparison, such as gap, must be run with the "out"
option in order to generate files suitable for manual alignment:
$ gap/out
See the section on pairwise comparison for details of presentation
of the output. As described in the description of the
lineup editor
above, sequence files are loaded into 'lineup' with the
get command. The manipulation of the .out files is straightforward.
NOTE: The manual creation of multiple sequence alignments
might be considerably influenced by the user and is not recommended
for publication or tree creation.
Rigorous searching implements the alignment methods used by
programs like bestfit in a sequence database
searching routine. The usefulness of this searching
can be enhanced by using so-called profiles: once
a sequence search has revealed homologies to several sequences, it
is desirable to identify shared regions of homology in a multiple
sequence alignment. The information contained in the alignment can
then be reused in further analyses and searches.
Remote similarities in the "twilight zone" are not necessarily
easily detected by heuristic searching methods. Various algorithms
implement alignment procedures known from pairwise alignments
but require significantly more resources. The GCG program package
currently features the profile search method from Gribskov et
al.
Profile searching combines the benefit of
comparison matrices with the sequence-specific
allowance of exchanges already used in the
pattern approach. However, the substitutions in patterns
follow a yes/no scheme. To enhance sensitivity,
the matrix values for a given exchange in profiles are weighted
according to the observed alignment.
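The difference between a yes/no pattern and a weighted profile can be sketched in Python. This is a simplification: the weights here are plain residue frequencies per column, whereas the profile method described above also uses a comparison matrix. All function names and the toy alignment are invented for illustration.

```python
from collections import Counter

def make_pattern(aligned):
    """Per column, the set of residues observed (a yes/no rule)."""
    return [set(col) for col in zip(*aligned)]

def make_profile(aligned):
    """Per column, the observed frequency of each residue (a weighted rule)."""
    n = len(aligned)
    return [{res: cnt / n for res, cnt in Counter(col).items()}
            for col in zip(*aligned)]

def pattern_match(pattern, seq):
    """True only if every residue is allowed at its position."""
    return all(res in allowed for res, allowed in zip(seq, pattern))

def profile_score(profile, seq):
    """Sum of per-position weights; unseen residues contribute 0."""
    return sum(col.get(res, 0.0) for res, col in zip(seq, profile))

# Toy alignment, invented for illustration.
aligned = ["MADQ", "MADK", "MSDQ", "MADQ"]
pattern, profile = make_pattern(aligned), make_profile(aligned)
for candidate in ("MADQ", "MSDK", "MTDQ"):
    print(candidate, pattern_match(pattern, candidate),
          profile_score(profile, candidate))
```

The pattern rejects MTDQ outright because of a single unseen residue, while the profile still gives it a score (2.75) above that of MSDK (2.5), which the pattern accepts. This graded behaviour is what makes profiles more sensitive to remote similarities.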
Profile searching is a complex method and depends heavily on
the diversity of the input sequences to justify the extensive work involved.
Please make sure that you have read suitable introductory literature.
The
GCG Program Reference Manual, for example, has a Profile
Analysis Essay which you should read before you use
the methods extensively.
The data used for profile searching must be in GCG format.
Use reformat or genmanual sequence_exchange
for details. For multiple sequence alignment, there
are several possible file formats to start with.
The program profilemake generates a profile
from a set of aligned sequences in msf format.
The program profilesearch
uses a profile generated by profilemake
and produces a listing of best-fitting sequences in a database.
To align these with the profile, the program profilesegments
is required. The
EGCG program tprofilesearch allows searching
with a protein profile in a DNA database translated on the fly.
The program profilegap uses a profile generated
by profilemake and compares it to a sequence
using an end-to-end alignment algorithm (gap).
================================= Begin Exercise 13
Understand the benefit, scope and limitations of a rigorous
searching method. Generate a profile and show the difference
in searching the alignment vs. searching the consensus.
================================= End Exercise 13
Principle of Multiple Sequence Alignment
Prerequisites
Finding the Best
Grouping
Result Evaluation
Limitations
Programs to Deal with Multiple Sequences
Manual Editing with the Multisequence Editor
Lineup of what sequence group ?
You are then requested to
type in a name (try to use fewer than 9 characters, and use only
letters or numbers). Use all-lowercase letters. The screen
will then open with a display similar
to that of the normal sequence editor
seqed.
<CTRL><Z>
: new
Create a NEW sequence (ten letter max.):
test1
Move to desired position with arrows. Press <RETURN>
to select position.
<CURSOR-UP> <RETURN>
NEW sequence named test1 placed at 0,1
Include an Existing Sequence
<CTRL><Z>
: get
GET what sequence:
protein:mchu
LINEUP get of protein:mchu
Begin (* 1 *) ? <RETURN>
End (* 149 *) ? <RETURN>
Reverse (* No *) ? <RETURN>
That Begins: MADQLTEEQI
and Ends: EEFQMMTAK
Is this what you want to include (* Yes *) ? <RETURN>
Move to desired position with arrows. Press RETURN to
select position.
<CURSOR-UP> <RETURN>
Enter name (ten letter max.): Mchu <RETURN>
LINEUP get of "protein:mchu" from: 1 to: 149
....
Navigation Manual Editing of Sequence Alignments
Manual Editing of File of Sequence Names
Automatic Creation of File of Sequence Names
Automatic Generation of a Multiple Sequence Alignment
PILEUP creates a multiple sequence alignment from a group of
related sequences using progressive, pairwise alignments. It can
also plot a tree showing the clustering relationships used to
create the alignment.
PileUp of what sequences ? @findpatterns.find
1 MCHU 149 aa
2 MCRB 148 aa
3 MCRT 149 aa
4 MCBO 148 aa
What is the gap weight (* 3.00 *) ? <RETURN>
What is the gap length weight (* 0.10 *) ? <RETURN>
This program can display the clustering relationships graphically.
Do you want to:
A) Plot to a FIGURE file called "PileUp.Figure"
B) Plot graphics on HP7550 attached to /dev/tty
C) Suppress the plot
Please choose one (* A *): <RETURN>
The minimum density for a one-page plot is 4.0 sequences/100 platen units.
What density do you want (* 4.0 *) ? <RETURN>
What should I call the output file name (* pileup.msf *) ? <RETURN>
Determining pairwise similarity scores...
1 x 2 1.49
1 x 3 1.50
1 x 4 1.50
2 x 3 1.49
2 x 4 1.49
3 x 4 1.50
Aligning...
1 .......-.
2 .......-.
3 .......-.
FIGURE instructions are now being written into pileup.figure.
Total sequences: 4
Alignment length: 149
CPU time: 00.84
Output file:/biox/biocomputing/doelz/pileup.msf
(The output file name would be slightly different but
the procedure is identical on VMS and UNIX).
Display of the Dendrogram Generated by the 'pileup' program
Presentation of the Alignment
pileup.msf{*}
is the correct way to specify a sequence alignment. To
generate a sophisticated print with a consensus sequence and
showing the differences only, use the command
..
"@pileup.msf"
" "
where pileup.msf was the name of the msf file. The command
to be issued for replacement is
Graphic Presentation of Similarity in the Alignment
Schematic Presentation Sequence Similarity
Phylogeny
Creation of a Tree
PAUP-based Methods
Other Packages
Manual Creation of Sequence Alignments
Profiles
Principle
Formats of Sequences
| | file |
| type of file | ending | called as (example)
+-------------------------+----------+-------------------------
| normal sequence file | .seq or |
| | .pep | my.seq
+-------------------------+----------+-------------------------
| file of sequence names | .frg and |
| (from 'lineup', etc.) | .fil | @my.fil
+-------------------------+----------+-------------------------
| multiple sequence files | |
| (from 'pileup') | .msf | my.msf{*}
Profile Generation
Profile Searching
Profile Analysis