To use sequence data on the computer, you need to know what a sequence
format is. After you have transferred a sequence file to your computer,
you may need to reformat the sequence so that it works with a given
sequence analysis package. This section explains how to do this in most
cases, using the GCG package.
Briefly, a sequence format is a convention that defines which part of a
data file is interpreted as sequence and which part as additional data.
Depending on the software package used for sequence analysis, some of
these additional data are important for processing. For example, the GCG
sequence format records the type of the sequence data (protein or DNA).
Other elements set the date, or log a line containing the length of the
file. A given sequence format is therefore difficult to maintain by hand
in a normal text editor, and computer programs dedicated to sequence
editing will usually deal with the details.
Plain Text Sequence Format
The plain text sequence format is typically generated by word processors
(saved as a text file with line breaks) or by electronic sources such as
mail messages. A plain text sequence file must contain only sequence
data, so it may need editing to strip all additional data.
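The stripping step can be sketched in Python. This is only an illustration, not part of the GCG package: the function name `clean_plain_sequence` is hypothetical, and the sketch assumes that any header or comment lines have already been deleted by hand, so that only position numbers, blanks, and line breaks remain to be removed.

```python
import re


def clean_plain_sequence(text: str) -> str:
    """Return only the sequence letters of `text`, upper-cased.

    Assumption: the text holds sequence lines only (no header lines),
    possibly decorated with position numbers and spacing.
    """
    # Drop everything that is not a letter: digits, blanks, line breaks.
    return re.sub(r"[^A-Za-z]", "", text).upper()


raw = """   1 acgtacgtac gtacgtacgt
  21 ttgacc
"""
print(clean_plain_sequence(raw))
```

Because header lines also contain letters, the sketch cannot tell them apart from sequence; they still have to be removed in a text editor first.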
Sequence Formats Ready to Use with Sequence Analysis Packages
Sequence formats ready to use with sequence analysis packages are either
generated within a sequence analysis package or come from the original
databases. Database entries can be obtained either from a local
installation or through network retrieval tools, such as electronic mail
or the World-Wide Web. Examples are listed under Subsection 5.4.1
Sequence Formats.
Refer to the section "Transfer of Data" for details on how to copy data
from and to other computers.
Reformatting from other Packages
Find out what the format of the sequence is, edit it manually (if
required), and try one of the programs of the GCG package.
To get information about GCG's reformatting programs, use
% genmanual sequence_exchange
The following selection of programs should cover most of your needs.
NOTE: When reformatting a sequence, the sequence name of the original
sequence is adopted. The original file name is replaced by the name of
the corresponding sequence in the originating database. For example, if
you used the file name 'test.seq' in an export from electronic mail,
WWW, ENTREZ, or similar, and the entry obtained from EMBL is M12345, the
reformatting will result in a file called 'm12345.embl' and will not
retain the file name used before.
% fromgenbank
% fromembl
% fromig
% frompir
% fromstaden
If errors occur (because lines are too long), first use
% chopup
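To make the "lines are too long" problem concrete, the following Python sketch breaks over-long lines into shorter ones. It is not the GCG 'chopup' program and does not claim to reproduce its behaviour; the function name `chop_lines` and the default width of 60 characters are assumptions made for illustration.

```python
def chop_lines(text: str, width: int = 60) -> str:
    """Split every line longer than `width` into pieces of at most `width`."""
    pieces = []
    for line in text.splitlines():
        if not line:
            # Preserve empty lines as they are.
            pieces.append(line)
        for start in range(0, len(line), width):
            pieces.append(line[start:start + width])
    return "\n".join(pieces) + "\n"


long_record = "A" * 130
for piece in chop_lines(long_record).splitlines():
    print(len(piece))
```

A file prepared this way keeps the same characters in the same order; only the line breaks change.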
Reformatting from Established GCG Sequences
The program 'reformat' converts from and to the various GCG-type
formats and also helps if sequences are corrupted (checksum changed).
To get information on this program, use
% genhelp reformat
or
% reformat -check
(The sequence of exercise 1 must be treated this way.)
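Why a hand edit changes the checksum can be shown with a small Python sketch. The formula below is the GCG-style checksum as it is commonly reimplemented in open-source tools (position weights cycling from 1 to 57, result modulo 10000). Treat it as an assumption rather than the official GCG definition; the function name `gcg_checksum` is hypothetical.

```python
def gcg_checksum(sequence: str) -> int:
    """Position-weighted checksum in the range 0..9999 (assumed formula)."""
    total = 0
    for index, symbol in enumerate(sequence):
        weight = (index % 57) + 1  # weights run 1..57, then start over
        total += weight * ord(symbol.upper())
    return total % 10000


original = "ACGTACGT"
edited = "ACGTTCGT"  # one base changed by hand
print(gcg_checksum(original), gcg_checksum(edited))
```

The two values differ, so a checksum stored in the file no longer matches the edited sequence. This is the situation that 'reformat' is meant to repair.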
Reformatting from "Unknowns"
A plain text file (only sequence data) is a good place to start. Use
your text editor to create such a file. To convert the file to the GCG
sequence format, put two periods (..) at the beginning of the text.
Then, use
% reformat
to obtain the final GCG-type format.
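The manual step of adding the two periods can also be scripted. The Python sketch below prepends a line holding only '..' to plain sequence text. The function name `add_gcg_divider` is hypothetical, and placing the periods on a line of their own is an assumption about what "at the beginning of the text" means. The result still has to be passed through 'reformat' as described above.

```python
def add_gcg_divider(text: str) -> str:
    """Prepend a '..' line unless the text already starts with one."""
    if text.lstrip().startswith(".."):
        return text
    return "..\n" + text


plain = "ACGTACGTAC\nGTACGTTTGA\n"
print(add_gcg_divider(plain), end="")
```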
Subsection 5.4.1 Sequence Formats
EMBL:
ID (entry code)
... (other fields) ...
SQ (then the sequence)
//
GenBank:
LOCUS (entry code)
..........(other fields) ...
ORIGIN ...(then the sequence)
//
PIR:
>P1; (entry code)
... (one line of text) ...
(sequence, finished by a *)
(optionally, more text)
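The leading keywords in these layouts are enough for a rough first guess at a file's format. The Python sketch below checks only the first non-blank line for 'ID', 'LOCUS', or '>P1;', and otherwise looks for a line ending in two periods as a hint of a GCG-type file. The function name `guess_format` and the returned labels are hypothetical, and the test is deliberately crude: it does not validate the file.

```python
def guess_format(text: str) -> str:
    """Guess a sequence format from leading keywords (rough heuristic)."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return "unknown"
    first = lines[0]
    if first.startswith("ID "):
        return "EMBL"
    if first.startswith("LOCUS"):
        return "GenBank"
    if first.startswith(">P1;"):
        return "PIR"
    # Assumption: a line ending in '..' marks the start of GCG sequence data.
    if any(line.rstrip().endswith("..") for line in lines):
        return "GCG"
    return "plain text or unknown"


sample = "LOCUS       M12345\nORIGIN\n//\n"
print(guess_format(sample))
```

A guess of this kind only helps to pick the right reformatting program; the file may still need manual editing before conversion.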
Subsection 5.4.2 Reformatting Sequences
fromgenbank: from GENBANK (NCBI)
fromembl: from EMBL (EBI)
fromig: from the IG suite package
frompir: from programs of PIR (e.g., ATLAS)
fromstaden: from ASCII files (e.g., electronic mail, or STADEN package)