Section 5.4: Import of Sequences to the GCG Package

To use sequence data on the computer, you need to know what a sequence format is. After you have transferred a sequence file to your computer, you may need to reformat the sequence before it will work with a given sequence analysis package. This section explains how to handle the most common cases with the GCG package.


Subsection 5.4.1

Sequence Formats

Briefly, a sequence format is a convention that defines which part of a data file is interpreted as sequence and which part as additional data. Depending on the software package used for sequence analysis, some of these additional data matter for processing. For example, the GCG sequence format records the type of the sequence data (protein or DNA). Other elements record the date or add a line stating the length of the file. Because of these details, a given sequence format is difficult to maintain in a normal text editor; usually, programs dedicated to sequence editing take care of them.

Plain Text Sequence Format

The plain text sequence format is typically produced by word processors (saved as a text file with line breaks) or by electronic sources such as mail messages. A plain text sequence file should contain only sequence data; files from these sources may therefore need editing to strip all additional data.
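The stripping step can be sketched in a few lines of Python. This is a minimal illustration, not part of the GCG package: the function name `extract_sequence` is hypothetical, and the sketch assumes that any descriptive text (mail headers, titles) has already been removed by hand, so that only position numbers, spaces, and line breaks remain to be stripped.

```python
import re


def extract_sequence(text: str) -> str:
    """Return only the sequence letters (and a '*' terminator, if present).

    Assumption: the input holds sequence lines only; descriptive text
    must be removed beforehand, because its letters would be kept too.
    """
    # Drop position numbers, spaces, line breaks, and punctuation.
    return re.sub(r"[^A-Za-z*]", "", text)


if __name__ == "__main__":
    pasted = "  1 ACGT ACGT\n 11 TTGA\n"
    print(extract_sequence(pasted))
```

The result is a single unbroken string of residues, which can then be saved as a plain text file.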

Sequence Formats Ready to Use with Sequence Analysis Packages

Sequence formats ready to use with sequence analysis packages are either generated within a sequence analysis package or come from the original databases, obtained either from a local installation or through network retrieval tools such as electronic mail or the World-Wide Web. Examples:

 
EMBL:

ID  (entry code)
... (other fields) ...
SQ  (then the sequence)
//

GenBank:

LOCUS     (entry code)
..........(other fields) ...
ORIGIN ...(then the sequence)
//

PIR:

>P1; (entry code)
... (one line of text) ...
(sequence, finished by a *)
(optionally, more text)
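
The three layouts above can be told apart by their first line. The following Python sketch shows one way to do that; the function name `guess_format` and the returned labels are hypothetical, and the check is a heuristic on the first non-blank line only, not a validation of the whole entry.

```python
import re


def guess_format(text: str) -> str:
    """Guess the database format of an entry from its first non-blank line."""
    for line in text.splitlines():
        if not line.strip():
            continue  # skip leading blank lines
        if line.startswith("ID "):
            return "embl"      # ID ... SQ ... //
        if line.startswith("LOCUS"):
            return "genbank"   # LOCUS ... ORIGIN ... //
        if re.match(r">[A-Z0-9]{2};", line):
            return "pir"       # >P1; entry code, sequence ends with *
        return "unknown"       # first non-blank line matched nothing
    return "unknown"           # empty input


if __name__ == "__main__":
    print(guess_format("LOCUS     (entry code)\nORIGIN\n//\n"))
```

Knowing the format tells you which of the conversion programs in the next subsection to try.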
  


Subsection 5.4.2

Reformatting Sequences

Refer to the section "Transfer of Data" for details on how to copy data from and to other computers.

Reformatting from other Packages

Find out what the format of the sequence is, edit it manually (if required) and try one of the programs of the GCG package. To get information about GCG's reformatting programs, use

% genmanual sequence_exchange

The following selection of programs should cover most of your needs.

NOTE: When a sequence is reformatted, the output file takes its name from the sequence entry in the originating database, not from the input file. For example, if you saved an export from electronic mail, WWW, ENTREZ, or a similar source under the file name 'test.seq', and the entry obtained from EMBL is M12345, reformatting produces a file called 'm12345.embl'; the file name 'test.seq' is not retained.

 
from GENBANK (NCBI)

% fromgenbank

from EMBL (EBI)

% fromembl

from the IG suite package

% fromig

from programs of PIR (e.g., ATLAS)

% frompir

from ASCII files (e.g., electronic mail, or STADEN package)

% fromstaden

If errors occur because lines are too long, first run

% chopup
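
To make the "lines are too long" problem concrete, the following Python sketch breaks over-long lines into fixed-width pieces. It only illustrates the idea and is not how 'chopup' itself is implemented; the function name `chop_lines` and the default width of 60 characters are assumptions chosen for the example.

```python
def chop_lines(text: str, width: int = 60) -> str:
    """Split every line longer than `width` into pieces of at most `width`."""
    out = []
    for line in text.splitlines():
        if len(line) <= width:
            out.append(line)
        else:
            # Cut the long line into consecutive fixed-width slices.
            out.extend(line[i:i + width] for i in range(0, len(line), width))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(chop_lines("ACGTACGTAC", width=4), end="")
```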

Reformatting from Established GCG Sequences

The program 'reformat' converts between the various GCG-type formats and also helps when sequences are corrupted (checksum changed). To get information on this program, use

% genhelp reformat

or

% reformat -check

(The sequence of exercise 1 must be treated this way).
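
To see why editing a sequence by hand changes its checksum, consider the following Python sketch. It uses the scheme commonly described for GCG checksums (each character code is multiplied by a position weight that cycles from 1 to 57, and the sum is taken modulo 10000). That scheme is not stated in this section, so treat it as an assumption and rely on 'reformat' for the authoritative value; the function name `gcg_checksum` is hypothetical.

```python
def gcg_checksum(sequence: str) -> int:
    """Checksum sketch: position weights cycle 1..57, result is mod 10000.

    Assumption: this follows the commonly described GCG scheme; it is
    not taken from the GCG documentation itself.
    """
    weight = 0
    total = 0
    for char in sequence:
        weight += 1
        total += weight * ord(char.upper())  # case does not matter
        if weight == 57:
            weight = 0  # restart the weight cycle
    return total % 10000


if __name__ == "__main__":
    print(gcg_checksum("ACGT"), gcg_checksum("ACGA"))
```

Because every position contributes with its own weight, changing, inserting, or deleting a single residue alters the result, which is how a stored checksum reveals that a file was modified.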

Reformatting from "Unknowns"

A plain text file (only sequence data) is a good place to start. Use your text editor to create such a file. To convert the file to the GCG sequence format, put two periods (..) at the beginning of the text. Then, use

% reformat

to obtain the final GCG-type format.
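
The preparation step described above can also be scripted. The Python sketch below puts the two periods in front of plain sequence text; the function name `add_gcg_divider` is hypothetical, and placing the periods on a line of their own is an assumption made for readability. The resulting text still has to be saved to a file and passed through 'reformat'.

```python
def add_gcg_divider(sequence_text: str) -> str:
    """Put two periods (..) at the beginning of plain sequence text."""
    if sequence_text.lstrip().startswith(".."):
        return sequence_text  # divider already present, leave unchanged
    # Assumption: the divider is written on its own first line.
    return "..\n" + sequence_text


if __name__ == "__main__":
    print(add_gcg_divider("ACGTACGTTTGA\n"), end="")
```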

