Clustal W Help

CLUSTAL W Multiple Sequence Alignment Program

(version 1.7, June 1997)

Clustal W is a major re-write of the multiple alignment program Clustal V

(Ref: Higgins, Bleasby, and Fuchs (1991) CABIOS, 8, 189-191).

The major modifications are described in the paper:

Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994)

CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment

through sequence weighting, position-specific gap penalties and weight matrix

choice. Nucleic Acids Research, 22:4673-4680.

If you have any questions or comments, please contact one of:

Des Higgins E-mail: Higgins@EBI.ac.uk

Toby Gibson E-mail: Gibson@EMBL-Heidelberg.DE

>>HELP 1<< General help for CLUSTAL W

Clustal W is a general purpose multiple alignment program for DNA or proteins.

SEQUENCE INPUT: all sequences must be in 1 file, one after another.

7 formats are automatically recognised: NBRF/PIR, EMBL/SWISSPROT,

Pearson (Fasta), Clustal (*.aln), GCG/MSF (Pileup), GCG9/RSF and GDE flat file.

All non-alphabetic characters (spaces, digits, punctuation marks) are ignored

except "-" which is used to indicate a GAP ("." in GCG/MSF).

To do a MULTIPLE ALIGNMENT on a set of sequences, use item 1 from this menu to

INPUT them; go to menu item 2 to do the multiple alignment.

PROFILE ALIGNMENTS (menu item 3) are used to align 2 alignments. Use this to

add a new sequence to an old alignment, or to use secondary structure to guide

the alignment process. GAPS in the old alignments are indicated using the "-"

character. PROFILES can be input in ANY of the allowed formats; just

use "-" (or "." for MSF/RSF) for each gap position.

PHYLOGENETIC TREES (menu item 4) can be calculated from old alignments (read in

with "-" characters to indicate gaps) OR after a multiple alignment while the

alignment is still in memory.

The program tries to automatically recognise the different file formats used

and to guess whether the sequences are amino acid or nucleotide. This is not

always foolproof.

FASTA and NBRF/PIR formats are recognised by having a ">" as the first

character in the file.

EMBL/Swiss Prot formats are recognised by the letters

ID at the start of the file (the token for the entry name field).

CLUSTAL format is recognised by the word CLUSTAL at the beginning of the file.

GCG/MSF format is recognised by one of the following:

- the word PileUp at the start of the file.

- the word !!AA_MULTIPLE_ALIGNMENT or !!NA_MULTIPLE_ALIGNMENT

at the start of the file.

- the word MSF on the first line of the file, and the characters ..

at the end of this line.

GCG/RSF format is recognised by the word !!RICH_SEQUENCE at the beginning of

the file.
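
The recognition rules above can be sketched as a small Python function. This
is only an illustration of the rules as listed in this help text, not the
program's own code; the function name sniff_format and the returned labels
are hypothetical. GDE flat files are not covered because no rule for them is
given above, and FASTA and NBRF/PIR are not told apart because both are
recognised by the same ">" character.

```python
def sniff_format(text: str) -> str:
    """Guess a sequence file format from the start of its contents."""
    stripped = text.lstrip()
    first_line = stripped.splitlines()[0] if stripped else ""
    if stripped.startswith(">"):
        return "FASTA or NBRF/PIR"
    if stripped.startswith("ID"):
        return "EMBL/SWISS-PROT"
    if stripped.startswith("CLUSTAL"):
        return "CLUSTAL"
    if stripped.startswith("!!RICH_SEQUENCE"):
        return "GCG/RSF"
    msf_words = ("PileUp", "!!AA_MULTIPLE_ALIGNMENT", "!!NA_MULTIPLE_ALIGNMENT")
    if stripped.startswith(msf_words):
        return "GCG/MSF"
    # Third MSF rule: "MSF" on the first line, which ends with ".."
    if "MSF" in first_line and first_line.rstrip().endswith(".."):
        return "GCG/MSF"
    return "unknown"
```

Because the tests look only at the first few characters, a file that merely
happens to start with "ID" or ">" would be misclassified, which is one reason
the recognition is not foolproof.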

If 85% or more of the characters in the sequence are from A,C,G,T,U or N, the

sequence will be assumed to be nucleotide. This works in 97.3% of cases

but watch out!
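
A minimal Python sketch of the 85% rule follows. The function name
looks_like_nucleotide is hypothetical, and the assumption that only alphabetic
characters are counted is taken from the SEQUENCE INPUT notes in HELP 1.

```python
NUCLEOTIDE_CODES = set("ACGTUN")


def looks_like_nucleotide(sequence: str, threshold: float = 0.85) -> bool:
    """Return True if at least `threshold` of the letters are A, C, G, T, U or N."""
    residues = [c.upper() for c in sequence if c.isalpha()]
    if not residues:
        return False
    hits = sum(1 for c in residues if c in NUCLEOTIDE_CODES)
    return hits / len(residues) >= threshold


print(looks_like_nucleotide("ACGTACGTNN--"))  # nucleotide-like input
print(looks_like_nucleotide("MKVLAAGICG"))    # protein-like input
print(looks_like_nucleotide("GATTACA"))       # a peptide made of A, C, G, T letters
```

The last call shows why the guess can fail: a short peptide built only from
alanine, cysteine, glycine and threonine is indistinguishable from DNA under
this rule.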

>>HELP 2<< Help for multiple alignments

If you have already loaded sequences, use menu item 1 to do the complete

multiple alignment. You will be prompted for 2 output files: 1 for the

alignment itself; another to store a dendrogram that describes the similarity

of the sequences to each other.

Multiple alignments are carried out in 3 stages (automatically done from menu

item 1 ...Do complete multiple alignments now):

1) all sequences are compared to each other (pairwise alignments);

2) a dendrogram (like a phylogenetic tree) is constructed, describing the

approximate groupings of the sequences by similarity (stored in a file).

3) the final multiple alignment is carried out, using the dendrogram as a guide.

PAIRWISE ALIGNMENT parameters control the speed/sensitivity of the initial

alignments.

MULTIPLE ALIGNMENT parameters control the gaps in the final multiple alignments.

RESET GAPS (menu item 7) will remove any new gaps introduced into the sequences

during multiple alignment if you wish to change the parameters and try again.

This only takes effect just before you do a second multiple alignment. You

can make phylogenetic trees after alignment whether or not this is ON.

If you turn this OFF, the new gaps are kept even if you do a second multiple

alignment. This allows you to iterate the alignment gradually. Sometimes, the

alignment is improved by a second or third pass.

SCREEN DISPLAY (menu item 8) can be used to send the output alignments to the

screen as well as to the output file.

You can skip the first stages (pairwise alignments; dendrogram) by using an

old dendrogram file (menu item 3); or you can just produce the dendrogram

with no final multiple alignment (menu item 2).

OUTPUT FORMAT: Menu item 9 (format options) allows you to choose from 5

different alignment formats (CLUSTAL, GCG, NBRF/PIR, PHYLIP and GDE).

>>HELP 3<< Help for pairwise alignment parameters

A distance is calculated between every pair of sequences and these are

used to construct the dendrogram which guides the final multiple alignment.

The scores are calculated from separate pairwise alignments. These can be

calculated using 2 methods: dynamic programming (slow but accurate) or by the

method of Wilbur and Lipman (extremely fast but approximate).

You can choose between the 2 alignment methods using menu option 8. The

slow/accurate method is fine for short sequences but will be VERY SLOW

for many (e.g. >20) long (e.g. >1000 residue) sequences.

SLOW/ACCURATE alignment parameters:

These parameters do not have any effect on the speed of the alignments. They

are used to give initial alignments which are then rescored to give percent

identity scores. These % scores are the ones which are displayed on the

screen. The scores are converted to distances for the trees.

1) Gap Open Penalty: the penalty for opening a gap in the alignment.

2) Gap extension penalty: the penalty for extending a gap by 1 residue.

3) Protein weight matrix: the scoring table which describes the similarity

of each amino acid to each other.

4) DNA weight matrix: the scores assigned to matches and mismatches (including

IUB ambiguity codes).
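
The rescoring of a pairwise alignment to a percent identity score, and the
conversion of that score to a distance, can be sketched in Python as below.
Two points are assumptions rather than statements from this help text:
positions where either sequence has a gap are skipped, and the distance is
taken as 1 - (percent identity / 100). The function names are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of identical residues over the positions where neither has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # assumption: gapped positions are not scored
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def identity_to_distance(pid: float) -> float:
    """Assumed conversion: 100% identity gives 0.0, 0% identity gives 1.0."""
    return 1.0 - pid / 100.0


pid = percent_identity("ACGT-A", "ACCTTA")
print(pid, identity_to_distance(pid))
```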

FAST/APPROXIMATE alignment parameters:

These similarity scores are calculated from fast, approximate, global alignments,

which are controlled by 4 parameters. 2 techniques are used to make

these alignments very fast: 1) only exactly matching fragments (k-tuples) are

considered; 2) only the 'best' diagonals (the ones with most k-tuple matches)

are used.

K-TUPLE SIZE: This is the size of exactly matching fragment that is used.

INCREASE for speed (max= 2 for proteins; 4 for DNA), DECREASE for sensitivity.

For longer sequences (e.g. >1000 residues) you may need to increase the default.

GAP PENALTY: This is a penalty for each gap in the fast alignments. It has

little effect on the speed or sensitivity except for extreme values.

TOP DIAGONALS: The number of k-tuple matches on each diagonal (in an imaginary

dot-matrix plot) is calculated. Only the best ones (with most matches) are

used in the alignment. This parameter specifies how many. Decrease for speed;

increase for sensitivity.

WINDOW SIZE: This is the number of diagonals around each of the 'best'

diagonals that will be used. Decrease for speed; increase for sensitivity.
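
How K-TUPLE SIZE, TOP DIAGONALS and WINDOW SIZE interact can be illustrated
with a short Python sketch. It counts exact k-tuple matches per diagonal of
the imaginary dot-matrix plot, keeps the diagonals with the most matches, and
widens each by a window. This is a simplified illustration, not the Wilbur and
Lipman implementation used by the program; the function names and the
diagonal numbering (i - j) are assumptions.

```python
from collections import Counter, defaultdict


def ktuple_diagonal_counts(seq_a: str, seq_b: str, k: int = 2) -> Counter:
    """Count exact k-tuple matches on each diagonal (i - j) of the dot matrix."""
    positions = defaultdict(list)
    for j in range(len(seq_b) - k + 1):
        positions[seq_b[j:j + k]].append(j)
    counts = Counter()
    for i in range(len(seq_a) - k + 1):
        for j in positions.get(seq_a[i:i + k], ()):
            counts[i - j] += 1
    return counts


def best_diagonals(counts: Counter, top_diagonals: int = 5, window: int = 0) -> set:
    """Keep the diagonals with most matches, plus `window` diagonals either side."""
    keep = set()
    for diagonal, _ in counts.most_common(top_diagonals):
        keep.update(range(diagonal - window, diagonal + window + 1))
    return keep


counts = ktuple_diagonal_counts("AAACGT", "ACGT", k=2)
print(counts)
print(best_diagonals(counts, top_diagonals=1, window=1))
```

A larger k gives fewer chance matches and so less work; fewer top diagonals
and a narrower window shrink the region that is searched, which is the
speed/sensitivity trade-off described above.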

>>HELP 4<< Help for multiple alignment parameters

These parameters control the final multiple alignment. This is the core of

the program and the details are complicated. To fully understand the use

of the parameters and the scoring system, you will have to refer to the

documentation.

Each step in the final multiple alignment consists of aligning two alignments

or sequences. This is done progressively, following the branching order in

the GUIDE TREE. The basic parameters to control this are two gap penalties and

the scores for various identical/non-identical residues.

1) and 2) The GAP PENALTIES are set by menu items 1 and 2. These control the

cost of opening up every new gap and the cost of every item in a gap.

Increasing the gap opening penalty will make gaps less frequent. Increasing

the gap extension penalty will make gaps shorter. Terminal gaps are not

penalised.

3) The DELAY DIVERGENT SEQUENCES switch delays the alignment of the most

distantly related sequences until after the most closely related sequences have

been aligned. The setting shows the percent identity level required to delay

the addition of a sequence; sequences that are less identical than this level

to any other sequences will be aligned later.

4) The TRANSITION WEIGHT gives transitions (A <--> G or C <--> T

i.e. purine-purine or pyrimidine-pyrimidine substitutions) a weight between 0

and 1; a weight of zero means that the transitions are scored as mismatches,

while a weight of 1 gives the transitions the match score. For distantly related

DNA sequences, the weight should be near to zero; for closely related sequences

it can be useful to assign a higher score.

5) PROTEIN WEIGHT MATRIX leads to a new menu where you are offered a

choice of weight matrices. The default for proteins is the BLOSUM series of

matrices by Jorja and Steven Henikoff. Note, a series is used! The actual

matrix that is used depends on how similar the sequences being aligned are at

each alignment step. Different matrices work differently at each

evolutionary distance.

6) DNA WEIGHT MATRIX leads to a new menu where a single matrix (not a series)

can be selected. The default is the matrix used by BESTFIT for comparison of

nucleic acid sequences.

Further help is offered in the weight matrix menu.

7) In the weight matrices, you can use negative as well as positive values if

you wish, although the matrix will be automatically adjusted to all positive

scores, unless the NEGATIVE MATRIX option is selected.

8) PROTEIN GAP PARAMETERS displays a menu allowing you to set some Gap Penalty

options which are only used in protein alignments.
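
Two of the parameters above lend themselves to a small worked example in
Python: the gap penalties of items 1) and 2), and the TRANSITION WEIGHT of
item 4). The sketch assumes the simplest reading of the text: a gap costs the
opening penalty plus the extension penalty for every position in it, and a
transition scores the mismatch score plus the weight times the difference
between match and mismatch. The function names, the match score of 1.0 and
the mismatch score of 0.0 are illustrative assumptions, not program defaults.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def gap_cost(length: int, gap_open: float, gap_extend: float) -> float:
    """Cost of one gap: opening penalty plus extension penalty per gap position."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * length


def is_transition(a: str, b: str) -> bool:
    """True for A <--> G or C <--> T substitutions."""
    pair = {a.upper(), b.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


def dna_pair_score(a: str, b: str, transition_weight: float,
                   match: float = 1.0, mismatch: float = 0.0) -> float:
    """Weight 0 scores a transition as a mismatch; weight 1 scores it as a match."""
    if a.upper() == b.upper():
        return match
    if is_transition(a, b):
        return mismatch + transition_weight * (match - mismatch)
    return mismatch


print(gap_cost(3, gap_open=10.0, gap_extend=0.5))
print(dna_pair_score("A", "G", transition_weight=0.5))
print(dna_pair_score("A", "C", transition_weight=0.5))
```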

>>HELP 5<< Help for protein gap parameters.

1) RESIDUE SPECIFIC PENALTIES are amino acid specific gap penalties that reduce

or increase the gap opening penalties at each position in the alignment or

sequence. See the documentation for details. As an example, positions that

are rich in glycine are more likely to have an adjacent gap than positions that

are rich in valine.

2) 3) HYDROPHILIC GAP PENALTIES are used to increase the chances of a gap within

a run (5 or more residues) of hydrophilic amino acids; these are likely to

be loop or random coil regions where gaps are more common. The residues that

are "considered" to be hydrophilic are set by menu item 3.

4) GAP SEPARATION DISTANCE tries to decrease the chances of gaps being

too close to each other. Gaps that are less than this distance apart

are penalised more than other gaps. This does not prevent close gaps;

it makes them less frequent, promoting a block-like appearance of the alignment.

5) END GAP SEPARATION treats end gaps just like internal gaps for the purposes

of avoiding gaps that are too close (set by GAP SEPARATION DISTANCE above).

If you turn this off, end gaps will be ignored for this purpose. This is

useful when you wish to align fragments where the end gaps are not biologically

meaningful.
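
The hydrophilic runs mentioned in item 2) 3) can be located with a short scan
such as the Python sketch below. The residue set "GPSNDQEKR" is an assumed
default (the actual set is whatever has been chosen with menu item 3), and
the function name is hypothetical. The sketch only finds the runs; how the
program then lowers the gap penalty inside them is described in the
documentation and is not reproduced here.

```python
def hydrophilic_runs(sequence: str, hydrophilic: str = "GPSNDQEKR",
                     min_run: int = 5) -> list:
    """Return (start, end) index pairs, end exclusive, of hydrophilic runs."""
    runs = []
    start = None
    # A trailing sentinel that is never hydrophilic closes a run at the end.
    for i, residue in enumerate(sequence.upper() + "#"):
        if residue in hydrophilic:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i))
            start = None
    return runs


print(hydrophilic_runs("MVLSGPSNDQAAAKRE"))
```

In the example the stretch SGPSNDQ (7 residues) qualifies, while the final
KRE (3 residues) is too short to count as a run.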

>>HELP 6<< Help for choosing protein/DNA weight matrices

For protein alignments, you use a weight matrix to determine the similarity of

non-identical amino acids. For example, Tyr aligned with Phe is usually judged

to be 'better' than Tyr aligned with Pro.

There are three 'in-built' series of weight matrices offered. Each consists

of several matrices which work differently at different evolutionary distances.

To see the exact details, read the documentation. Crudely, we store several

matrices in memory, spanning the full range of amino acid distance (from

almost identical sequences to highly divergent ones). For very similar

sequences, it is best to use a strict weight matrix which only gives a high

score to identities and the most favoured conservative substitutions. For

more divergent sequences, it is appropriate to use "softer" matrices which

give a high score to many other frequent substitutions.

1) BLOSUM (Henikoff). These matrices appear to be the best available for

carrying out database similarity (homology) searches. The matrices used are:

Blosum80, 62, 45 and 30.

2) PAM (Dayhoff). These have been extremely widely used since the late '70s.

We use the PAM 120, 160, 250 and 350 matrices.

3) GONNET. These matrices were derived using almost the same

procedure as the Dayhoff one (above) but are much more up to date and are based

on a far larger data set. They appear to be more sensitive than the Dayhoff

series. We use the GONNET 40, 80, 120, 160, 250 and 350 matrices.

We also supply an identity matrix which gives a score of 1.0 to two identical

amino acids and a score of zero otherwise. This matrix is not very useful.

Alternatively, you can read in your own (just one matrix, not a series).

A new matrix can be read from a file on disk, if the filename consists only

of lower case characters. The values in the new weight matrix must be integers

and the scores should be similarities. You can use negative as well as positive

values if you wish, although the matrix will be automatically adjusted to all

positive scores.

For DNA, a single matrix (not a series) is used. Two hard-coded matrices are

available:

1) IUB. This is the default scoring matrix used by BESTFIT for the comparison

of nucleic acid sequences. X's and N's are treated as matches to any IUB

ambiguity symbol. All matches score 1.9; all mismatches for IUB symbols score 0.

2) CLUSTALW(1.6). The previous system used by ClustalW, in which matches score

1.0 and mismatches score 0. All matches for IUB symbols also score 0.

INPUT FORMAT: The format used for a new matrix is the same as that of the BLAST program.

Any lines beginning with a # character are assumed to be comments. The first

non-comment line should contain a list of amino acids in any order, using the

1 letter code, followed by a * character. This should be followed by a square

matrix of integer scores, with one row and one column for each amino acid. The

last row and column of the matrix (corresponding to the * character) contain

the minimum score over the whole matrix.
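
A reader for this matrix layout might look like the following Python sketch.
It follows the description above (comment lines, a header line of symbols
ending in *, then a square block of integers). As an assumption beyond the
text, it also accepts rows that start with the row's own symbol, since
BLAST-style files are commonly written that way. The toy matrix and the
function name are hypothetical.

```python
EXAMPLE_MATRIX = """\
# toy two-residue matrix, for illustration only
 A  R  *
 4 -1 -1
-1  5 -1
-1 -1 -1
"""


def parse_blast_matrix(text: str):
    """Return (symbols, scores) where scores maps (row, column) to an integer."""
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.lstrip().startswith("#")]
    symbols = lines[0].split()
    rows = lines[1:]
    if len(rows) != len(symbols):
        raise ValueError("matrix must have one row per header symbol")
    scores = {}
    for row_symbol, line in zip(symbols, rows):
        fields = line.split()
        if len(fields) == len(symbols) + 1:
            fields = fields[1:]  # assumption: drop an optional leading row label
        if len(fields) != len(symbols):
            raise ValueError("matrix must be square")
        for col_symbol, value in zip(symbols, fields):
            scores[(row_symbol, col_symbol)] = int(value)
    return symbols, scores


symbols, scores = parse_blast_matrix(EXAMPLE_MATRIX)
print(symbols, scores[("A", "R")], scores[("*", "*")])
```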

>>HELP 7<< Help for output format options.

Five output formats are offered. You can choose more than one (or all 5 if

you wish).

CLUSTAL format output is a self explanatory alignment format. It shows the

sequences aligned in blocks. It can be read in again at a later date to

(for example) calculate a phylogenetic tree or add a new sequence with a

profile alignment.

GCG output can be used by any of the GCG programs that can work on multiple

alignments (e.g. PRETTY, PROFILEMAKE, PLOTALIGN). It is the same as the GCG

.msf format files (multiple sequence file); new in version 7 of GCG.

PHYLIP format output can be used for input to the PHYLIP package of Joe

Felsenstein. This is an extremely widely used package for doing every

imaginable form of phylogenetic analysis (MUCH more than the modest introduction

offered by this program).

NBRF/PIR: this is the same as the standard PIR format with ONE ADDITION. Gap

characters "-" are used to indicate the positions of gaps in the multiple

alignment. These files can be re-used as input in any part of clustal that

allows sequences (or alignments or profiles) to be read in.

GDE: this format is used by the GDE package of Steven Smith.

GDE OUTPUT CASE: sequences in GDE format may be written in either upper or

lower case.

CLUSTALW SEQUENCE NUMBERS: residue numbers may be added to the end of the

alignment lines in clustalw format.

OUTPUT ORDER is used to control the order of the sequences in the output

alignments. By default, the order corresponds to the order in which the

sequences were aligned (from the guide tree/dendrogram), thus automatically

grouping closely related sequences. This switch can be used to set the order

to the same as the input file.

PARAMETER OUTPUT: This option allows you to save all your parameter settings

in a parameter file. This file can be used subsequently to rerun ClustalW

using the same parameters.

>>HELP 8<< Help for secondary structure options

The use of secondary structure-based penalties has been shown to improve

the accuracy of multiple alignment. Therefore CLUSTAL W now allows gap penalty

masks to be supplied with the input sequences. The masks work by raising gap

penalties in specified regions (typically secondary structure elements) so that

gaps are preferentially opened in the less well conserved regions (typically

surface loops).

Options 1 and 2 control whether the input secondary structure information

or gap penalty masks will be used.

Option 3 controls whether the secondary structure and gap penalty masks should

be included in the output alignment.

Options 4 and 5 provide the value for raising the gap penalty at core Alpha

Helical (A) and Beta Strand (B) residues. In CLUSTAL format, capital residues

denote the A and B core structure notation. Basic gap penalties are multiplied

by the amount specified.

Option 6 provides the value for the gap penalty in Loops. By default this

penalty is not raised. In CLUSTAL format, loops are specified by "." in the

secondary structure notation.

Option 7 provides the value for setting the gap penalty at the ends of

secondary structures. Ends of secondary structures are observed to grow

and/or shrink in related structures. Therefore by default these are given

intermediate values, lower than the core penalties. All secondary structure

read in as lower case in CLUSTAL format gets the reduced terminal penalty.

Options 8 and 9 specify the range of structure termini for the intermediate

penalties. In the alignment output, these are indicated as lower case.

For Alpha Helices, by default, the range spans the end helical turn. For

Beta Strands, the default range spans the end residue and the adjacent loop

residue, since sequence conservation often extends beyond the actual H-bonded

Beta Strand.

CLUSTAL W can read the masks from SWISS-PROT, CLUSTAL or GDE format

input files. For many 3-D protein structures, secondary structure information

is recorded in the feature tables of SWISS-PROT database entries. You

should always check that the assignments are correct - some are quite

inaccurate. CLUSTAL W looks for SWISS-PROT HELIX and STRAND assignments e.g.

FT HELIX 100 115

FT STRAND 118 119

The structure and penalty masks can also be read from CLUSTAL alignment format

as comment lines beginning "!SS_" or "!GM_" e.g.

!SS_HBA_HUMA    ..aaaAAAAAAAAAAaaa.aaaAAAAAAAAAAaaaaaaAaaa.........aaaAAAAAA
!GM_HBA_HUMA    112224444444444222122244444444442222224222111111111222444444
HBA_HUMA        VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK

Note that the mask itself is a set of numbers between 1 and 9 each of which is

assigned to the residue(s) in the same column below.
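
The relation between the two kinds of mask in the example above can be
checked with a few lines of Python. The values used here (4 for core
residues, 2 for structure termini, 1 for loops) are read off the example
itself and are assumptions about the settings in force, not fixed program
constants; the function names are hypothetical.

```python
EXAMPLE_LINES = [
    "!SS_HBA_HUMA ..aaaAAAAAAAAAAaaa.aaaAAAAAAAAAAaaaaaaAaaa.........aaaAAAAAA",
    "!GM_HBA_HUMA 112224444444444222122244444444442222224222111111111222444444",
    "HBA_HUMA VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK",
]


def read_masks(lines):
    """Collect !SS_ and !GM_ comment lines, keyed by (kind, sequence name)."""
    masks = {}
    for line in lines:
        if line.startswith(("!SS_", "!GM_")):
            parts = line.split(None, 1)
            if len(parts) == 2:
                tag, mask = parts
                masks[(tag[1:3], tag[4:])] = mask.strip()
    return masks


def structure_to_gap_mask(ss_mask, core=4, terminus=2, loop=1):
    """Map upper case to the core value, lower case to termini, '.' to loops."""
    digits = []
    for symbol in ss_mask:
        if symbol == ".":
            digits.append(loop)
        elif symbol.isupper():
            digits.append(core)
        elif symbol.islower():
            digits.append(terminus)
        else:
            raise ValueError(f"unexpected mask symbol: {symbol!r}")
    return "".join(str(d) for d in digits)


masks = read_masks(EXAMPLE_LINES)
derived = structure_to_gap_mask(masks[("SS", "HBA_HUMA")])
print(derived == masks[("GM", "HBA_HUMA")])
```

With these values the structure mask of the example translates exactly into
the penalty mask shown beneath it.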

In GDE flat file format, the masks are specified as text and the names

must begin with SS_ or GM_.

Either a structure or penalty mask or both may be used. If both are included

in an alignment, the user will be asked which is to be used.

>>HELP 9<< Help for profile and structure alignments

By PROFILE ALIGNMENT, we mean alignment using existing alignments. Profile

alignments allow you to store alignments of your favourite sequences

and add new sequences to them in small bunches at a time. A profile

is simply an alignment of one or more sequences (e.g. an alignment output

file from CLUSTAL W). Each input can be a single sequence. One or both sets

of input sequences may include secondary structure assignments or gap

penalty masks to guide the alignment.

The profiles can be in any of the allowed input formats with "-" characters

used to specify gaps (except for MSF/RSF where "." is used).

You have to specify the 2 profiles by choosing menu items 1 and 2 and giving

2 file names. Then Menu item 3 will align the 2 profiles to each other.

Secondary structure masks in either profile can be used to guide the alignment.

Menu item 4 will take the sequences in the second profile and align them to

the first profile, 1 at a time. This is useful to add some new sequences to

an existing alignment, or to align a set of sequences to a known structure.

In this case, the second profile need not be pre-aligned.

The alignment parameters can be set using menu items 5, 6 and 7.

These are EXACTLY the same parameters as used by the general, automatic

multiple alignment procedure. The general multiple alignment procedure is

simply a series of profile alignments. Carrying out a series of profile

alignments on larger and larger groups of sequences allows you to manually

build up a complete alignment, if necessary editing intermediate alignments.

SECONDARY STRUCTURE OPTIONS. Menu Option 0 allows you to set secondary structure

parameters. If a solved structure is available, it can be used to guide the

alignment by raising gap penalties within secondary structure elements, so

that gaps will preferentially be inserted into unstructured surface loop

regions. Alternatively, a user-specified gap penalty mask can be supplied for

a similar purpose.

A gap penalty mask is a series of numbers between 1 and 9, one per position in

the alignment. Each number specifies how much the gap opening penalty is to be

raised at that position (raised by multiplying the basic gap opening penalty

by the number) i.e. a mask figure of 1 at a position means no change

in gap opening penalty; a figure of 4 means that the gap opening penalty is

four times greater at that position, making gaps 4 times harder to open.
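
As a small worked example of the multiplication described above, the Python
sketch below turns a mask string into per-position gap opening penalties. The
function name and the basic penalty of 10.0 are illustrative assumptions.

```python
def position_gap_open_penalties(mask: str, base_gap_open: float) -> list:
    """Multiply the basic gap opening penalty by the mask figure at each position."""
    penalties = []
    for figure in mask:
        if figure not in "123456789":
            raise ValueError(f"mask figures must be 1 to 9, got {figure!r}")
        penalties.append(base_gap_open * int(figure))
    return penalties


print(position_gap_open_penalties("1124", base_gap_open=10.0))
```

A figure of 1 leaves the penalty at 10.0, while a figure of 4 raises it to
40.0 at that position.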

The format for gap penalty masks and secondary structure masks is explained

in the help under option 0 (secondary structure options).

>>HELP 10<< Help for phylogenetic trees

1) Before calculating a tree, you must have an ALIGNMENT in memory. This can be

read in from a file in any of the allowed formats, or it can be the result of a

full multiple alignment that is still in memory. Remember YOU MUST ALIGN THE

SEQUENCES FIRST!!!!

The method used is the NJ (Neighbour Joining) method of Saitou and Nei. First

you calculate distances (percent divergence) between all pairs of sequences from

a multiple alignment; second you apply the NJ method to the distance matrix.

2) EXCLUDE POSITIONS WITH GAPS? With this option, any alignment positions

where ANY of the sequences have a gap will be ignored. This means that 'like'

will be compared to 'like' in all distances. It also automatically throws

away the most ambiguous parts of the alignment, which are concentrated around

gaps (usually). The disadvantage is that you may throw away much of

the data if there are many gaps.

3) CORRECT FOR MULTIPLE SUBSTITUTIONS? For small divergence (say <10%) this

option makes no difference. For greater divergence, this option corrects

for the fact that observed distances underestimate actual evolutionary distances.

This is because, as sequences diverge, more than one substitution will

happen at many sites. However, you only see one difference when you look at the

present day sequences. Therefore, this option has the effect of stretching

branch lengths in trees (especially long branches). The corrections used here

(for DNA or proteins) are both due to Motoo Kimura. See the documentation for

details.

For VERY divergent sequences, the distances cannot be reliably

corrected. You will be warned if this happens. Even if none of the distances

in a data set exceed the reliable threshold, if you bootstrap the data,

some of the bootstrap distances may randomly exceed the safe limit.

4) To calculate a tree, use option 4 (DRAW TREE NOW). This gives an UNROOTED

tree and all branch lengths. The root of the tree can only be inferred by

using an outgroup (a sequence that you are certain branches at the outside

of the tree .... certain on biological grounds) OR if you assume a degree

of constancy in the 'molecular clock', you can place the root in the 'middle'

of the tree (roughly equidistant from all tips).

5) BOOTSTRAPPING is a method for deriving confidence values for the groupings in

a tree (first adapted for trees by Joe Felsenstein). It involves making N

random samples of sites from the alignment (N should be LARGE, e.g. 500 - 1000);

drawing N trees (1 from each sample) and counting how many times each grouping

from the original tree occurs in the sample trees. You must supply a seed

number for the random number generator. Different runs with the same seed

will give the same answer. See the documentation for details.

6) OUTPUT FORMATS: three different formats are allowed. None of these

displays the tree visually. You must make the tree yourself (on paper)

using the results OR get the PHYLIP package and use the tree drawing facilities

there. (Get the PHYLIP package anyway if you are interested in trees).
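
Options 2), 3) and 5) above can be illustrated together in a short Python
sketch: dropping alignment positions that contain a gap, computing the
observed fraction of differing positions, correcting it for multiple
substitutions, and drawing one bootstrap sample of sites. The correction
shown is Kimura's protein formula K = -ln(1 - D - 0.2*D*D); that the program
applies exactly this expression is an assumption here, so see the
documentation for the details. All function names are hypothetical.

```python
import math
import random


def gap_free_columns(alignment):
    """Indices of alignment positions where no sequence has a gap."""
    return [i for i in range(len(alignment[0]))
            if all(seq[i] != "-" for seq in alignment)]


def fraction_different(seq_a, seq_b, columns):
    """Observed fraction of the chosen positions at which two sequences differ."""
    if not columns:
        raise ValueError("no positions left to compare")
    return sum(1 for i in columns if seq_a[i] != seq_b[i]) / len(columns)


def kimura_protein_distance(observed):
    """Kimura's correction for proteins; fails for very divergent sequences."""
    inner = 1.0 - observed - 0.2 * observed * observed
    if inner <= 0.0:
        raise ValueError("distance too large to be corrected reliably")
    return -math.log(inner)


def bootstrap_columns(columns, seed):
    """One bootstrap sample: positions drawn with replacement, seeded."""
    rng = random.Random(seed)
    return [rng.choice(columns) for _ in columns]


alignment = ["ACG-TA", "ACGTTA", "A-GTCA"]
columns = gap_free_columns(alignment)
observed = fraction_different(alignment[0], alignment[2], columns)
print(columns, observed, kimura_protein_distance(observed))
print(bootstrap_columns(columns, seed=111))
```

The corrected value is larger than the observed one, which is the
"stretching" of branch lengths described in 3), and calling bootstrap_columns
twice with the same seed returns the same sample, as described in 5).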

>>HELP 11<< Help for format of phylogenetic tree output

Three output formats are offered: 1) Clustal, 2) Phylip, 3) Just the distances.

None of these formats displays the results graphically. To see a graphic

representation, get the PHYLIP package and use format 2) below. It can be

imported into the PHYLIP programs RETREE, DRAWTREE and DRAWGRAM and displayed

graphically.

1) Clustal format output.

This format is verbose and lists all of the distances between the sequences

and the number of alignment positions used for each. The tree is described

at the end of the file. It lists the sequences that are joined at each

alignment step and the branch lengths. After two sequences are joined, it is

referred to later as a NODE. The number of a NODE is the number of the

lowest sequence in that NODE.

2) Phylip format output.

This format is the New Hampshire format, used by many phylogenetic analysis

packages. It consists of a series of nested parentheses, describing the

branching order, with the sequence names and branch lengths. It can

be used by the RETREE, DRAWGRAM and DRAWTREE programs of the PHYLIP

package to see the trees graphically. This is the same format used during

multiple alignment for the guide trees.

Some other packages that can read and display New Hampshire format are

TreeTool, TreeView, Phylowin and NJPlot.

3) The distances only.

This format just outputs a matrix of all the pairwise distances in a format

that can be used by the Phylip package. It used to be useful when one

could not produce distances from protein sequences in the Phylip package but

is now redundant (Protdist of Phylip 3.5 now does this).

4) TOGGLE PHYLIP BOOTSTRAP POSITIONS

By default, the bootstrap values are placed on the nodes of the phylip format

output tree. This is inaccurate as the bootstrap values should be associated

with the tree branches and not the nodes. However, this format can be read and

displayed by TreeTool, TreeView and Phylowin. An option is available to

correctly place the bootstrap values on the branches with which they are

associated.
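
The nested-parentheses layout of format 2) can be illustrated with a small
Python sketch that writes a tree in New Hampshire format. The tree
representation (tuples of a name or a list of children, plus a branch length)
and the five-decimal branch lengths are choices made for this example only;
they are not taken from the program's output.

```python
def _format_node(node) -> str:
    """A node is (name, length) for a leaf or (children, length) otherwise."""
    label_or_children, length = node
    if isinstance(label_or_children, str):
        text = label_or_children
    else:
        text = "(" + ",".join(_format_node(c) for c in label_or_children) + ")"
    if length is not None:
        text += f":{length:.5f}"
    return text


def to_newick(root) -> str:
    """Write the whole tree, terminated by a semicolon."""
    return _format_node(root) + ";"


# An unrooted four-sequence tree: three branches meet at the top level.
tree = ([("A", 0.1), ("B", 0.2), ([("C", 0.3), ("D", 0.4)], 0.05)], None)
print(to_newick(tree))
```

The three-way split at the outermost level is a common way to write an
unrooted tree in this format.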

>>HELP 12<< Help for choosing protein/DNA weight matrices