DOTPLOT
Version 1.0
for Windows
By Ramin Charles Nakisa
Hardware
~~~~~~~~
* Anything that runs Windows ie. 80n86 where n > 1.
* Maths coprocessor not required, but preferable!
Files
~~~~~
a2                 2588  Rat nAChR alpha-2 subunit
a3                 2382  Rat nAChR alpha-3 subunit
charge   mat        780  Charge comparison matrix
dna35    mat        780  DNA +3/-5 comparison matrix
dna53    mat        780  DNA +5/-3 comparison matrix
dotplot  exe      53248  The business!
dotplot  txt      12258  The file you're reading now
egfr               5897  Human EGF receptor
pam250   mat        780  PAM250 protein comparison matrix
readseq  exe      70625  Quickwin version of Uncle Don's Readseq
       12 file(s)     150118 bytes
Oooooh, A New Program, I Want To Try It NOW!
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For fast satisfaction, try comparing two neuronal nicotinic
acetylcholine receptor subunit sequences from SWISS-PROT, saved under the
filenames a2 and a3. The interface is like any other Windows program.
Use the File popup menu to load the horizontal sequence. You get a
dialog box listing all the files in the current directory. Select a2,
and the program shows a SeqInfo dialog box, which tells you a little
bit about the sequence you just loaded, i.e. its name, length, sequence
format and whether it's DNA, RNA or protein. Click OK. Then repeat for
the vertical sequence, choosing a3.
As in the MSDOS version, there are two ways of calculating the dotplot,
described in detail below. For a quick fix, use the Identities option
on the Draw popup menu. The default parameters are a window size of 10
and a threshold of 6. This gives a diagonal line across your screen.
Try tweaking the dotplot by clicking on Parameters and setting the
threshold to 1. Pretty, isn't it? You can even see the three
transmembrane regions in the middle of the protein and the final
transmembrane at the C-terminus.
Try to find the internal repeats in a human epidermal growth factor
receptor. To do this, load the sequence called egfr as both the
horizontal and the vertical sequence. Then set the parameters as
window=20 and threshold=1. Load the PAM250.MAT score matrix using the
File popup menu "Open Matrix" option.
The feature table gives:
    FT   REPEAT       75    300       APPROXIMATE.
    FT   REPEAT      390    600       APPROXIMATE.
Window Size and Stringency
~~~~~~~~~~~~~~~~~~~~~~~~~~
The program will prompt you for a window size and a stringency. For the
simplest case, where the program puts a dot on the screen for every
identity, the window size is one and the stringency is one. This will
be very NOISY, as can be seen in this dotplot of two well-known
sequences.
      C A P T A I N K I R K
    C *                     C
    A   *     *             A
    P     *                 P
    T       *               T
    A   *     *             A
    I           *     *     I
    N             *         N
    N             *         N
    E                       E
    M                       M
    O                       O
      C A P T A I N K I R K
The real homology we are looking for is CAPTAIN, but there are hits off
this main `diagonal'. We get around this problem by using a window,
where for each diagonal the number of hits within the window must reach a
certain threshold (or stringency). Here is the dotplot above with a window
size of 2 and a stringency of 2.
      C A P T A I N K I R K
    C *                     C
    A   *                   A
    P     *                 P
    T       *               T
    A         *             A
    I           *           I
    N             *         N
    N                       N
    E                       E
    M                       M
    O                       O
      C A P T A I N K I R K
The noise is gone! The same applies to dotplots using a score matrix,
that is, the noise decreases for increasing window size and stringency, but
eventually the signal decreases too. Experiment.
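The windowing idea can be sketched in C. This is only an illustration,
not DOTPLOT's own code: the function name dotplot, the MAXLEN limit and
the choice to mark every cell of a window that passes are all assumptions,
picked so that the sketch reproduces the two small plots above. A hit here
is a plain identity; a score matrix would replace the == test.

```c
#include <assert.h>
#include <string.h>

#define MAXLEN 64   /* hypothetical upper bound on sequence length */

/* Slide a window of `window` residues down every diagonal. Whenever a
   window contains at least `stringency` identities, mark all of its
   cells in plot[vertical][horizontal]. Returns the number of dots. */
static int dotplot(const char *hseq, const char *vseq,
                   int window, int stringency,
                   char plot[MAXLEN][MAXLEN])
{
    int hlen = (int)strlen(hseq), vlen = (int)strlen(vseq);
    int v, h, t, hits, dots = 0;

    assert(hlen < MAXLEN && vlen < MAXLEN && window > 0);
    memset(plot, 0, MAXLEN * MAXLEN);

    for (v = 0; v + window <= vlen; v++) {
        for (h = 0; h + window <= hlen; h++) {
            hits = 0;
            for (t = 0; t < window; t++)
                if (vseq[v + t] == hseq[h + t])
                    hits++;
            if (hits >= stringency)            /* window passes */
                for (t = 0; t < window; t++)
                    plot[v + t][h + t] = 1;
        }
    }

    for (v = 0; v < vlen; v++)                 /* count the dots */
        for (h = 0; h < hlen; h++)
            dots += plot[v][h];
    return dots;
}
```

Called with "CAPTAINKIRK" as the horizontal sequence and "CAPTAINNEMO" as
the vertical one, a window of 1 and a stringency of 1 give 11 dots, and a
window of 2 with a stringency of 2 leaves only the 7 dots of CAPTAIN.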
Raison D'être
~~~~~~~~~~~~~
This program fills a niche in the PC molecular biology freeware/shareware
world. I decided to write it because dotplots are easily implemented on a
PC, not being too CPU intensive (unless the sequences to be compared are
large) and being fun to play around with if made interactive. The
Windows version was fairly easy to write because Windows lends itself to
graphics-oriented programs. The operating system does a lot of the work
for you, such as the mouse movement and the menus. There's just a lot
more setting up to do than there is for MSDOS. I didn't believe it when
I saw the Windows "Hello, World" program in my Microsoft C manual.
Hundreds of lines of code compared with two or three for MSDOS! One
major drawback of this version is that it does not allow you to point to
the dotplot and see what bits of sequence you are actually looking at.
I'll fix this as soon as my girlfriend gives me another weekend to myself!
The program owes a great deal to Don Gilbert's amazingly good sequence
reading/writing module UREADSEQ.C available from his equally amazing molecular
biology server at Indiana. This module allows DOTPLOT to read the following
formats:
     1. IG/Stanford          8. Pearson/Fasta
     2. GenBank/GB           9. Zuker
     3. NBRF/PIR            10. Olsen
     4. EMBL                11. Phylip3.4/Phylip
     5. GCG                 12. Phylip3.3/Interleaved
     6. DNAStrider          13. Plain/Raw
     7. Fitch
The Section for Computer Bullies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I made the calculation of scores even faster by calculating scores
diagonal-by-diagonal rather than point-by-point. I created a local
variable in which the running score for each diagonal was stored and
then for each point on the diagonal added the new score from just beyond
the window and subtracted the score from just before the window. In C,
this was as follows:

    sum -= aa_sim( sequence1[i+k-window], sequence2[k], score_table );
    sum += aa_sim( sequence1[i+k+window+1], sequence2[k+win2+1], score_table );

This was a much more efficient way of doing the window averaging.
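The running-sum trick can be shown in a self-contained form. The sketch
below is not the DOTPLOT source: pair_score and diagonal_sums are
hypothetical names, the score is a plain identity count rather than a
score-table lookup, and the window is taken to start at each point rather
than being centred on it. The point is only that each new window total
costs one subtraction and one addition, however wide the window is.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical pair score: 1 for an identity, 0 otherwise. */
static int pair_score(char a, char b)
{
    return a == b ? 1 : 0;
}

/* Walk the diagonal (i + k, k), k = 0, 1, 2, ... and write the total
   score of each window of `window` consecutive pairs into out[].
   Returns the number of windows written. */
static int diagonal_sums(const char *seq1, const char *seq2,
                         int i, int window, int *out)
{
    int len1 = (int)strlen(seq1), len2 = (int)strlen(seq2);
    int n = (len1 - i < len2) ? len1 - i : len2;   /* diagonal length */
    int k, sum = 0;

    assert(i >= 0 && window > 0);
    if (n < window)
        return 0;

    for (k = 0; k < window; k++)        /* first window: add it up fully */
        sum += pair_score(seq1[i + k], seq2[k]);
    out[0] = sum;

    for (k = 1; k + window <= n; k++) {
        /* drop the pair that just left the window ... */
        sum -= pair_score(seq1[i + k - 1], seq2[k - 1]);
        /* ... and add the pair that just entered it */
        sum += pair_score(seq1[i + k + window - 1], seq2[k + window - 1]);
        out[k] = sum;
    }
    return n - window + 1;
}
```

For the main diagonal (i = 0) of CAPTAINKIRK against CAPTAINNEMO with a
window of 2, the totals are 2 2 2 2 2 2 1 0 0 0.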
You might think that modifying UREADSEQ for Windows was easy. WRONG!
The major difference between programming for Windows and for an ANSI
version of C is the memory management. Instead of using malloc/calloc,
Windows needs handles to memory. Unfortunately the readseq function
allocates memory itself and returns a pointer to char. I had to change
the readseq function to return a handle to the memory block containing
the sequence. To get a pointer to the sequence you just lock the block
with LocalLock, and unlock it after you've finished with it. That way
Windows can shift the block about as it sees fit. I increased
kStartLength to 10000 so that readseq would never use realloc, which
would cause problems. It's a bit of a bodge, but it seems to work fine.
You can play around with the PAM matrix if you like. The format is
identical to the MSDOS version of dotplot. By default it looks like
X=0
C 12
S  0  2
T -2  1  3
P -3  1  0  6
A -2  1  1  1  2
G -3  1  0 -1  1  5
N -4  1  0 -1  0  0  2
D -5  0  0 -1  0  1  2  4
E -5  0  0 -1  0  0  1  3  4
Q -5 -1 -1  0  0 -1  1  2  2  4
H -3 -1 -1  0 -1 -2  2  1  1  3  6
R -4  0 -1  0 -2 -3  0 -1 -1  1  2  6
K -5  0  0 -1 -1 -2  1  0  0  1  0  3  5
M -5 -2 -1 -2 -1 -3 -2 -3 -2 -1 -2  0  0  6
I -2 -1  0 -2 -1 -3 -2 -2 -2 -2 -2 -2 -2  2  5
L -6 -3 -2 -3 -2 -4 -3 -4 -3 -2 -2 -3 -3  4  2  6
V -2 -1  0 -1  0 -1 -2 -2 -2 -2 -2 -2 -2  2  4  2  4
F -4 -3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5  0  1  2 -1  9
W  0 -3 -3 -5 -3 -5 -2 -4 -4 -4  0 -4 -4 -2 -1 -1 -2  7 10
Y -8 -2 -5 -6 -6 -7 -4 -7 -7 -5 -3  2 -3 -4 -5 -2 -6  0  0 17
   C  S  T  P  A  G  N  D  E  Q  H  R  K  M  I  L  V  F  W  Y
This is just a score matrix. For example, if the pair of amino acids to
be compared are leucine and arginine, the score matrix above gives a score
of -3. An identity generally gives a large positive score (tyrosine-tyrosine
gives a score of 17) with the largest scores for the rare amino acids. The
matrix is not calculated according to physico-chemical properties of amino
acids; it is statistically derived from comparison of many related proteins.
Some people claim that the particular score matrix used makes a great deal of
difference in the database searches and alignments, but don't take their word
for it; you should play around with it yourself. If the score matrix makes
that much difference then maybe your sequence similarity is just a figment
of your crazed imagination...
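Looking up a score in a triangular matrix like this one takes only a few
lines of C. The sketch below is an illustration and not the program's
matrix reader: the names residues, pam and pam_score are made up, the
values are copied from the listing above, and reading the first line
(X=0) as the score for an unrecognised residue is a guess.

```c
#include <string.h>

/* Row and column order of the listing above. */
static const char residues[] = "CSTPAGNDEQHRKMILVFWY";

#define UNKNOWN_SCORE 0   /* assumption: the meaning of "X=0" */

/* Lower triangle as listed; the unused upper triangle stays zero. */
static const int pam[20][20] = {
    { 12 },
    {  0,  2 },
    { -2,  1,  3 },
    { -3,  1,  0,  6 },
    { -2,  1,  1,  1,  2 },
    { -3,  1,  0, -1,  1,  5 },
    { -4,  1,  0, -1,  0,  0,  2 },
    { -5,  0,  0, -1,  0,  1,  2,  4 },
    { -5,  0,  0, -1,  0,  0,  1,  3,  4 },
    { -5, -1, -1,  0,  0, -1,  1,  2,  2,  4 },
    { -3, -1, -1,  0, -1, -2,  2,  1,  1,  3,  6 },
    { -4,  0, -1,  0, -2, -3,  0, -1, -1,  1,  2,  6 },
    { -5,  0,  0, -1, -1, -2,  1,  0,  0,  1,  0,  3,  5 },
    { -5, -2, -1, -2, -1, -3, -2, -3, -2, -1, -2,  0,  0,  6 },
    { -2, -1,  0, -2, -1, -3, -2, -2, -2, -2, -2, -2, -2,  2,  5 },
    { -6, -3, -2, -3, -2, -4, -3, -4, -3, -2, -2, -3, -3,  4,  2,  6 },
    { -2, -1,  0, -1,  0, -1, -2, -2, -2, -2, -2, -2, -2,  2,  4,  2,  4 },
    { -4, -3, -3, -5, -4, -5, -4, -6, -5, -5, -2, -4, -5,  0,  1,  2, -1,  9 },
    {  0, -3, -3, -5, -3, -5, -2, -4, -4, -4,  0, -4, -4, -2, -1, -1, -2,  7, 10 },
    { -8, -2, -5, -6, -6, -7, -4, -7, -7, -5, -3,  2, -3, -4, -5, -2, -6,  0,  0, 17 }
};

/* Position of a one-letter code in `residues`, or -1 if absent. */
static int residue_index(char c)
{
    const char *p = (c != '\0') ? strchr(residues, c) : NULL;
    return p ? (int)(p - residues) : -1;
}

/* Score for a pair of residues; the matrix is symmetric, so the
   larger index always selects the row. */
static int pam_score(char a, char b)
{
    int i = residue_index(a), j = residue_index(b);
    if (i < 0 || j < 0)
        return UNKNOWN_SCORE;
    return (i >= j) ? pam[i][j] : pam[j][i];
}
```

With this table pam_score('L', 'R') and pam_score('R', 'L') both give -3,
and pam_score('Y', 'Y') gives 17, matching the examples in the text.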
Anyway, you can edit the PAM250.MAT file. Just bear these things in mind:
  * Don't interchange columns and rows. The letters are there for your
    convenience, so that editing the matrix is easy. The program always
    reads the matrix in the same way regardless of the letters.
  * Use integers, preferably in the same range as the above matrix,
    i.e. -8 to +17.
  * Don't forget to back up the original PAM250.MAT, or you could get into
    a pickle!
I've included a CHARGE.MAT, DNA53.MAT and DNA35.MAT. The CHARGE.MAT
file scores identities as +5 if both amino acids are charged (D, E, R,
K). Non-charged residues are scored as 0. Opposite charges are scored
as -3 and identical charges with non-identical residues are scored as
+3. The DNA53.MAT scores +5 for an identity and -3 for different
nucleotides. You can probably guess what the DNA35.MAT does!
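The rules behind these three matrices are simple enough to write out as
functions. This is a sketch of the rules as described above, not a dump
of the .MAT files: the function names are invented, and treating any pair
that involves an uncharged residue as 0 is an interpretation of
"Non-charged residues are scored as 0".

```c
/* Sign of the side-chain charge for the residues the text lists as
   charged: D and E negative, R and K positive, everything else 0. */
static int charge_of(char aa)
{
    switch (aa) {
    case 'D': case 'E': return -1;
    case 'R': case 'K': return +1;
    default:            return 0;
    }
}

/* CHARGE.MAT-style score for a pair of amino acids. */
static int charge_score(char a, char b)
{
    int ca = charge_of(a), cb = charge_of(b);

    if (ca == 0 || cb == 0)
        return 0;                /* assumption: any uncharged partner */
    if (a == b)
        return 5;                /* identical charged residues */
    return (ca == cb) ? 3 : -3;  /* same charge : opposite charge */
}

/* DNA-style score: `match` for an identity, `mismatch` otherwise.
   DNA53.MAT corresponds to (5, -3) and DNA35.MAT to (3, -5). */
static int dna_score(char a, char b, int match, int mismatch)
{
    return (a == b) ? match : mismatch;
}
```

So charge_score('D', 'E') is +3, charge_score('D', 'K') is -3, and
dna_score('A', 'G', 3, -5) is -5.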
Grovelling Credits Section
~~~~~~~~~~~~~~~~~~~~~~~~~~
I think Don Gilbert is a marvellous man. UREADSEQ is FAB.
In case you ever read this, Don, next time you're in London drop in to
Imperial and I'll buy you a pint of Old Rosie at the Phoenix and Firkin.
* Copyright 1990 by d.g.gilbert
* biology dept., indiana university, bloomington, in 47405
* e-mail: gilbertd@bio.indiana.edu
*
* This program may be freely copied and used by anyone.
* Developers are encouraged to incorporate parts in their
* programs, rather than devise their own private sequence
* format.
*
* This should compile and run with any ANSI C compiler.
* Please advise me of any bugs, additions or corrections.
Thanks also for feedback on Dotplot 2.0 and 3.0 from
* Finn Drablos in Norway, who suggested a change in the cursor.
* Finn Drablos AGAIN for pointing out a bug with self-comparisons which
revealed a particularly sinister bug in the score matrix reader function.
* Francis Durst in France, who pointed out the emulator bug and understood
that programming is heavily influenced by a girlfriend's trips.
Desperate Plea for Recognition
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If you enjoyed using dotplot, please DON'T SEND ME ANY MONEY!
I don't want money.
If I did I wouldn't have started a PhD.
I want PRAISE! RECOGNITION! FAME! PRAISE (again)!
So, cite dotplot as follows:
Nakisa, R.C. (1993). DotPlot, a program for graphical comparison of
nucleic acid and protein sequences for IBM PC. Published
electronically on the Internet and available by anonymous ftp from
ftp.bio.indiana.edu.
Substitute the name of the ftp server or mail server that you used for
ftp.bio.indiana.edu, unless you got the program from Uncle Don!
Please send your flattering minutiae, ego boosters, gripes and suggested
improvements by EMAIL to
ramin@ic.ac.uk ................ for Internet people
Alternatively, SNAILMAIL:
Ramin Nakisa,
Biophysics Section,
The Blackett Laboratory,
Imperial College of Science, Technology and Medicine,
Prince Consort Road,
London SW7 2BZ
Great Britain.
Tel: 071 589-5111 x 6729     FAX: 071 589-0191