ChangeLog - FASTA v34

$Name:  $ - $Id: changes_v34.html,v 1.2 2007/08/09 20:34:08 wrp Exp $

May 28, 2007

Small modification for GCG ASCII (libtype=5) header line.

October 6, 2006 CVS fa34t26b3

New Windows programs available using Intel C++ compiler. First threaded programs for Windows; first SSE2 acceleration of SSEARCH for Windows.

July 18, 2006 CVS fa34t26b2

More powerful environment variable substitutions for FASTLIBS files. The library file name parsing programs now provide the option for environment variable substitutions. For example, if SLIB2=/slib2 is set as an environment variable (e.g. export SLIB2=/slib2 for ksh and bash), then
fasta34 -q query.aa '${SLIB2}/swissprot.fa'  expands as expected.
While this is not important for command lines, where the Unix shell would expand things anyway, it is very helpful for various configuration files, such as files of file names, where:
<${SLIB2}/blast
swissprot.fa
now expands properly, and in FASTLIBS files the line:
NCBI/Blast Swissprot$0S${SLIB2}/blast/swissprot.fa
expands properly. Currently, environment variable expansion only takes place for library file names and for the <directory line in a file of file names.
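The substitution described above can be sketched in a few lines of Python. This is only an illustration, not the FASTA C code: the function name expand_env_refs is hypothetical, and two behaviors are assumptions, namely that only the braced ${NAME} form is expanded (so the "$0S" field in a FASTLIBS line is left alone) and that unset variables are left unchanged.

```python
import re

# Matches only the braced form ${NAME}. A bare "$0S" field in a FASTLIBS
# line does not match and is therefore left untouched (assumption).
_ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env_refs(text, env):
    """Replace each ${NAME} in text with env[NAME]; unknown names stay as-is."""
    def substitute(match):
        name = match.group(1)
        # Leaving an unset variable unexpanded is an assumption of this sketch.
        return env.get(name, match.group(0))
    return _ENV_REF.sub(substitute, text)


env = {"SLIB2": "/slib2"}  # stands in for os.environ
print(expand_env_refs("${SLIB2}/swissprot.fa", env))
print(expand_env_refs("<${SLIB2}/blast", env))
print(expand_env_refs("NCBI/Blast Swissprot$0S${SLIB2}/blast/swissprot.fa", env))
```

Passing the environment as a dictionary keeps the sketch easy to try without changing the real shell environment; os.environ can be passed in its place.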

July 2, 2006 fa34t26b0

This release provides an extremely efficient implementation of the Smith-Waterman algorithm for the SSE2 vector instructions, written by Michael Farrar (farrar.michael@gmail.com). The SSE2 code speeds up Smith-Waterman 8 - 10-fold in my tests, making it comparable to Erik Lindahl's Altivec code for the Apple/IBM G4/G5 architecture.

May 24, 2006 fa34t25d8

Support for ASN.1 PSSM:2 files provided by the NCBI PSI-BLAST WWW site is now included. This code will not work with iteration 0 PSSMs (which have no PSSM information). For ASN.1 PSSMs, which provide the matrix name (and in some cases the gap penalties), the scoring matrix and gap penalties are set appropriately if they were not specified on the command line. ASN.1 PSSMs are type 2:
ssearch34 -P "pssm.asn1 2" .....

May 18, 2006

Support for NCBI Blast formatdb databases has been expanded. The FASTA programs can now read some NCBI *.pal and *.nal files, which are used to specify subsets of databases. Specifically, the swissprot.00.pal and pdbaa.00.pal files are supported. FASTA supports files that refer to *.msk files (i.e. swissprot.00.pal refers to swissprot.00.msk); it does not currently support .pal files that simply list other .pal or database files (e.g. FASTA does not support nr.pal or swissprot.pal).

Nov 20, 2005

Changes to support asymmetric matrices - a scoring matrix read in from a file can be asymmetric. Default matrices are all symmetric.

Sept 2, 2005

The prss34 program has been modified to use the same display routines as the other search programs. To be more consistent with the other programs, the old "-w shuffle-window-size" is now "-v window-size". With the "-A" option, prss34/prfx34 will also show the optimal alignment for which the significance is calculated. Since the new program reports results exactly like the other fasta/ssearch/fastxy34 programs, parsing for statistical significance is considerably different. The old-format program can be made using "make prss34o".

May 5, 2005 CVS fa34t25d1

Modification to the -x option, so that both an "X:X" match score and an "X:not-X" mismatch score can be specified. (This score is also used to give a positive score to a "*:*" match, the end of a reading frame, while giving a negative score to "*:not-*".)

Jan 24, 2005

Include a new program, "print_pssm", which reads a blastpgp binary checkpoint file and writes out the frequency values as text. These values can be used with a new option for ssearch34(_t) and prss34, which provides the ability to read a text PSSM file. To specify a text PSSM, use the option -P "query.ckpt 1", where the "1" indicates a text, rather than a binary, checkpoint file. "initfa.c" has also been modified to work with PSSM files with zeros in the frequency table. Presumably these positions (at the ends) do not provide information. (Jan 26, 2005) blastpgp actually uses BLOSUM62 values when zero frequencies are provided, so read_pssm() has been modified to use scoring matrix values for zero frequencies as well.

Nov 4-8, 2004

Incorporation of Erik Lindahl's "anti-diagonal" Altivec code, for Smith-Waterman only. Altivec SSEARCH is now faster than FASTA for

Aug 25,26, 2004 CVS fa34t24b3

Small change in output format for p34comp* programs in ">>>query_file#1 string" line before alignments. This line is not present in the non-parallel versions - it would be better for them to be consistent.

Dec 10, 2003 CVS fa34t23b3

The default ktup now drops for short sequences. For protein sequences shorter than 50 residues, ktup=1; for DNA sequences shorter than 20, 50, and 100 nucleotides, ktup = 1, 2, and 3, respectively.
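The length thresholds can be read as a small decision function. The sketch below is a hypothetical illustration rather than the FASTA source. The thresholds are the ones listed in this entry; the fallback values for longer sequences (ktup=2 for protein, ktup=6 for DNA) are not stated here and are assumptions.

```python
# Assumed defaults for sequences above the thresholds; this entry does not
# state them, so treat these two constants as placeholders.
ASSUMED_PROTEIN_KTUP = 2
ASSUMED_DNA_KTUP = 6


def default_ktup(seq_len, is_dna):
    """Return the default ktup for a sequence of the given length."""
    if not is_dna:
        # Protein: ktup=1 for sequences shorter than 50 residues.
        return 1 if seq_len < 50 else ASSUMED_PROTEIN_KTUP
    # DNA: ktup = 1, 2, 3 for lengths below 20, 50, 100 respectively.
    if seq_len < 20:
        return 1
    if seq_len < 50:
        return 2
    if seq_len < 100:
        return 3
    return ASSUMED_DNA_KTUP


for length in (15, 40, 80, 500):
    print(length, default_ktup(length, is_dna=True))
```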

Dec 7, 2003

A new option, "-U" is available for RNA sequence comparison. "-U" functions like "-n", indicating that the query is an RNA sequence. In addition, to account for "G:U" base pairs, "-U" modifies the scoring matrices so that a "G:A" match has the same score as a "G:G" match, and "T:C" match has the same score as a "T:T" match.
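As an illustration of the matrix change, the sketch below builds a toy nucleotide matrix and applies the two adjustments named above. The +5/-4 match/mismatch values, the function names, and the (first, second) key order are assumptions made for the example. Whether the reverse pairs ("A:G", "C:T") are also changed is not stated in this entry, so the sketch leaves them alone.

```python
def simple_dna_matrix(match=5, mismatch=-4):
    """Toy nucleotide matrix keyed by (first, second) base; values are assumed."""
    bases = "ACGT"
    return {(a, b): (match if a == b else mismatch) for a in bases for b in bases}


def apply_rna_pairing(matrix):
    """Return a copy where G:A scores like G:G and T:C scores like T:T."""
    adjusted = dict(matrix)
    adjusted[("G", "A")] = matrix[("G", "G")]
    adjusted[("T", "C")] = matrix[("T", "T")]
    return adjusted


rna_matrix = apply_rna_pairing(simple_dna_matrix())
print(rna_matrix[("G", "A")], rna_matrix[("T", "C")], rna_matrix[("A", "G")])
```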

Nov 2, 2003

Support for more sophisticated display options. Previously, one could have only one "-m #" option, even though several of the options were orthogonal (-m 9c is independent of -m 1 and -m 2, which are independent of -m 6 (HTML)). In particular, -m 9c can be combined with -m 6, which can be very helpful for runs that need HTML output but can also exploit the encoding provided by -m 9c. The "-m 9" option now also allows "-m 9i", which shows the standard best score information, plus percent identity and alignment length.

Sept 25, 2003

A new option is available for annotating alignments. -V '@#?!' can be used to annotate sites in a sequence, e.g.:
>GTM1_HUMAN ...
PMILGYWDIRGLAHAIRLLLEYTDS@S?YEEKKYT@MG
DAPDYDRS@QWLNEKFKLGLDFPNLPYLIDGAHKIT
might mark known and expected (S,T) phosphorylation sites. These symbols are then displayed on the query coordinate line:
               10        20    @?  30  @     40  @     50        60
GTM1_H PMILGYWDIRGLAHAIRLLLEYTDSSYEEKKYTMGDAPDYDRSQWLNEKFKLGLDFPNLP
       ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
gtm1_h PMILGYWDIRGLAHAIRLLLEYTDSSYEEKKYTMGDAPDYDRSQWLNEKFKLGLDFPNLP
               10        20        30        40        50        60
This annotation is mostly designed to display post-translational modifications detected by MassSpec with FASTS, but is also available with FASTA and SSEARCH.
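To make the relationship between the annotated sequence and the coordinate line concrete, here is a minimal Python sketch that separates the annotation symbols from the residues. It is not the FASTA implementation; the function name split_annotations is hypothetical, and it assumes (as the example suggests) that each symbol marks the residue immediately before it.

```python
def split_annotations(annotated, symbols="@#?!"):
    """Return (plain_sequence, {1-based position: symbol}) for an annotated sequence."""
    residues = []
    marks = {}
    for ch in annotated:
        if ch in symbols:
            # Assumption: a symbol annotates the residue just before it.
            if residues:
                marks[len(residues)] = ch
        else:
            residues.append(ch)
    return "".join(residues), marks


annotated = ("PMILGYWDIRGLAHAIRLLLEYTDS@S?YEEKKYT@MG"
             "DAPDYDRS@QWLNEKFKLGLDFPNLPYLIDGAHKIT")
sequence, marks = split_annotations(annotated)
print(sequence[:60])
print(marks)  # {25: '@', 26: '?', 33: '@', 43: '@'}
```

The positions 25, 26, 33, and 43 are the columns where the symbols appear on the query coordinate line in the alignment above.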

June 16, 2003 version: fasta34t22

ssearch34 now supports PSI-BLAST PSSM/profiles. Currently, it only supports the "checkpoint" file produced by blastpgp, and only on certain architectures where byte-reordering is unnecessary. It has not been tested extensively with the -S option.
ssearch34 -P blast.ckpt -f -11 -g -1 -s BL62 query.aa library
will use the frequency information in the blast.ckpt file to do a position specific scoring matrix (PSSM) search using the Smith-Waterman algorithm. Because ssearch34 calculates scores for each of the sequences in the database, we anticipate that PSSM ssearch34 statistics will be more reliable than PSI-Blast statistics. The Blast checkpoint file is mostly double precision frequency numbers, which are represented in a machine specific way. Thus, you must generate the checkpoint file on the same machine on which you run ssearch34 or prss34 -P query.ckpt. To generate a checkpoint file, run:
blastpgp -j 2 -h 1e-6 -i query.fa -d swissprot -C query.ckpt -o /dev/null
(This searches swissprot for 2 iterations ("-j 2") using an E() threshold of 1e-6 ("-h 1e-6"), saving the resulting position specific frequencies in query.ckpt. Note that the original query.fa and query.ckpt must match.)

Apr 11, 2003 CVS fa34t21b3

Fixes for "-E" and "-F" with ssearch34, which were inadvertently disabled.

A new option, "-t t", is available to specify that all the protein sequences have implicit termination codons "*" at the end. Thus, all protein sequences are one residue longer, and full length matches are extended one extra residue and get a higher score. For fastx34/tfastx34, this helps extend alignments to the very end in cases where there may be a mismatch at the C-terminal residues.

-m 9c has also been modified to indicate locations of termination codons ( *1).

Mar 17, 2003 CVS fa34t21b2

A new option on scoring matrices "-MS" (e.g. "BL50-MS") can be used to turn the I/L, K/Q identities on or off. Thus, to make "fastm34" use the isobaric identities, use "-s M20-MS". To turn them off for "fasts34", use "-s M20".

Jan 25, 2003

Add option "-J start:stop" to pv34comp*/mp34comp*. "-J x" used to allow one to start at query sequence "x"; now both start and stop can be specified.

Nov 14-22, 2002 CVS fa34t20b6

Include compile-time define (-DPGM_DOC) that causes all the fasta programs to provide the same command line echo that is provided by the PVM and MPI parallel programs. Thus, if you run the program:
fasta34_t -q -S gtt1_drome.aa /slib/swissprot 12
the first lines of output from FASTA will be:
# fasta34_t -q gtt1_drome.aa /slib/swissprot
 FASTA searches a protein or DNA sequence data bank
  version 3.4t20 Nov 10, 2002
 Please cite:
   W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448
This has been turned on by default in most FASTA Makefiles.

Aug 27, 2002

Modifications to mshowbest.c and drop*.c (and p2_workcomp.c, compacc.c, doinit.c, etc.) to provide more information about the alignment with the -m 9 option. There is now a "-m 9c" option, which displays an encoded alignment after the -m 9 alignment information. The encoding is a string of the form: "=#mat+#ins=#mat-#del=#mat". Thus, an alignment over 218 amino acids with no gaps (not necessarily 100% identical) would be =218. The alignment:
       10        20        30        40        50          60         70  
GT8.7  NVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLNEKFKL--GLDFPNLPYL-IDGSHKITQ
       :.::  . :: ::  .   .:::         : .:    ::.:   .: : ..:.. :::  :..:
XURTG  NARGRMECIRWLLAAAGVEFDEK---------FIQSPEDLEKLKKDGNLMFDQVPMVEIDG-MKLAQ
               20        30                 40        50        60        
would be encoded: "=23+9=13-2=10-1=3+1=5". The alignment encoding is with respect to the beginning of the alignment, not the beginning of either sequence. The beginning of the alignment in either sequence is given by the an0/an1 values. This capability is particularly useful for [t]fast[xy], where it can be used to indicate frameshift positions "/#\#" compactly. If "-m 9c" is used, the "The best scores" title line includes "aln_code".
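The encoding can be reproduced from the two aligned rows with a short run-length pass. The sketch below is an illustration only, not the code in mshowbest.c. The function name encode_alignment is hypothetical, and the reading of "+" as query residues opposite a gap in the library sequence (and "-" as the reverse) is inferred from the example above.

```python
def encode_alignment(query_row, library_row, gap="-"):
    """Run-length encode an alignment as '=#mat+#ins=#mat-#del=#mat'."""
    if len(query_row) != len(library_row):
        raise ValueError("aligned rows must have equal length")
    runs = []  # each entry is [operation, count]
    for q, l in zip(query_row, library_row):
        if q == gap and l == gap:
            continue  # a column with two gaps carries no information
        if l == gap:
            op = "+"  # query residue opposite a library gap
        elif q == gap:
            op = "-"  # library residue opposite a query gap
        else:
            op = "="  # aligned pair, identical or not
        if runs and runs[-1][0] == op:
            runs[-1][1] += 1
        else:
            runs.append([op, 1])
    return "".join(f"{op}{count}" for op, count in runs)


query_row = "NVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLNEKFKL--GLDFPNLPYL-IDGSHKITQ"
library_row = "NARGRMECIRWLLAAAGVEFDEK---------FIQSPEDLEKLKKDGNLMFDQVPMVEIDG-MKLAQ"
print(encode_alignment(query_row, library_row))  # =23+9=13-2=10-1=3+1=5
```

The sketch covers only the "=", "+", and "-" operations; the frameshift symbols used by [t]fast[xy] are not handled.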

Aug 14, 2002 CVS tag fa34t20

Changes to nmgetlib.c to allow multiple query searches coming from STDIN, either through pipes or input redirection. Thus, the command
cat prot_test.lseg | fasta34 -q -S @ /seqlib/swissprot
produces 11 searches. If you use the multiple query functions, the query subset applies only to the first sequence. Unfortunately, it is not possible to search against a STDIN library: the FASTA programs do not keep the entire library in memory and need to re-read high-scoring library sequences, and it is not possible to fseek() against STDIN.

Aug 5, 2002

fasts34(_t) and fastm34(_t) have been modified to allow searches with DNA sequences. This gives a new capability to search for DNA motifs, or to search for ordered or unordered DNA sequences spaced at arbitrary distances.

June 25, 2002

Modify the statistical estimation strategy to sample all the sequences in the database, not just the first 60,000. The histogram is still based only on the first 60,000 scores and lengths, though all scores and lengths are shown. The fit to the data may be better than the histogram indicates, but it should not be worse.

June 19, 2002

Added "-C #" option, where 6 <= # <= MAX_UID (20), to specify the length of the sequence name display on the alignment labels. Until now, only 6 characters were ever displayed. Now, up to MAX_UID characters are available.

Mar 16, 2002

Added create_seq_demo.sql, nt_to_sql.pl to show how to build an SQL protein sequence database that can be used with the mySQL versions of the fasta34 programs. Once the mySQL seq_demo database has been installed, it can be searched using the command:
fasta34 -q mgstm1.aa "seq_demo.sql 16"
mysql_lib.c has been modified to remove the restriction that mySQL protein sequence unique identifiers be integers. This allows the program to be used with the PIRPSD database. The RANLIB() function call has been changed to include "libstr", to support SQL text keys. Due to the size of libstr[], unique ID's must be < MAX_UID (20) characters.

A "pirpsd.sql" file is available for searching the mySQL distribution of the PIRPSD database. PIRPSD is available from ftp://nbrfa.georgetown.edu/pir_databases/psd/mysql.